High Performance Computer Architecture


Post on 09-Mar-2020


High Performance Computer Architecture

Die photos (scale bar: 10 mm). Each entry lists product (date): core (process), die area / transistor count in millions (Mtr).

• Intel-Pentium-4 (11/2000): Willamette Core (0.18 µm), 217 mm² / 42 Mtr
• Intel-Pentium-4 (01/2002): Northwood Core (0.13 µm), 145 mm² / 55 Mtr
• Intel-Pentium-M (05/2004): Dothan Core (0.09 µm), 84 mm² / 140 Mtr
• Intel-Core2-Duo (07/2006): Conroe Core (0.065 µm), 143 mm² / 291 Mtr
• Intel-Core2-Duo (01/2008): Penryn Core (0.045 µm), 107 mm² / 410 Mtr
• Intel-Core-i7 (11/2008): Bloomfield Core (0.045 µm), 263 mm² / 731 Mtr
• NVIDIA GF100 (09/2009): Fermi 512G (0.040 µm), 467 mm² / 3000 Mtr
• Intel-Core-i7-2920XM (01/2011): Sandy Bridge (0.032 µm), 216 mm² / 995 Mtr
• AMD A8-3850 (06/2011): Llano 4C/400G (0.032 µm), 228 mm² / 1450 Mtr
• IBM (08/2011): BlueGene/Q, 18 cores (0.045 µm), 360 mm² / 1470 Mtr

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ01 (slide 1 of 56)

Die photos, continued (scale bar: 20 mm). Each entry lists product (date): core (process), die area / transistor count in millions (Mtr).

• IBM Power4 (12/2001): First Commercial Dual-Core Chip (0.18 µm), 412 mm² / 174 Mtr
• Intel-Core-i7-3770 (04/2012): Ivy Bridge (0.022 µm), tri-gate, 160 mm² / 1400 Mtr
• Apple-iPad Air2 (11/2014): A8X, 11 Cores (0.020 µm), 128 mm² / 3000 Mtr
• IBM Power8 (12/2014): 12 Cores, 8 SMT (0.022 µm), 362 mm² / 2100 Mtr
• Xeon E5-2600 v3 (06/2015, 7000$): Haswell-EP, 18 Cores, 2 SMT (0.022 µm), 661 mm² / 5560 Mtr
• Xeon Phi (06/2015, estimated): Knights Landing, 72 Cores, 4 SMT (0.014 µm), 413 mm² / 7100 Mtr
• AMD Radeon R9 Fury-X (06/2015): Fiji XT, 2048 GPU-Cores (0.028 µm), 567 mm² / 8900 Mtr


Where are High Performance Computers?

• Among users: where you need to make this happen: “I have a limited battery and need to… take a picture, share it with my friends, ...”
• In the Internet infrastructure: where you need to connect anybody with anything
• In the datacenters: where you need to store and retrieve YOUR data
• In electronic devices: every electronic device has a computer inside
• In cars: a car may have as many as 50+ computers (California approved a bill for autonomous vehicles)

Computer Architects

• Computer Architects UNDERSTAND and CAN BUILD the Computing Infrastructure… and almost ALL details of it! :-)

[Figures: HP Proliant DL 585 G7; AMD Opteron 6272; AMD Opteron 6200 CHIP; AMD Opteron 6200 ARCHITECTURE; AMD Opteron 6200 CORE (“Bulldozer”); AMD Opteron 6200 characteristics]

Objectives of this course

• This course constitutes a deeper study of current computers and aims to provide:

• Principles of high-performance microprocessors (superscalar, VLIW)

• An understanding of the basic mechanisms for the programming of applications that take advantage of the parallelism made available by the system

• Principles of Multi-Core / Multi-Processor Systems

• Tools for programming Parallel Machines

Course Administration

• Teacher: Roberto Giorgi ( [email protected] )
• Telephone: 0577-191-5182
• Office hours: Monday 16:30/19:00
• Slides: http://www.dii.unisi.it/~giorgi/teaching/hpca2
• Adopted Textbook:
  • M. Dubois, M. Annavaram, P. Stenstrom, "Parallel Computer Organization and Design", Cambridge University Press, 2012, ISBN: 978-0-521-88675-8
• Other Reference Textbooks
  • Hennessy and Patterson, "Computer Architecture: A Quantitative Approach", 5th Ed., Morgan Kaufmann, 2012, ISBN 978-0-12-383872-8
  • D. Culler, J.P. Singh, A. Gupta, "Parallel Computer Architecture: A Hw/Sw Approach", Morgan Kaufmann/Elsevier, 1998, ISBN 1558603433
  • M.J. Flynn, "Computer Architecture: Pipelined and Parallel Processor Design", Jones and Bartlett Publishers, Inc., 1995, ISBN 0867202041

Rules for exams, dates, slides, tools

• Check out the course website: http://www.dii.unisi.it/~giorgi/teaching/hpca2

Computer Architecture

“The term ARCHITECTURE is used here to describe the attributes of a system as seen by the programmer*, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flow and controls, the logical design, and the physical implementation”
-- Gene Amdahl, IBM Journal of R&D, Apr. 1964

* programmer == system programmer (OS engineer) or the compiler

Architecture: an overloaded term

• In the strict sense: the Hardware / Software interface
  • Instruction set
  • Memory management and protection
  • Interrupts and exceptions (traps)
  • Data formats (for example, IEEE 754 floating point)
• Organization: also called "Microarchitecture"
  • In this sense, it is "the implementation" of the architecture (this is the part that Gene Amdahl had excluded)
  • Specifies the functional units and their connections
  • Configuration of the pipeline
  • Position and configuration of cache memory
• As a discipline, "Computer Architecture" also includes the microarchitecture
  • To avoid confusion, when we mean the HW / SW interface we use "Instruction Set Architecture" (ISA)
• "COMPUTER ARCHITECTURE concerns the interface between what the technology provides and what the market demands" - Yale Patt, ISCA, Jun 2006

Levels of Computer Architecture

[Figure: layered diagram. Software layers: Application Programs, Libraries, Operating System (Drivers, Memory Manager, Scheduler). The ISA separates Software from Hardware. Hardware blocks: Execution Hardware, Memory Translation, System Interconnect (bus), Controllers, I/O devices and Networking, Main Memory. The numbers 1 to 14 in the figure mark the interfaces listed below.]

Interfaces:
• 1: User Interface
• 2: API
• 3,7: ABI
• 4,5,6: internal interface of the Operating System
• 7,8: ISA
• 9: Memory architecture
• 10: I/O architecture
• 11,12: RTL architecture
• 13,14: Bus architecture

API = Application Program Interface
ABI = Application Binary Interface
ISA = Instruction Set Architecture
RTL = Register Transfer Level

What technology provides: Moore's Law

• “The number of TRANSISTORS doubles every 18 months” (later revised to "24 months"); this is due to:
  - higher density (transistors / area)
  - availability of bigger chips

[Plot: transistor count (Mtr) over time, DATA FROM SLIDE 1]

Moore's Law is purely PSYCHOLOGICAL!
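As a rough check of the doubling period, two data points from slide 1 can be plugged into an exponential-growth model. This is only a sketch: the helper name `doubling_time_months` is hypothetical, and the choice of the two chips (Willamette, 42 Mtr, 11/2000, and Sandy Bridge, 995 Mtr, 01/2011) is arbitrary.

```python
import math

def doubling_time_months(count_start, count_end, months_elapsed):
    """Months needed to double, assuming purely exponential growth."""
    doublings = math.log2(count_end / count_start)
    return months_elapsed / doublings

# Data points taken from slide 1 (transistor counts in millions).
# Willamette: 42 Mtr in 11/2000; Sandy Bridge: 995 Mtr in 01/2011.
months = (2011 - 2000) * 12 + (1 - 11)   # 122 months between the two dates
period = doubling_time_months(42, 995, months)
print(f"Estimated doubling period: {period:.1f} months")
```

With these two points the estimate falls between 26 and 28 months, which is closer to the revised "24 months" figure than to the original "18 months".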

What the market demands: Applications

• Application Trend
  • FROM numerical, scientific TO commercial, entertainment
  • FROM few "big" TO ubiquitous, "small"
    - mainframes → minis → microprocessors → handheld, embedded
  • FROM little TO big memory storage (primary and secondary)
  • FROM single-thread TO multiple-threads
  • FROM standalone TO networked (cloud computing)
  • FROM character-oriented TO multimedia (graphics and sound)
  • FROM personal data TO “BIG DATA”

Main Applications

• Numerical/Scientific
  • Computational Fluid Dynamics, Weather Prediction, ECAD
  • Long word length, floating point arithmetic
• Commercial
  • inventory control, billing, payroll, decision support
  • byte oriented, fixed point, high I/O, large secondary storage
• Real-Time/Embedded
  • control, some communications
  • predictable performance
  • interrupt architecture important, low power, cost critical
• Home Computing
  • multimedia, entertainment
  • high bandwidth data movement, graphics
  • cryptography, compression/decompression

App. Trends: Multimedia, Networked, Web-servers

• A large choice of multimedia devices with
  • Graphic displays (LCD, etc.)
  • High Definition Audio
  • Large capacity of secondary storage for images, sound, etc.
• Services via the Web and high-performance networks require
  • Many independent threads
  • Wide band communication

MICROPROCESSOR ARCHITECTURE

• The increasing number of transistors (cheaper and faster) has fueled the demand for higher-performance CPUs
  • 1970s – Serial CPU, 1-bit for integers
  • 1980s – 32-bit RISC with a pipeline
    - The simplicity of the ISA allows the entire processor to be integrated on a single chip
  • 1990s – bigger CPUs, superscalar
    - Also for CISC
  • 2000s – Multiprocessors on a chip...

Course Structure

1. High Performance Pipelining
2. Branch Prediction
3. Superscalar processor
4. Media Processing: VLIW processors
5. Multiprocessors and related problems
6. TLP: Thread Level Parallelism
7. Evaluation of High Performance Architectures
8. Tools for programming parallel machines (Cilk, OpenMP, MPI, CUDA, ...)


EVALUATING COMPUTERS


POWER

• TOTAL POWER: DYNAMIC + STATIC (LEAKAGE)
  $P_{dynamic} = \alpha C V^2 f$
  $P_{static} = V I_{sub} \approx V e^{-K V_t / T}$
• DYNAMIC POWER FAVORS PARALLEL PROCESSING OVER HIGHER CLOCK RATE
  • DYNAMIC POWER ROUGHLY PROPORTIONAL TO $f^3$
  • TAKE A CORE AND REPLICATE IT 4 TIMES: 4X SPEEDUP & 4X POWER
  • TAKE A CORE AND CLOCK IT 4 TIMES FASTER: 4X SPEEDUP BUT 64X DYNAMIC POWER!
• STATIC POWER
  • BECAUSE CIRCUITS LEAK WHATEVER THE FREQUENCY IS
• POWER/ENERGY ARE CRITICAL PROBLEMS
  • POWER (IMMEDIATE ENERGY DISSIPATION) MUST BE DISSIPATED
    • OTHERWISE TEMPERATURE GOES UP (AFFECTS PERFORMANCE, CORRECTNESS AND MAY POSSIBLY DESTROY THE CIRCUIT, SHORT TERM OR LONG TERM)
    • EFFECT ON THE SUPPLY OF POWER TO THE CHIP
  • ENERGY (DEPENDS ON POWER AND SPEED)
    • COSTLY; GLOBAL PROBLEM
    • BATTERY OPERATED DEVICES
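The 4X versus 64X comparison can be reproduced from the dynamic power formula if one assumes that the supply voltage has to scale linearly with the clock frequency (this is the assumption behind the cubic rule of thumb). The sketch below is illustrative; the function name `relative_dynamic_power` is hypothetical.

```python
def relative_dynamic_power(core_count=1, freq_scale=1.0):
    """Dynamic power relative to one baseline core.

    Assumes P = alpha * C * V^2 * f per core and that the supply voltage V
    must be scaled in proportion to f, which gives the f^3 rule of thumb.
    """
    voltage_scale = freq_scale            # assumption: V grows linearly with f
    per_core = voltage_scale ** 2 * freq_scale
    return core_count * per_core

four_cores = relative_dynamic_power(core_count=4)    # replicate the core 4 times
fast_core = relative_dynamic_power(freq_scale=4.0)   # clock one core 4 times faster
print(four_cores, fast_core)
```

Both options give at best a 4X speedup, but under this assumption the replicated design draws 4 times the baseline dynamic power while the faster clock draws 64 times.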

RELIABILITY

• TRANSIENT FAILURES (OR SOFT ERRORS)
  • CHARGE Q = C × V
    • IF C AND V DECREASE THEN IT IS EASIER TO FLIP A BIT
  • SOURCES ARE COSMIC RAYS AND ALPHA PARTICLES RADIATING FROM THE PACKAGING MATERIAL
  • DEVICE IS STILL OPERATIONAL BUT VALUE HAS BEEN CORRUPTED
  • SHOULD DETECT/CORRECT AND CONTINUE EXECUTION
  • ALSO: ELECTRICAL NOISE CAUSES SIMILAR FAILURES
• INTERMITTENT/TEMPORARY FAILURES
  • LAST LONGER
  • DUE TO
    • TEMPORARY: ENVIRONMENTAL VARIATIONS (E.G., TEMPERATURE)
    • INTERMITTENT: AGING
  • SHOULD TRY TO CONTINUE EXECUTION
• PERMANENT FAILURES
  • MEANS THAT THE DEVICE WILL NEVER FUNCTION AGAIN
  • MUST BE ISOLATED AND REPLACED BY SPARE

PROCESS VARIATIONS INCREASE THE PROBABILITY OF FAILURES

PERFORMANCE METRICS (MEASURE)

• METRIC #1: TIME TO COMPLETE A TASK (Texe): EXECUTION TIME, RESPONSE TIME, LATENCY
  • “X IS N TIMES FASTER THAN Y” MEANS Texe(Y)/Texe(X) = N
  • THE MAJOR METRIC USED IN THIS COURSE
• METRIC #2: NUMBER OF TASKS PER DAY, HOUR, SEC, NS
  • THE THROUGHPUT FOR X IS N TIMES HIGHER THAN Y IF THROUGHPUT(X)/THROUGHPUT(Y) = N
  • NOT THE SAME AS LATENCY (EXAMPLE OF MULTIPROCESSORS)
• EXAMPLES OF UNRELIABLE METRICS:
  • MIPS: MILLIONS OF INSTRUCTIONS PER SECOND
  • MFLOPS/GFLOPS: MILLIONS/BILLIONS OF FLOATING POINT OPERATIONS PER SECOND

EXECUTION TIME OF A PROGRAM IS THE ULTIMATE MEASURE OF PERFORMANCE

BENCHMARKING
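A small numeric sketch of why MIPS is unreliable while execution time is not. All numbers are hypothetical: the same program is assumed to compile into fewer, more complex instructions on machine X than on machine Y.

```python
def speedup(t_exe_y, t_exe_x):
    """'X is N times faster than Y' means N = Texe(Y) / Texe(X)."""
    return t_exe_y / t_exe_x

def mips(instruction_count, t_exe_seconds):
    """Millions of instructions per second."""
    return instruction_count / (t_exe_seconds * 1e6)

# Hypothetical numbers: the same program compiled for two machines.
# Machine X executes fewer, more complex instructions than machine Y.
ic_x, t_x = 4e9, 4.0     # instruction count, execution time in seconds
ic_y, t_y = 10e9, 5.0

print(f"X is {speedup(t_y, t_x):.2f} times faster than Y")
print(f"MIPS: X = {mips(ic_x, t_x):.0f}, Y = {mips(ic_y, t_y):.0f}")
```

Machine X finishes the task sooner, yet machine Y reports the higher MIPS rating, because MIPS ignores how much work each instruction does.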

WHICH PROGRAM TO CHOOSE?

• REAL PROGRAMS:
  • PORTING PROBLEM; COMPLEXITY; NOT EASY TO UNDERSTAND THE CAUSE OF RESULTS
• KERNELS
  • COMPUTATIONALLY INTENSE PIECE OF REAL PROGRAM
• TOY BENCHMARKS (E.G. QUICKSORT, MATRIX MULTIPLY)
• SYNTHETIC BENCHMARKS (NOT REAL)
• BENCHMARK SUITES
  • SPEC: STANDARD PERFORMANCE EVALUATION CORPORATION
    • SCIENTIFIC/ENGINEERING/GENERAL PURPOSE
    • INTEGER AND FLOATING POINT
    • NEW SET EVERY SO MANY YEARS (95, 98, 2000, 2006)
  • TPC BENCHMARKS:
    • FOR COMMERCIAL SYSTEMS
    • TPC-B, TPC-C, TPC-H, AND TPC-W
  • EMBEDDED BENCHMARKS
  • MEDIA BENCHMARKS

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

LET $T_i$ BE THE EXECUTION TIME OF PROGRAM i (out of N programs):

1. (WEIGHTED) ARITHMETIC MEAN OF EXECUTION TIMES:
   $\frac{1}{N} \sum_i T_i$ OR $\sum_i T_i \times W_i$
   THE PROBLEM HERE IS THAT THE PROGRAMS WITH LONGEST EXECUTION TIMES DOMINATE THE RESULT
2. DEALING WITH SPEEDUPS
   • SPEEDUP MEASURES THE ADVANTAGE OF A MACHINE OVER A REFERENCE MACHINE FOR A PROGRAM i (let $T_{R,i}$ be the execution time on the reference machine):
     $S_i = \frac{T_{R,i}}{T_i}$
   • ARITHMETIC MEAN OF SPEEDUPS:
     $\bar{S} = \frac{1}{N} \sum_{i=1}^{N} S_i$
   • HARMONIC MEAN:
     $\bar{S} = \frac{N}{\sum_{i=1}^{N} \frac{1}{S_i}}$

REPORTING PERFORMANCE FOR A SET OF PROGRAMS

• GEOMETRIC MEAN OF SPEEDUPS:
  $\bar{S} = \sqrt[N]{\prod_{i=1}^{N} S_i}$
  - MEAN SPEEDUP COMPARISONS BETWEEN TWO MACHINES ARE INDEPENDENT OF THE REFERENCE MACHINE
  - EASILY COMPOSABLE
  - USED TO REPORT SPEC NUMBERS FOR INTEGER AND FLOATING POINT

Example 1 – Quantitative comparison depends on the reference machine

              Program A   Program B   Arithmetic Mean   Speedup (ref 1)   Speedup (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8              10
Machine 2     1 sec       200 sec     100.5 sec         50.2              5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

Example 2 – contrasting results with arithmetic and harmonic mean

              Program A   Program B
Machine 1     10 sec      100 sec
Machine 2     1 sec       200 sec
Reference 1   100 sec     10000 sec
Reference 2   100 sec     1000 sec

In terms of speedup:

                   Program A   Program B   Arithmetic   Harmonic   Geometric
Wrt Reference 1
  Machine 1        10          100         55           18.2       31.6
  Machine 2        100         50          75           66.7       70.7
Wrt Reference 2
  Machine 1        10          10          10           10         10
  Machine 2        100         5           52.5         9.5        22.4

GM: whichever reference machine we choose, the relative speed between the two machines is always the SAME!!
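The speedup figures of Example 2 can be recomputed from the execution times. The sketch below uses only the numbers given in the example; the helper names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def harmonic_mean(xs):
    return len(xs) / sum(1.0 / x for x in xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

# Execution times in seconds for (Program A, Program B), from Example 2.
times = {"Machine 1": (10, 100), "Machine 2": (1, 200)}
references = {"Reference 1": (100, 10000), "Reference 2": (100, 1000)}

gm = {}
for ref_name, ref_times in references.items():
    for machine, machine_times in times.items():
        # Speedup of program i is S_i = T_R,i / T_i
        speedups = [tr / t for tr, t in zip(ref_times, machine_times)]
        gm[(ref_name, machine)] = geometric_mean(speedups)
        print(ref_name, machine,
              f"AM={arithmetic_mean(speedups):.1f}",
              f"HM={harmonic_mean(speedups):.1f}",
              f"GM={geometric_mean(speedups):.1f}")

# The ratio of geometric means does not depend on the reference machine.
ratio_ref1 = gm[("Reference 1", "Machine 2")] / gm[("Reference 1", "Machine 1")]
ratio_ref2 = gm[("Reference 2", "Machine 2")] / gm[("Reference 2", "Machine 1")]
print(f"{ratio_ref1:.3f} {ratio_ref2:.3f}")
```

The arithmetic and harmonic means change the relative standing of the two machines when the reference changes, whereas the ratio of the geometric means stays the same for both references.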

FUNDAMENTAL PERFORMANCE EQUATIONS FOR CPUs (also known as “IRON LAW”)

$T_{exe} = IC \times CPI \times T_c$

• IC: DEPENDS ON PROGRAM, COMPILER AND ISA
• CPI: DEPENDS ON INSTRUCTION MIX, ISA, AND IMPLEMENTATION
• Tc: DEPENDS ON IMPLEMENTATION COMPLEXITY AND TECHNOLOGY

CPI (CLOCKS PER INSTRUCTION) IS OFTEN USED INSTEAD OF EXECUTION TIME

• WHEN PROCESSOR EXECUTES MORE THAN ONE INSTRUCTION PER CLOCK USE IPC (INSTRUCTIONS PER CLOCK)

$T_{exe} = (IC \times T_c) / IPC$
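The two forms of the iron law can be written as a pair of small functions. This is a minimal sketch with hypothetical numbers (2 billion instructions on a 2 GHz processor); the function names are illustrative.

```python
def execution_time(ic, cpi, clock_hz):
    """Iron law: Texe = IC x CPI x Tc, with Tc = 1 / clock frequency."""
    return ic * cpi / clock_hz

def execution_time_ipc(ic, ipc, clock_hz):
    """Same law written with IPC = 1 / CPI: Texe = (IC x Tc) / IPC."""
    return ic / (ipc * clock_hz)

# Hypothetical program: 2 billion instructions on a 2 GHz processor.
ic, clock_hz = 2e9, 2e9
t_scalar = execution_time(ic, cpi=1.5, clock_hz=clock_hz)           # pipeline with stalls
t_superscalar = execution_time_ipc(ic, ipc=2.0, clock_hz=clock_hz)  # 2 instructions per clock
print(t_scalar, t_superscalar)
```

Since IC and the clock are held fixed here, the ratio of the two times depends only on the CPI (or IPC) of each implementation.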

AMDAHL’S LAW

• ENHANCEMENT E ACCELERATES A FRACTION F OF THE TASK BY A FACTOR S

[Figure: the execution time without E is split into the parts 1-F and F; after applying the enhancement, the execution time with E is split into 1-F and F/S]

$T_{exe}(\text{with } E) = T_{exe}(\text{without } E) \times \left[ (1 - F) + \frac{F}{S} \right]$

$Speedup(E) = \frac{T_{exe}(\text{without } E)}{T_{exe}(\text{with } E)} = \frac{1}{(1 - F) + \frac{F}{S}}$

LESSONS FROM AMDAHL’S LAW

1) IMPROVEMENT IS LIMITED BY THE FRACTION OF THE EXECUTION TIME THAT CANNOT BE ENHANCED
   $SPEEDUP(E) < \frac{1}{1 - F}$
   • LAW OF DIMINISHING RETURNS – MARGINAL SPEEDUP
   • The difference between SPEEDUP(k+1) and SPEEDUP(k) becomes smaller and smaller as S goes from k to k+1
2) OPTIMIZE THE COMMON CASE
   • EXECUTE THE RARE CASE IN SOFTWARE (E.G. EXCEPTIONS)

[Plot for F=0.5, with curves labeled: Amdahl’s Law, Amdahl’s maximum, Remaining Speedup, Marginal Speedup]
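The diminishing returns can be tabulated for F = 0.5. This is a sketch: the function name `amdahl_speedup` is illustrative, and the "marginal" column here is the gain over the previous row (S doubles between rows) rather than the step from k to k+1.

```python
def amdahl_speedup(f, s):
    """Speedup when a fraction f of the time is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

F = 0.5
limit = 1.0 / (1.0 - F)          # Amdahl's maximum, approached as S grows without bound
previous = amdahl_speedup(F, 1)
for s in (2, 4, 8, 16, 32):
    current = amdahl_speedup(F, s)
    # "marginal" is the gain over the previous row of this table
    print(f"S={s:2d} speedup={current:.3f} marginal={current - previous:.3f} "
          f"remaining={limit - current:.3f}")
    previous = current
```

Each doubling of S buys less than the previous one, and the speedup never reaches the limit of 2 imposed by the half of the task that is not enhanced.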

PARALLEL SPEEDUP

$S_P = \frac{T_1}{T_P} = \frac{1}{(1 - F) + \frac{F}{P}} = \frac{P}{F + P(1 - F)} < \frac{1}{1 - F}$

• NOTE: SPEEDUP CAN BE SUPERLINEAR. HOW CAN THAT BE??

OVERALL NOT VERY HOPEFUL

[Plot for F=0.95, with curves labeled: Ideal speedup, Amdahl’s Law (“mortar shot”), Amdahl’s maximum]

GUSTAFSON’S LAW

• REDEFINE SPEEDUP
  • THE RATIONALE IS THAT, AS MORE AND MORE CORES ARE INTEGRATED ON CHIP OVER TIME, THE WORKLOADS ARE ALSO GROWING
• STARTS WITH THE EXECUTION TIME ON THE PARALLEL MACHINE WITH P PROCESSORS:
  $T_P = s + p$
  • s IS THE TIME TAKEN BY THE SERIAL CODE AND p IS THE TIME TAKEN BY THE PARALLEL CODE
• EXECUTION TIME ON ONE PROCESSOR IS
  $T_1 = s + pP$
• Let $F = p/(s+p)$. Then $S_P = \frac{s + pP}{s + p} = \frac{s + p - p + pP}{s + p} = 1 - F + FP = 1 + F(P - 1)$

Gustafson observes that, although a single fixed-size program completes faster only if its parallel portion is dominant (Amdahl), the same algorithm completes faster and faster as we add processors (P) when compared with a purely sequential execution that simply repeats the parallel portion (p) P times.

[Plot: $S_P$ versus F]
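The two speedup definitions can be compared side by side. This is a sketch with illustrative function names; note that F does not mean the same thing in the two laws: for Amdahl it is the parallel fraction of the single-processor time, for Gustafson it is the parallel fraction of the time on the P-processor machine.

```python
def amdahl_parallel_speedup(f, p):
    """Fixed workload: S_P = 1 / ((1 - F) + F / P)."""
    return 1.0 / ((1.0 - f) + f / p)

def gustafson_speedup(f, p):
    """Workload grows with P: S_P = 1 + F * (P - 1)."""
    return 1.0 + f * (p - 1)

F = 0.95
for p in (1, 4, 16, 64, 256):
    print(f"P={p:3d} Amdahl={amdahl_parallel_speedup(F, p):6.2f} "
          f"Gustafson={gustafson_speedup(F, p):7.2f}")
```

With F = 0.95 the fixed-workload speedup stays below 1/(1-F) = 20 however many processors are used, while the scaled-workload speedup keeps growing linearly with P.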


PIPELINING

Pipelining

• Pipelining principles
• Simple Pipeline
• Structural Hazards
• Data Hazards
• Control Hazards

Pipelining principles

• Let T be the time to execute an instruction
• Consider instructions composed of n phases of equal duration
• Without pipelining
  • Latency = T
  • Throughput_seq = 1 / T
• With an ideal n-stage pipeline
  • Latency = T
  • Throughput_pipe = n / T
• Speedup = Throughput_pipe / Throughput_seq = n

[Figure: successive instructions, each of duration T and divided into stages 1, 2, ..., n, overlapped in time]

The (ideal) speedup obtainable from an ideal pipeline is equal to n
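The speedup of n is a steady-state value; for a finite number of instructions the time to fill the pipeline has to be counted. The sketch below assumes an ideal pipeline (no hazards, no latch overhead); the function names and the values of T and n are hypothetical.

```python
def sequential_time(k, t):
    """Time to run k instructions one after another, each taking t."""
    return k * t

def pipelined_time(k, t, n):
    """Ideal n-stage pipeline: the first instruction takes t, and one more
    instruction completes every t / n afterwards (no hazards, no latch overhead)."""
    return t + (k - 1) * t / n

T, n = 5.0, 5   # hypothetical: 5 ns per instruction, 5 stages
for k in (1, 10, 1000, 1_000_000):
    s = sequential_time(k, T) / pipelined_time(k, T, n)
    print(f"k={k:>7} speedup={s:.3f}")
```

A single instruction gains nothing from pipelining, and the speedup approaches n only as the number of instructions grows.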

Implementation of a Simple Pipeline

• Simple 5-stage pipeline
  • F -- Instruction Fetch
  • D -- Instruction Decode + Operand Fetch
  • X -- Execution and Effective Address
  • M -- Memory Access
  • W -- Write-back Results

[Figure: stages F, D, X, M, W separated by latches that are driven by a common clock]

5-STAGE PIPELINE

INSTRUCTIONS GO THROUGH EVERY STAGE IN PROCESS ORDER, EVEN IF THEY DON’T USE THE STAGE
• NOTE: CONTROL IMPLEMENTATION
  • INSTRUCTION CARRIES CONTROL
  • THIS IS A GENERAL APPROACH: “INSTRUCTION CARRIES ITS BAGGAGE”

Notation

5-stage pipeline (columns are clock cycles):

      1 2 3 4 5 6 7 8 9
i     F D X M W
i+1     F D X M W
i+2       F D X M W
i+3         F D X M W
i+4           F D X M W

F = inst. fetch, D = inst. decode, X = execute, M = memory access, W = write back
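The notation above can be generated mechanically for an ideal pipeline with no stalls. This is only an illustrative sketch; `pipeline_diagram` is a hypothetical helper, and the header line assumes at most 9 cycles so that every cycle number is a single digit.

```python
STAGES = ["F", "D", "X", "M", "W"]

def pipeline_diagram(num_instructions):
    """Return the rows of an ideal pipeline diagram (no stalls)."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for k in range(num_instructions):
        # Instruction i+k enters stage F in cycle k+1 (column index k).
        cells = [" "] * k + STAGES + [" "] * (total_cycles - k - len(STAGES))
        label = "i" if k == 0 else f"i+{k}"
        rows.append(f"{label:<4}" + " ".join(cells))
    return rows

header = "    " + " ".join(str(c) for c in range(1, 10))
print(header)
print("\n".join(pipeline_diagram(5)))
```

Five instructions occupy 5 + 5 - 1 = 9 cycles in total, which is why the cycle axis of the slide runs from 1 to 9.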

Pipeline Hazards

• Conditions that lead to a malfunction if certain countermeasures are not taken

1) Structural Hazards
   • Two instructions want to use the same hardware resource in the same cycle (conflict over resources, e.g. Instruction Mem. and Data Mem.)
2) Data Hazards
   • Two instructions use the same data: the accesses must happen in the order defined by the programmer, even if the pipeline overlaps parts of the instructions' execution (see RAW, WAW, WAR)
3) Control Hazards
   • An instruction (branch, jump, call) determines which instructions must be executed next, but the pipeline has already fetched the instructions that follow it sequentially, even if the jump is taken

1) Structural Hazards

• Two instructions want to use the same hardware resource in the same cycle
• Example: a load / store uses the same memory that is used by the instruction fetch

i     F D X M W            <-- load instruction
i+1     F D X M W
i+2       F D X M W
i+3         * F D X M W    <-- i-fetch stalls
i+4           F D X M . . .
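The stall in the example can be reproduced with a very small model of a memory that is shared by instruction fetch and data access. This is a simplified sketch: `fetch_cycles` is a hypothetical helper, it assumes a single-ported memory in which the data access of stage M always wins over the fetch, and it ignores every other kind of hazard.

```python
def fetch_cycles(is_memory_op):
    """Cycle in which each instruction is fetched when instruction fetch and
    data access share one single-ported memory.

    is_memory_op[k] is True if instruction k is a load or store.
    Stage M is three cycles after stage F in the F D X M W pipeline.
    """
    busy = set()        # cycles in which the memory is taken by stage M
    cycles = []
    next_cycle = 1
    for mem_op in is_memory_op:
        cycle = next_cycle
        while cycle in busy:      # fetch must wait: structural hazard
            cycle += 1
        cycles.append(cycle)
        if mem_op:
            busy.add(cycle + 3)   # this instruction will use the memory in M
        next_cycle = cycle + 1
    return cycles

# The example from the slide: a load followed by four non-memory instructions.
print(fetch_cycles([True, False, False, False, False]))
```

The load fetched in cycle 1 occupies the memory in cycle 4, so the fetch of instruction i+3 slips from cycle 4 to cycle 5 and every later instruction is delayed by one cycle.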

Resolving structural hazards

• Stall one of the involved instructions
  + Cost-effective and simple
  - Reduces the performance
  - Used for some rare events
• Pipeline the resource
  • Useful if possible (e.g., for resources that require more cycles)
  + Good performance
  - In some cases too complex to do (e.g. RAM)
• Replicate the resource
  + Good performance
  - Costly
  - Probably introduces delays
  - Used for cheap (or indivisible) resources

[Figure: a replicated resource placed between a De-mux and a Mux]

Guidelines to reduce the structural hazards

• Structural hazards can be avoided if each instruction uses the resource:
  • At most once
    - (e.g., separate Instruction memory and Data memory)
  • Always in the same pipeline stage
    - (e.g. I-Fetch in stage F, R/W in stage M)
  • For a single cycle
    - (e.g. HIT in the data or instruction cache)
• Many RISC processor ISAs were designed with this in mind
• Example of problematic situation:
  • MISS in cache: pipeline stalls

2) Data Hazards

• Two instructions use the same data: the accesses must happen in the order indicated by the programmer, even if the pipeline overlaps parts of the instructions' execution. Example:

R1 <- R2 + R3
R2 <- R1 - R7
R1 <- R5 OR R6

Data Hazards -- examples

r1 ?? Read-After-Write (RAW) Hazard

i     add r1, r2, r3   F D X M W              <-- writing r1
i+1   sub r2, r1, r7     F D X M W            <-- reading r1

r1 ?? Write-After-Read (WAR) Hazard

i+1   sub r2, r1, r7   F * * * * * D X M W    <-- reading r1
i+2   or  r1, r5, r6     F D X M W            <-- writing r1

Note: PURELY HYPOTHETICAL SITUATION; it cannot happen in this pipeline, by construction

r1 ?? Write-After-Write (WAW) Hazard

i     add r1, r2, r3   F * * * D X M W        <-- writing r1
i+1   sub r2, r1, r7     F D X M W
i+2   or  r1, r5, r6       F D X M W          <-- writing r1

Note: PURELY HYPOTHETICAL SITUATION; it cannot happen in this pipeline, by construction

Dependency and Hazard

Dependency: a situation in the code that can potentially create hazards
• Read-After-Write (RAW, true dependence)
  • There is a real "data exchange" from one instruction to another
• Write-After-Read (WAR, anti-dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Write-After-Write (WAW, output dependence)
  • An artificial dependence that comes from a bad assignment of registers
• Read-After-Read (RAR)
  • Will not cause problems

HAZARDS
• Dependencies may or may not turn into hazards, depending on the hardware
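The three kinds of dependence can be detected by comparing the registers that two instructions read and write. The sketch below applies this to the three-instruction example used in the data hazard slides; `Instr` and `dependences` are hypothetical names, and the check is purely pairwise (it does not consider intervening writes).

```python
from collections import namedtuple

# dest is the register written; srcs are the registers read.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])

def dependences(first, second):
    """Classify the dependences of `second` on an earlier instruction `first`."""
    found = []
    if first.dest in second.srcs:
        found.append("RAW")          # true dependence
    if second.dest in first.srcs:
        found.append("WAR")          # anti-dependence
    if second.dest == first.dest:
        found.append("WAW")          # output dependence
    return found

program = [
    Instr("add", "r1", ("r2", "r3")),
    Instr("sub", "r2", ("r1", "r7")),
    Instr("or",  "r1", ("r5", "r6")),
]

for a in range(len(program)):
    for b in range(a + 1, len(program)):
        print(program[a].op, "->", program[b].op, dependences(program[a], program[b]))
```

These are dependences found in the code; whether any of them becomes a hazard is a separate question that depends on the pipeline.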

Page 46: High Performance Computer Architecturegiorgi/teaching/... · 13,14: Bus architecture API=Application Program Interface ... • MFLOPS/GFLOPS: MILLION/BILLION OF FLOATING POINT OPERATIONS

True Dependence and MIPS Data Hazards

• Read After Write (RAW): the instruction J tries to read an operand before the instruction I writes it

I: add r1,r2,r3
J: sub r4,r1,r3

• Caused by a "dependence" (in the terminology of compiler theory) called "true dependence"
  • The hazard results from a real need for data communication
• In the MIPS processor, a true dependence normally generates a hazard


Anti-Dependence and MIPS Data Hazards

• Write After Read (WAR): the instruction J tries to write an operand before the instruction I reads it

I: sub r4,r1,r3
J: add r1,r2,r3

• Also called "anti-dependence" (in the terminology of compiler theory)
• It results from having reused the name "r1", while another register could easily have been used (*)
• Does not cause a conflict in the 5-stage MIPS pipeline because:
  - all instructions take 5 stages, and
  - the reads from registers occur in stage 2 (D), and
  - all writes are always in stage 5 (W)

(*) This, however, is not always possible in software; see Lesson 2

Output Dependence and MIPS Data Hazards

• Write After Write (WAW): the instruction J tries to write an operand before the instruction I writes it

I: mul r1,r4,r3
J: add r1,r2,r3

• Also called "output dependence" (in the terminology of compiler theory)
• Also in this case, it results from having reused the name "r1", while another register could easily have been used (*)
• Does not cause a conflict in the 5-stage MIPS pipeline because:
  - all instructions take 5 stages, and
  - all writes are always in stage 5 (W)

(*) This, however, is not always possible in software; see Lesson 2

Simple resolution of the RAW hazard

• The hardware detects the RAW hazard, and then...
• Generates a stall to allow the "producer" instruction to finish

              F D X M W
R1<-R2+R3       F D X M W
R2<-R1-R7         F * * D X M W

+ Cost-effective and simple
- Reduces the performance

NOTE: It is assumed that the registers can be written in the first half of the cycle (W) and read in the second half of the cycle (D)


Implementation: stall control network

• Add latches to remember the RS1/RS2/RD register identifiers at each stage
• The stall is detected by making the following comparison:
  if (RS1(D) == RD(X) || RS1(D) == RD(M)) then STALL (generates stall in F)
• Similarly for RS2
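The comparison above can be written as a small predicate. This is a minimal sketch that assumes register identifiers are plain strings and that `None` stands for a stage whose instruction writes no register; `must_stall` is a hypothetical name.

```python
def must_stall(rs1_d, rs2_d, rd_x, rd_m):
    """Return True when the instruction in stage D must stall.

    rs1_d, rs2_d: source register identifiers latched in stage D.
    rd_x, rd_m:   destination identifiers latched in stages X and M.
    None means "no register" (assumption made for this sketch).
    """
    for src in (rs1_d, rs2_d):
        if src is not None and src in (rd_x, rd_m):
            return True
    return False


# sub r2, r1, r7 in D while add r1, r2, r3 is in X: stall
print(must_stall("r1", "r7", "r1", None))
```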

[Figure: stall control network across Stage D, Stage X and Stage M. Labels: Register file, A, B, Execution unit, D-Cache, RS1, RD, RD, Stall Control, stall]


Inserting Stalls (detail)

• Relative to the instruction that creates the stall, it is necessary to:
  • In the previous stages: freeze all the inter-stage latches
  • In the next stages: turn off the valid bit associated with the inter-stage latch, so that the "bubble" can proceed through the pipeline without creating problems

[Figure: Previous stage | Stage of the instruction which STALLS | Next Stage | Next Stage. The STALL SIGNAL goes to the flip-flop HOLD signal; the valid bits (V) shown are Valid Bit = 0 and Valid Bit = 1]


Reduction of the RAW stalls

• Bypass/Forward/Short-Circuit network
• Idea: use the data before it is written in the registers
  + Reduces (potentially avoids) stalls
  - Additional complexity

[Figure: bypasses across the IF, ID, EX, ME, WB pipeline stages]


Bypass

• Additional Hardware
  • Multiplexers to select the input value to the ALU: either from the registers or from the bypass network
  • Hazard detection logic (called interlock) that controls these multiplexers

[Figure: bypass data path. Labels: Register file, operand latches, MUX, MUX, Execution Unit (ALU), Result latch, bypass, bypass control]

Network to detect the possibility of Bypass

• Add latches to remember RS1/RS2/RD register names at each stage (similar to the network for hazard detection)
• E.g., on the input A of the ALU, the network will act like this:
  if RS1(D) == RD(X) then select ALU-OUT(X)
  else if RS1(D) == RD(M) then select D-CACHE-OUT(M)
  else select (A)
  …similarly on the B input…

[Figure: bypass control network. Labels: Register file, A, B, MUX, MUX, Execution Unit (ALU), Result latch, D-Cache, RS1, RD, RD, Bypass Control, ALU-OUT(X), D-CACHE-OUT(M)]
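The selection rule for the A input of the ALU can be sketched as a function that plays the role of the multiplexer. The function and parameter names are hypothetical, and `None` is assumed to mean that the stage holds no register-writing instruction.

```python
def select_alu_input_a(rs1_d, rd_x, rd_m, reg_a, alu_out_x, dcache_out_m):
    """Model of the MUX on ALU input A, following the priority on the slide.

    The most recent producer (stage X) wins over the older one (stage M);
    if neither matches, the value read from the register file is used.
    """
    if rd_x is not None and rs1_d == rd_x:
        return alu_out_x
    if rd_m is not None and rs1_d == rd_m:
        return dcache_out_m
    return reg_a


# r1 is being produced in stage X: take the ALU output instead of the stale register value
print(select_alu_input_a("r1", "r1", None, reg_a=0, alu_out_x=42, dcache_out_m=7))
```

Checking stage X first matters: when both X and M are about to write the same register, the X result is the newer one.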

Interaction between control networks Stall/Bypass

• The stall logic is aware of the presence of the bypass logic
• The bypass logic is activated independently at each stall condition

[Figure: combined stall and bypass networks. Labels: Register file, A, B, MUX, MUX, Execution unit, D-Cache, RS1, RD, RD, Bypass Control, Stall Control]

Pipeline Scheduling

• Scheduling of instructions at compile-time (reorder instructions to reduce the stalls caused by load instructions):

BEFORE:
a = b + c;    R1 <- mem(b)
              R2 <- mem(c)
              stall
              R3 <- R1 + R2
              mem(a) <- R3
d = e - f;    R4 <- mem(e)
              R5 <- mem(f)
              stall
              R6 <- R4 - R5
              mem(d) <- R6

AFTER:
              R1 <- mem(b)
              R2 <- mem(c)
              R4 <- mem(e)
              R3 <- R1 + R2
              R5 <- mem(f)
              mem(a) <- R3
              R6 <- R4 - R5
              mem(d) <- R6
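The effect of the reordering can be checked by counting load-use stalls. The sketch below assumes a simplified model that is not stated on the slide in these terms: one stall occurs whenever an instruction reads a register loaded by the instruction immediately before it. The tuple encoding and the function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count stalls under a one-cycle load-use delay (simplifying assumption).

    Each instruction is (kind, dst, srcs); kind is "load", "alu" or "store".
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_kind, prev_dst, _ = prev
        if prev_kind == "load" and prev_dst in cur[2]:
            stalls += 1
    return stalls


before = [
    ("load", "R1", ()), ("load", "R2", ()),
    ("alu", "R3", ("R1", "R2")), ("store", None, ("R3",)),
    ("load", "R4", ()), ("load", "R5", ()),
    ("alu", "R6", ("R4", "R5")), ("store", None, ("R6",)),
]
after = [
    ("load", "R1", ()), ("load", "R2", ()), ("load", "R4", ()),
    ("alu", "R3", ("R1", "R2")), ("load", "R5", ()),
    ("store", None, ("R3",)), ("alu", "R6", ("R4", "R5")),
    ("store", None, ("R6",)),
]
print(count_load_use_stalls(before), count_load_use_stalls(after))
```

Under this model the BEFORE sequence has the two stalls marked on the slide and the AFTER sequence has none.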


Dynamic Instruction Scheduling (Introduction)

High Performance Computer Architecture

-Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ02-SL di 19-1

Example: limits of the sequential execution

• Multiplication of the elements of two vectors, storing the result in a third vector

      r1 <- looplength
      r2 <- 0
      r4 <- addr(b)          # load
      r5 <- addr(c)          # the pointers
      r6 <- addr(a)          # to the variables

loop: r3 <- mem(r4+r2)       # load b(i)
      r7 <- mem(r5+r2)       # load c(i)
      r7 <- r7 * r3          # b(i) * c(i)
      r1 <- r1 - 1           # decr. the counter
      mem(r6+r2) <- r7       # store a(i)
      r2 <- r2 + 8           # update the index
      P <- loop; r1!=0       # close the loop

“In-Order” execution – standard pipeline

The situation:

      . . .
      r7 <- mem(r5+r2)       # load c(i)
      r7 <- r7 * r3          # b(i) * c(i)
      r1 <- r1 - 1           # decr. counter
      . . .

The Problem:
• The Load is launched (“issue” phase)
• The Multiply stalls because of the true dependency (on r7)
• The Subtract stalls because the Multiply stalls (and the next Branch stalls too)
  - Why should the Subtract (needlessly) stall?
• In-order execution limits the performance

Possible solutions

• Static Scheduling (Software)
  • The compiler re-orders the instructions
  + it can implement “tricks” much more powerful than the (high-level) programmer may know
  + it can make use of simpler (and therefore faster) hardware
  - it requires additional work from the compiler
  - it may not adapt to events (e.g., a cache miss) that occur at runtime
  • Adopted in the Intel Itanium and in VLIW processors in general
• Dynamic Scheduling (Hardware) or “Out-Of-Order Issue”
  • The hardware re-orders the instructions
  + it can handle events not known at compile-time
  + the software is less dependent on specific hardware (portability)
  - the hardware is certainly more complex
  • Adopted in almost all superscalar processors


Dynamic Scheduling or “Out-Of-Order” Issue

• RULES for proper operation:
  RU1) The out-of-order issue must respect “True dependences”, i.e., it must not generate RAW hazards
  RU2) The out-of-order issue must avoid “False dependences”, i.e., it must not generate (avoidable) WAR and WAW hazards

A code snippet with false dependencies:

      r7 <- mem(r5+r2)       # first instruction (I1)
      r7 <- r7 * r3          # second instruction (I2)
      r3 <- r6 + 1           # third instruction (I3)

IN-ORDER issue (and execution), over time t:       I1 I2 I3
OUT-OF-ORDER issue (and execution), over time t:   I1 I3 I2   (in this case resulting in a WRONG execution)

If r3 is changed (out-of-order) by the third instruction (I3), the multiplication (I2) would read an incorrect value from r3
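The two rules can be illustrated with a conservative pairwise check of an issue order against the program order, without any register renaming. The encoding of instructions as `(dst, srcs)` pairs and the function name are assumptions made for this sketch.

```python
def ordering_violations(program, issue_order):
    """List the dependencies broken by an issue order (no renaming).

    program:     list of (dst, srcs) in program order.
    issue_order: list of program indices in the order they are issued.
    Returns (earlier, later, kind) for every dependent pair whose order
    was swapped. Pairwise and conservative: intervening redefinitions
    of a register are not taken into account.
    """
    position = {idx: pos for pos, idx in enumerate(issue_order)}
    found = []
    for i, (dst_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            if position[j] > position[i]:
                continue  # relative order preserved, nothing to check
            dst_j, srcs_j = program[j]
            if dst_i in srcs_j:
                found.append((i, j, "RAW"))
            if dst_j in srcs_i:
                found.append((i, j, "WAR"))
            if dst_i == dst_j:
                found.append((i, j, "WAW"))
    return found


snippet = [
    ("r7", ("r5", "r2")),   # I1: r7 <- mem(r5+r2)
    ("r7", ("r7", "r3")),   # I2: r7 <- r7 * r3
    ("r3", ("r6",)),        # I3: r3 <- r6 + 1
]
print(ordering_violations(snippet, [0, 2, 1]))
```

For the order I1 I3 I2, the check reports the WAR dependence on r3 between I2 and I3, which is the incorrect read described above.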

Resolving False Dependencies by SW ?

• The compiler can avoid some false dependencies BUT NOT ALL (some are desired). E.g.:

      r1 <- x
      if y != 0  r1 <- z
      r3 <- r1 + r2

• When the “y!=0” condition is true, there is an output dependence on r1 (and then a true dependence with the next instruction)
• There is no “false dependence” when the branch is not taken
• An instruction may also create dependencies with itself
  - (e.g., this happens in the dynamic scheduling of instructions in loops)


Resolving False Dependencies by HW !

• We can effectively eliminate false dependencies by renaming the registers in hardware
• The register identifiers then have a one-to-one association with the processed values
  • NOT with physical locations
• The hardware can rename registers to remove false dependencies. E.g. (same example as in slide 5):

      r7 <- mem(r5+r2)          t1 <- mem(r5+r2)
      r7 <- r7 * r3             t2 <- t1 * r3
      r3 <- r6 + 1              t3 <- r6 + 1

t1 and t2 are the new names of r7
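The renaming in the example can be reproduced with a few lines of code. This is a minimal sketch under the assumption that every result simply gets a fresh name `t1`, `t2`, ...; real hardware draws tags from a finite pool, and the `rename` function is a hypothetical illustration.

```python
def rename(program):
    """Give every produced value a fresh name and redirect later readers.

    program: list of (dst, srcs) with architectural register names.
    Returns a list of (new_dst, new_srcs).
    """
    latest = {}      # architectural register -> name of its latest value
    renamed = []
    for n, (dst, srcs) in enumerate(program, start=1):
        # Sources are looked up before the destination is renamed,
        # so "r7 <- r7 * r3" reads the old r7 and writes a new one.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        latest[dst] = f"t{n}"
        renamed.append((latest[dst], new_srcs))
    return renamed


code = [
    ("r7", ("r5", "r2")),   # r7 <- mem(r5+r2)
    ("r7", ("r7", "r3")),   # r7 <- r7 * r3
    ("r3", ("r6",)),        # r3 <- r6 + 1
]
for dst, srcs in rename(code):
    print(dst, "<-", srcs)
```

After renaming, the third instruction writes t3 rather than r3, so it no longer conflicts with the multiplication that reads r3.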


A first approach: Thornton's Scoreboard

• Implemented in the CDC 6600 (1964)
• The CDC 6600 had 18 non-pipelined functional units
  • 4 FP units: 2 multiply, 1 add, 1 divide
  • 7 Memory units: 5 load, 2 store
  • 7 Integer units: 3 add, 1 shift, 1 logical, ...
• Scoreboard: a centralized control scheme
  • Checks the launch of every instruction
  • Detects hazards
• Realizes the "Out-Of-Order issue" but does not use the "renaming" of the registers
  • The critical WAR and WAW hazards may stall the instruction issue
• Mainly of historical interest

[Figure: CDC 6600 block diagram. Labels: Instruction "stack", Decode, Boolean, Shift, Fixed Add, Float Add, Float Mult. (x2), Float Divide, Short Fixed Add (x2)]


IBM 360/91 and Tomasulo’s Algorithm

• "Fast" version of the IBM 360 for scientific programs
  • The chief architect was Gene Amdahl
  • Announced in 1964 and available since 1965
• Tomasulo’s Algorithm
  • Published in 1967 (before the introduction of the cache memory concept)
  • The Floating Point units are pipelined and include:
    - Adder
    - Multiplier (the Division is executed in the Multiplier)
• Dynamic Scheduling in the FP unit
  • Uses an algorithm known to history as “Tomasulo’s Algorithm”
  • First used in the IBM 360/91


Generalized Tomasulo’s Algorithm

• Improves the architecture of a parallel pipeline (i.e., a pipeline with functional units that operate in parallel)
• Extends Tomasulo’s Algorithm to all functional units
  - Not only to the Floating Point units
  - Moreover: it manages the loads/stores like any other instruction
• Introduces the concept of “Reservation Station” (RS)
  • Each Functional Unit (FU) has a set of associated RSs
  • The RSs hold information about the available operands and about those that are not yet available because they are produced later in the pipeline
  • The RSs account for “future operands” by associating a “TAG” that tells exactly which functional unit will produce the future value
  • Once the result is produced by the given FU, it is broadcast on a Common Data Bus (CDB) along with the TAG of the FU
  • A matching logic associates the CDB value+tag with the waiting RSs

Generalized Tomasulo’s Algorithm -- Scheme

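[Figure: scheme of the generalized Tomasulo pipeline. Stage labels: F (I-cache access), D (Decode), P (DISPATCH), I (ISSUE), X1 X2 X3 X4, C (Complete), followed by WRITE-BACK on the Common Data Bus. Functional units: M FU (Mult. 1, Mult. 2, Mult. 3, Mult. 4) with 2 RS (M RS); LS FU (Address Add, D-Cache 1, D-Cache 2) with 3 RS (LS RS); A FU (Integer) with 3 RS (A RS). Queues: LQ with 3 elements, SQ with 3 elements. Other blocks: NIP, CIP, Regs]

NIP=Next Instruction to disPatch, CIP=Current Instruction to disPatch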


Structure of a Reservation Station (RS)

RSx:   | Busy  | Op          | Qj     | Qk     | Vj      | Vk      | I       |
       | 1 bit | e.g. 6 bits | 4 bits | 4 bits | 32 bits | 32 bits | 16 bits |

Busy     The element is not available
Op       Opcode
Qj, Qk   Source FU designators associated with this instruction at dispatch time
         * if ZERO, the corresponding Vx is the current value
         * if !=ZERO, the Qx indicates the FU that WILL produce the value
Vj, Vk   Operand value
I        Immediate value (if specified by the instruction)

Note-1: only one of Vx and Qx is valid at a given time
Note-2: For loads, Vk contains an immediate value (e.g., an offset)

• The Register File fields, in addition to the value Vi, also include the tag Qi that may indicate the RS that will produce such value (when Qi != 0):

Ri:    | Qi | Vi |

Qi       Tag field
Vi       Operand value

Generalized Tomasulo’s Algorithm -- Dispatch

• 3 fundamental steps (**)
  • Dispatch (book: “Issue”)
  • Issue (book: “Execute”)
  • Complete (book: “Write”)
• disPatch
  • Take the next instruction from the Decode unit
  • Locate the first free appropriate RS
  • If there is no free appropriate RS -> structural hazard
  • If there is a free appropriate RS -> dispatch the instruction
  • Copy the operands (1 or 2, if ready) from the registers to the RS
  • If one or both operands are not ready, write in the RS, instead of the values, the tags, i.e., the RS identifiers of the producer instructions
  • This corresponds to resolving false dependencies by “renaming” the registers with the tags (RULE RU2, see slide 5)

(**) NOTE: The terminology used here is “more modern” than that of the Hennessy-Patterson book
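The dispatch step can be sketched with two small record types. This is an illustration under assumptions, not the machine described in the slides: the `Reg` and `RS` classes, the `dispatch` function, and the restriction to two-source instructions are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Reg:
    q: int = 0   # tag of the RS that will produce the value (0 = value valid)
    v: int = 0   # current value, meaningful only when q == 0


@dataclass
class RS:
    id: int          # tag identifying this reservation station (never 0)
    busy: bool = False
    op: str = ""
    qj: int = 0
    qk: int = 0
    vj: int = 0
    vk: int = 0


def dispatch(op, dst, src_j, src_k, stations, regs):
    """Place one instruction in the first free RS of `stations`.

    Returns the RS used, or None on a structural hazard (no free RS).
    """
    free = next((rs for rs in stations if not rs.busy), None)
    if free is None:
        return None
    free.busy, free.op = True, op
    rj, rk = regs[src_j], regs[src_k]
    # Copy the value if it is ready, otherwise the tag of its producer.
    free.qj, free.vj = (rj.q, 0) if rj.q else (0, rj.v)
    free.qk, free.vk = (rk.q, 0) if rk.q else (0, rk.v)
    # Renaming: the destination register now waits for this RS.
    regs[dst].q = free.id
    return free


# Both operands still pending from the load RSs with tags 6 and 7.
regs = {"r3": Reg(q=6), "r7": Reg(q=7)}
mult_stations = [RS(id=4), RS(id=5)]
m1 = dispatch("mult", "r7", "r3", "r7", mult_stations, regs)
print(m1, regs["r7"])
```

The sources are read before the destination tag is overwritten, which is what lets an instruction such as `r7 <- r7 * r3` wait for the old r7 while renaming the new one.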

Generalized Tomasulo’s Algorithm -- Issue

• Issue
  • If one of the RSs associated with a FU has its operands ready, then the associated instruction is launched (issued) and later executed in the stage(s) that follow(s)
  • If the operands are not yet available, it means that they will become available on the CDB later on
  • In this way, the execution is delayed until the arrival of the operands, automatically solving the true dependencies (RULE RU1, see slide 5)

Generalized Tomasulo’s Algorithm -- Complete

• Complete
  • If the CDB is available, the FU result will be written on the CDB, along with the identifier ('Id') of the producer RS
  • The RSs and the registers act as an associative memory with respect to the tag: all RSs that contain a tag equal to Id will store the data seen on the CDB
  • The registers are updated too in this phase, with the same mechanism
  • If the CDB is not available, we have a stall (for structural hazard)
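The tag matching on the CDB, together with the readiness test used at issue time, can be sketched as follows. Reservation stations and registers are modeled here as plain dictionaries; the field names follow the slides, while the `broadcast` and `ready` functions are hypothetical, and the tag is assumed to be nonzero.

```python
def broadcast(tag, value, stations, regs):
    """Deliver (tag, value) from the CDB to every waiting RS and register."""
    for rs in stations:
        if not rs["busy"]:
            continue
        for q, v in (("Qj", "Vj"), ("Qk", "Vk")):
            if rs[q] == tag:
                rs[v], rs[q] = value, 0   # operand captured, tag cleared
    for reg in regs.values():
        if reg["Q"] == tag:
            reg["V"], reg["Q"] = value, 0


def ready(rs):
    """An RS can be issued when both operands are values, not tags."""
    return rs["busy"] and rs["Qj"] == 0 and rs["Qk"] == 0


# mult waiting in RS 4 for the load in RS 7; r7 already renamed to tag 4
m1 = {"busy": True, "Qj": 0, "Vj": 13, "Qk": 7, "Vk": None}
regs = {"r3": {"Q": 0, "V": 13}, "r7": {"Q": 4, "V": None}}

print(ready(m1))
broadcast(7, 11, [m1], regs)   # the load writes 11 on the CDB
print(ready(m1), m1["Vk"], regs["r7"])
```

Note that r7 is not updated by the load's broadcast, because its tag already points to the later producer (the mult); only the mult's own broadcast with tag 4 will write it.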

Generalized Tomasulo’s Algorithm -- Commit

• Instructions can now complete out-of-order
• The machine does not need a “commit stage”, but we can think of a “safe state” of the machine as if it were a subsequent “commit stage” after the “complete stage”
• This “safe state” is useful to know at which cycle we can consider that the machine has completed **in-order** all previously issued instructions
• See an exercise such as the 23/06/2005 test to better understand this point

Generalized Tomasulo’s Algorithm – the RSs

• Reservation Stations
  • Implement a distributed hazard control
    - in the dispatch phase, the tags realize the renaming and avoid WAW and WAR
    - in the issue phase, the tag matching on the CDB realizes the necessary delay for a previous data item, thus eliminating RAW (key observation: we delay only the single instruction, NOT the whole pipeline!)
  • In our scheme: the RSs are numbered from 1 to 8
    - The ‘zero’ is a reserved tag value indicating that the operand does not come from an RS but is ready in the ‘operand field’
    - The tag takes 4 bits
    - All the "receivers" use the 'Id' field to identify the incoming data on the CDB
  • During Dispatch
    - The tags are created and associated with the produced values (initially not yet available), i.e., they substitute the register identifiers


Loads and Stores

• The Load is managed like the other instructions
  • It is processed by the Load/Store FU
  • The address passes through the Load Queue (LQ)
• The EFFECTIVE address calculation phase of store instructions is separated from the writing-into-memory phase
  • The address A is sent to a queue associated with the Store FU (called Store Queue or SQ), in the Dispatch phase
  • The data to be written is sent directly into the SQ
  • The Load/Store FU then sends the EFFECTIVE address to the SQ
  • When the data arrives, it is associated with the effective address within the waiting instruction in the SQ, through the tag field
• Resolving data hazards that happen through the memory locations
  • Loads and stores must be completed "in-order"
  • A load address must be compared with all previous addresses in the SQ
    - If there is a match, the load must wait for the completion of the corresponding store (stall)
    - If there is NO MATCH, then the load can "go ahead" in the SQ and complete out-of-order
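The address comparison between a load and the pending stores can be sketched as below. The sketch assumes that every earlier store in the SQ already has a known effective address, which the slides do not discuss; the entry layout (`A`, `Q`, `V`) follows the Store Queue fields, and `load_may_proceed` is a hypothetical name.

```python
def load_may_proceed(load_addr, store_queue):
    """Return True when no earlier pending store targets the load address.

    store_queue: entries of earlier stores, each a dict with
    A (effective address), Q (data tag) and V (data value).
    """
    return all(entry["A"] != load_addr for entry in store_queue)


# One pending store to address 3000, still waiting for its data (tag 4)
sq = [{"A": 3000, "Q": 4, "V": None}]
print(load_may_proceed(1008, sq))   # different address
print(load_may_proceed(3000, sq))   # same address: wait for the store
```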


Loads and Stores

• Store Queue (SQ)

  SQEi:  | Ai | Qi | Vi |

  Ai   Address value
  Qi   Data Tag field
  Vi   Data value

• Load Queue (LQ)

  LQEi:  | Ai |

  Ai   Address value

[Figure: load/store data path. Labels: issue instruction, add address & translation, load addresses, store addresses, Load Queue, Store Queue, compare address, hazard control, store data from CDB, to memory]

http://www.dii.unisi.it/~giorgi/teaching/hpca2

Dynamic Instruction Scheduling (Example)

High Performance Computer Architecture

-Roberto Giorgi, Universita' degli Studi di Siena, C217ES01--SL di 18-1

Example

loop: r3 <- mem(r4+r2)       # load b(i)
      r7 <- mem(r5+r2)       # load c(i)
      r7 <- r7 * r3          # b(i) * c(i)
      r1 <- r1 - 1           # decr. counter
      mem(r6+r2) <- r7       # store a(i)
      r2 <- r2 + 8           # bump index
      P <- loop; r1!=0       # close loop

CYCLE 1

Reg  Q   V
0
1    0   100
2    0   0
3    6
4    0   1000
5    0   2000
6    0   3000
7    0   49

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1
A2   2
A3   3
M1   4
M2   5
LS1  6   1     load  1000  0    0   0
LS2  7
LS3  8

CYCLE 2

Reg  Q   V
0
1    0   100
2    0   0
3    6
4    0   1000
5    0   2000
6    0   3000
7    7

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1
A2   2
A3   3
M1   4
M2   5
LS1  6   1     load  1000  0    0   0
LS2  7   1     load  2000  0    0   0
LS3  8

CYCLE 3

Reg  Q   V
0
1    0   100
2    0   0
3    6
4    0   1000
5    0   2000
6    0   3000
7    4        (the previous tag 7 is replaced by 4)

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1
A2   2
A3   3
M1   4   1     mult  -     -    6   7
M2   5
LS1  6   1     load  1000  0    0   0
LS2  7   1     load  2000  0    0   0
LS3  8

- The first load completes: let’s assume that it reads ’13’

CYCLE 4

Reg  Q   V
0
1    1
2    0   0
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   1     sub   100   1    0   0
A2   2
A3   3
M1   4   1     mult  13    -    0   7
M2   5
LS1  6   0
LS2  7   1     load  2000  0    0   0
LS3  8

- The first load writes on the CDB (the value 13)
- The sub goes in dispatch
- The second load is issued
- The mult can’t be issued until it gets Qj=Qk=0

CYCLE 5

Reg  Q   V
0
1    1
2    0   0
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   1     sub   100   1    0   0
A2   2
A3   3
M1   4   1     mult  13    -    0   7
M2   5
LS1  6   1     sto   3000  0    0   0
LS2  7   1     load  2000  0    0   0
LS3  8

SQ:  A     Q   V
     -     4

- The second load completes; let’s assume it reads ’11’
- The mult waits and the sub is issued
- The store goes in dispatch
  - Simultaneously we allocate one element in the SQ
  - The sub is going to conflict on the CDB with the load, so it will have to wait

CYCLE 6

Reg  Q   V
0
1    1
2    2
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   1     sub   100   1    0   0
A2   2   1     add   0     8    0   0
A3   3
M1   4   1     mult  13    11   0   0
M2   5
LS1  6   1     sto   3000  0    0   0
LS2  7   0
LS3  8

SQ:  A     Q   V
     3000  4

- The second load writes on the CDB (the value 11)
- The mult is issued, and the sub is waiting for the CDB
- The store is issued: in the SQ it gets the effective address A
  - but it cannot advance as long as Qi != 0
- The add goes in dispatch

CYCLE 7

Reg  Q   V
0
1    0   99
2    2
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   0
A2   2   1     add   0     8    0   0
A3   3   1     brch  99    0    0   0
M1   4   1     mult  13    11   0   0
M2   5
LS1  6   1     sto   3000  0    0   0
LS2  7   0
LS3  8

- The mult proceeds and the store waits
- The sub completes and updates R1 (and the CDB) with ’99’
- The add is issued
- The branch goes in dispatch

CYCLE 8

Reg  Q   V
0
1    0   99
2    0   8
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   0
A2   2   0
A3   3   1     brch  99    0    0   0
M1   4   1     mult  13    11   0   0
M2   5
LS1  6   1     sto   3000  0    0   0
LS2  7   0
LS3  8

- The mult proceeds and the store waits
- The add writes on the CDB
- The branch is issued

CYCLE 9

Reg  Q   V
0
1    0   99
2    0   8
3    0   13
4    0   1000
5    0   2000
6    0   3000
7    4

RS   Id  Busy  Op    Vj    Vk   Qj  Qk
A1   1   0
A2   2   0
A3   3   0
M1   4   1     mult  13    11   0   0
M2   5
LS1  6   1     sto   3000  0    0   0
LS2  7   0
LS3  8

- The mult completes and calculates 13*11=143
- The store waits
- The branch completes

[Figure: Tomasulo pipeline datapath: I-cache access (NIP, CIP), Decode, Regs, DISPATCH, reservation stations (A RS, M RS, LS RS), ISSUE, functional units (Integer; Mult. 1 to Mult. 4; Address Add and D-Cache with LQ and SQ), WRITE-BACK, Common Data Bus]

Reg  Q  V
0
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    0  143

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0
A2   2   0
A3   3   0
M1   4   0
M2   5
LS1  6   1     sto   3000  0   0   0
LS2  7   0
LS3  8

- The mult writes on the CDB (the value '143')
- The store gets the value '143' and can finally complete

CYCLE 10

SQ:  A     Q  V
     3000  0  143

[Figure: Tomasulo pipeline datapath: I-cache access (NIP, CIP), Decode, Regs, DISPATCH, reservation stations (A RS, M RS, LS RS), ISSUE, functional units (Integer; Mult. 1 to Mult. 4; Address Add and D-Cache with LQ and SQ), WRITE-BACK, Common Data Bus]

Reg  Q  V
0
1    0  99
2    0  8
3    0  13
4    0  1000
5    0  2000
6    0  3000
7    0  143

LQ: A

SQ: A Q V

RS   Id  Busy  Op    Vj    Vk  Qj  Qk
A1   1   0
A2   2   0
A3   3   0
M1   4   0
M2   5
LS1  6   0
LS2  7   0
LS3  8

CYCLE 11

Tomasulo: Summary
• Reservation Stations
  • Allow "out-of-order issue" based on the availability of data (e.g., sub and add issued without waiting for the mult)
• Register Renaming (tags)
  + Avoids the WAR and WAW hazards
    (especially important when few registers are available, as originally in the IBM 360)
  + Realizes a dynamic "loop unrolling"
  - Requires relatively complex logic
• Common Data Bus
  + Simultaneously broadcasts the results to several waiting instructions
  - It is a "bottleneck", but it can be replicated (of course at the cost of more hardware)
• The scheme does not handle "precise exceptions"

Tomasulo: hazard management summary

Hazard                             Management method
Structural on RS (RS finite)       Stall in the Dispatch stage (*1)
Structural on CDB (CDB occupied)   Stall in the Issue stage (*2)
Structural on FU (FU occupied)     Stall in the Issue stage (*3)
RAW                                Avoided by using the tags
WAR                                Avoided by copying operands in RS at dispatch-time
WAW                                Avoided by using SW Register Renaming

(*1) avoidable with a larger number of RSs
(*2) avoidable with a larger number of CDBs
(*3) avoidable with multiple FUs (or can be reduced with pipelined FUs)
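A minimal Python sketch of how the tags let a waiting instruction pick up its operand from the CDB. It is only an illustration under stated assumptions, not the 360/91 logic: the names `RSEntry` and `broadcast` are hypothetical, and only the Vj/Vk/Qj/Qk fields of the reservation-station tables are modelled (Q == 0 means the value is ready, otherwise Q holds the Id of the producing reservation station).

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    """One reservation-station entry (hypothetical model of the RS tables)."""
    op: str
    vj: int = 0  # value of operand j, valid only when qj == 0
    vk: int = 0  # value of operand k, valid only when qk == 0
    qj: int = 0  # Id of the RS producing operand j (0 = value ready)
    qk: int = 0  # Id of the RS producing operand k (0 = value ready)

    def ready(self) -> bool:
        # The entry can be issued only when no operand is pending.
        return self.qj == 0 and self.qk == 0


def broadcast(entries, tag, value):
    """CDB broadcast: every entry waiting on `tag` captures `value`."""
    for entry in entries:
        if entry.qj == tag:
            entry.vj, entry.qj = value, 0
        if entry.qk == tag:
            entry.vk, entry.qk = value, 0


# Assumed scenario: the store waits for the result of RS Id 4 (the mult).
store = RSEntry("sto", vj=3000, qk=4)
add = RSEntry("add", vj=0, vk=8)
entries = [store, add]

ready_before = [e.ready() for e in entries]
broadcast(entries, tag=4, value=143)  # the mult writes 143 on the CDB
ready_after = [e.ready() for e in entries]
```

In this sketch the add is ready from the start, while the store becomes ready only after the broadcast carrying tag 4, which mirrors the "store waits" notes in the cycle-by-cycle example.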

Reservation Station -- implementation

[Figure: logic of one reservation-station entry, with fields Busy, Op, Qj, Vj, Qk, Vk (the k operand is handled like the j operand). The RS No. and each Q field are compared with the CDB tag, and the CDB data is loaded into the V field through a MUX (the tag arrives 1 cycle before the data). The "=0?" checks on Qj and Qk are ANDed to produce the ready signal to the issue logic; the Busy/Full signal goes to the dispatch logic; Op, Vj and Vk go to the functional unit.]

dispatch: move to res. station
issue: move to functional unit

General organization of IBM 360/91 pipeline
• "In-order" pipeline with the following stages:
  • I-fetch, decode, address generation
• Floating point decoupled from the Integer (Fixed Point) through memory buffers
  • Effective-address generation done in the integer unit
  • A memory pipeline for loading the data

IBM 360/91 -- Floating Point Unit

From: R.M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM Journal, Jan. 1967, pp. 25-33

http://www.dii.unisi.it/~giorgi/teaching/hpca2
High Performance Computer Architecture

Branch prediction (first part)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ03-SL di 141

Stalls due to control hazards (e.g., branch)

[Figure: pipeline timing diagram (stages F D X M W) over time. instruction 1 is followed by instruction 2, a branch (e.g., bne $1, $2, label) whose condition ($1 ==? $2) is known only after the X stage; instruction 3 and instruction 4 have already entered the pipeline, so the pipeline must be flushed; instructions are then fetched from the new branch (label: instruction i, instruction i+1, instruction i+2). Branch penalty: 2 cycles.]

Introduction
• Programs are not linear sequences of instructions
• Programs are FULL of "branch"-like instructions

Example:
    if (i==0)
        k = 1;
    else
        k = 2;

i =: $1, k =: $2

          bne  $1, $0, ELSE
    THEN: addi $2, $0, 1
          j    NEXT
    ELSE: addi $2, $0, 2
    NEXT:

• In this case we would have 50% of branch/jump instructions (in this example 2 out of 4)
• On average, branch instructions are 15% of the total, i.e., 1 every 6-7 instructions

Structure of programs

Basic-Block: an instruction sequence with a single entry point AND a single exit point
(e.g., an instruction sequence with a branch or jump at the end)

Each block is about 6-7 instructions long on average

Branching
• Branch instructions are "expensive"
  • 2-3 pipeline cycles are wasted
  • 15-30 cycles in the case of wide-issue, deeply pipelined processors (e.g., Pentium 4)
  • Memory bandwidth is wasted to fetch useless instructions
  • The instruction cache is likely to miss on the taken branch
  (pipeline utilization goes down, memory bandwidth usage goes up, I-cache misses go up)
• Terminology
  • Branch Penalty: number of cycles wasted because we need to flush the instructions of the wrong path that are already in the pipeline
  • Branch Instruction Address (BIA): address of the branch instruction (in memory)
  • Branch Target Address (BTA): address of the new Program Counter (PC)
  • Branch Taken: the control goes to a PC different from PC+(1 instr.)
  • Branch Not Taken: the next instruction below the branch is executed
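As a rough worked example of this cost, the sketch below estimates the effective CPI. The base CPI of 1 and the pessimistic hypothesis that every branch pays the full penalty are assumptions made only for illustration; the 15% branch frequency and the penalty values are the ones quoted in the slides, and `cpi_with_branches` is a hypothetical helper name.

```python
def cpi_with_branches(base_cpi, branch_fraction, penalized_fraction, penalty_cycles):
    """Effective CPI = base CPI + stall cycles per instruction due to branches."""
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles


# Assumed: base CPI = 1, 15% branches, every branch pays the penalty.
cpi_short_pipeline = cpi_with_branches(1.0, 0.15, 1.0, 2)   # 2-cycle penalty
cpi_deep_pipeline = cpi_with_branches(1.0, 0.15, 1.0, 15)   # 15-cycle penalty

slowdown_short = cpi_short_pipeline / 1.0
slowdown_deep = cpi_deep_pipeline / 1.0
```

Under these assumptions a 2-cycle penalty costs about 30% of the ideal throughput, while a 15-cycle penalty more than triples the CPI, which is why deeper pipelines depend so heavily on reducing the fraction of branches that pay the penalty.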

Example (32-bit instructions)

0x1000           -----
                 -----
                 -----
0x100C           BNE R1, R2, LABEL    <- BIA
                 -----                <- next instruction if BRANCH NOT-TAKEN
                 -----
                 …
                 -----
0x374C (LABEL:)  ADD …                <- BTA, next instruction if BRANCH TAKEN
                 SUB …
                 -----
                 …

Branch cost reduction - Software solutions (1)
• Loop unrolling
  • The loop is replicated k times and the branch instructions between replicas are eliminated
  + Increases the distance between consecutive branches

[Figure: on one side the original loop (loop: body; b … loop); on the other side the unrolled loop, e.g., 3 unrollings: the body is replicated (replicas marked with *) and a single "b … loop" remains at the end]
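The idea can be sketched in Python as follows. This is only an illustration: the dot-product body, the function names, and the unrolling factor of 4 (the original body plus 3 replicas) are assumptions, and in practice the transformation is applied by the compiler to machine code rather than written by hand.

```python
def dot_rolled(a, b):
    """Original loop: one loop-back branch per element."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Body replicated 4 times: one loop-back branch every 4 elements."""
    n = len(a)
    total = 0
    i = 0
    while i + 4 <= n:
        total += a[i] * b[i]
        total += a[i + 1] * b[i + 1]
        total += a[i + 2] * b[i + 2]
        total += a[i + 3] * b[i + 3]
        i += 4
    # Remainder loop for lengths that are not a multiple of 4.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total


a = list(range(10))
b = list(range(10, 20))
rolled_result = dot_rolled(a, b)
unrolled_result = dot_unrolled4(a, b)
```

Both versions compute the same result; the unrolled one executes roughly a quarter of the loop-back branches, at the price of larger code and a remainder loop.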

Branch cost reduction - Software solutions (2)
• Instruction scheduling/Delay slot
  • After the branch there may be 1 or more delay slots

    loop: instr1
          instr2
          instr3
          b … loop
          instr4        <- DELAY SLOT
          instr5

Behaves like…

    loop: instr1
          instr2
          instr3
          instr4
          b … loop
          instr5

The "instr4" can be chosen in such a way that it does useful work inside the loop iteration

+ the branch penalty is then hidden!
+ no need to flush instructions from the pipeline (simplified hardware)
- the software MUST always find something to put as "instr4" (if no instruction is available, a NOP must be put as instr4)
- Potentially dangerous if not properly managed by the programmer/compiler
- In practice, the delay slot can NOT ALWAYS be exploited

Branch cost reduction - Hardware solutions
• Anticipate the target calculation (a dynamically calculated address) and the branch decision (see next slide)
• Branch prediction & speculative execution (few slides ahead)
  • Try to predict WHERE to branch and IF to branch
  • The instruction fetch continues from the predicted address -> the execution becomes speculative
  - Need to validate the prediction (if it is wrong, i.e., a misprediction, we must undo what was done: safe recovery)

Anticipate the decision: passing from 2 to 1 delay slot

[Figure: 5-stage MIPS pipeline datapath (PC, instruction memory, F/D, registers, sign extend, control, D/X, ALU, X/M, data memory, M/W, hazard detection unit, forwarding unit, F.Flush). The target calculation (BTA, via shift left 2) and the decision (T/N, via the "=" comparator) are highlighted.]

Example: 5-stage pipeline MIPS, BRANCH PENALTY = 1 CYCLE

Branch Prediction
• Idea: try to predict the outcome of the jump
• Choice of branch instructions to predict
  • Preferably: calibrate the predictor depending on the functionality of the branch or the context in which the branch is located
• Objective of Branch Prediction
  • Minimize the branch penalty
  • Maximize instruction throughput
  • Maximize the accuracy of the predictions
• Branch prediction Accuracy: % of correctly predicted branch instructions

  ABP = (no. of correctly predicted branch instr.) / (total no. of branch instr.)

• Elements of action:
  B1) Branch target prediction
  B2) Prediction of the outcome of the branch condition
• Prediction validation
  • It is necessary to insert mechanisms to:
    - Test 'a posteriori' if the prediction was right or wrong (misprediction)
    - Restore the pipeline situation as before the prediction (safe recovery)

B1) Branch Target Prediction
• BTB (Branch-Target Buffer) or BTAC (Branch-Target Address Cache)
  • Small cache that
    - In the "tag" field holds the Branch Instruction Address (BIA)
    - In the "data" field holds the PREDICTED Branch Target Address (BTA)
  • Located in the Fetch stage
  • At the first execution of a branch instruction -> miss -> static prediction
    - A BTB entry is allocated (similarly to what happens in a cache)
  • Then (at a next hit):
    + the (predicted) next instruction is available even BEFORE FETCHING IT
    + Automatically stores those branch instructions that are more recently or more frequently used

[Figure: the PC (Program Counter) accesses both the I-cache and the BTB; each BTB entry has a Branch Instruction Address (BIA) field ("the TAG") and a Branch Target Address (BTA) field ("the DATA"); on a hit, the Predicted Target Address is used as the new PC]

Note: depending on the cache type chosen to implement the BTB (direct access, set-associative, full-associative), the tag field will contain ALL or SOME of the BIA bits

Note2: we can use the hit-rate H_BTB to characterize the BTB performance

BTB update
• Having predicted the target, the branch is processed anyway
  • After a few cycles we will know the actual (calculated) address of the target. Let's call it "real-BTA" (real Branch Target Address)
• The calculated target address is compared with the BTA that is stored in the BTB
  • If real-BTA == BTA, no action is necessary
  • If real-BTA != BTA -> MISPREDICTION
    - In this case we need to (safe recovery):
      - Undo what was done on the wrong path
      - Jump to the correct target
• After a branch, in every case we need to update the BTB
  • In the BIA field we put the previous PC (old-PC)
  • In the BTA field we put real-BTA
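The lookup and update steps can be sketched as a small direct-mapped table. This is a minimal illustration, not a description of any real BTB: the class name `BTB`, the 4 index bits, and the choice of a direct-mapped organization are assumptions, and the addresses reuse the 0x100C/0x374C example from the earlier slide.

```python
class BTB:
    """Hypothetical direct-mapped Branch-Target Buffer (tag = upper BIA bits)."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        # Each entry is None (empty) or a (tag, predicted_bta) pair.
        self.entries = [None] * (1 << index_bits)

    def _split(self, bia):
        word = bia >> 2  # 32-bit instructions: drop the 2 byte-offset bits
        index = word & ((1 << self.index_bits) - 1)
        tag = word >> self.index_bits
        return index, tag

    def lookup(self, pc):
        """Return the predicted BTA on a hit, or None on a miss."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None

    def update(self, old_pc, real_bta):
        """After the branch resolves: BIA field <- old-PC, BTA field <- real-BTA."""
        index, tag = self._split(old_pc)
        self.entries[index] = (tag, real_bta)


btb = BTB()
first_lookup = btb.lookup(0x100C)    # first execution of the branch: miss
btb.update(0x100C, 0x374C)           # the branch resolves, the entry is allocated
second_lookup = btb.lookup(0x100C)   # next execution: hit
other_lookup = btb.lookup(0x1010)    # a different address: miss
```

Because the tag holds only the BIA bits not used as index, two addresses that share an index but differ in the tag evict each other, as in any direct-mapped cache.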

BTB location in the pipeline

[Figure: front end of the pipeline (PC, instruction memory, F/D, registers with RS1/RS2/RD, sign extend, shift left 2, "=" comparator, D/X). The Bpred+BTB block is accessed with the PC and provides T/N, BTB hit and BTA to the mux that selects the next PC; the target calculation and decision logic provides real-BTA and real-T/N.]

Note: for the sake of simplicity, the misprediction and safe recovery logic is NOT represented here

Bpred is the logic block that tries to predict the branch outcome (T/N)

http://www.dii.unisi.it/~giorgi/teaching/hpca2
High Performance Computer Architecture

Branch Prediction (second part)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ04-SL di 761

B2) Prediction of the Branch Condition outcome
• Based on the branch condition outcome, we go to the target address (Taken = T) or to the next instruction (Not-taken = N)
• There are several ways to make this type of prediction

• STATIC NOT-TAKEN (always-not-taken) PREDICTION
  • Assumes that the branch is NOT taken (N)
  + Simple to implement
  + No branch penalty when the prediction is correct (the fall-through instructions are already in the pipeline)
  • Used in the processors: Motorola 68020 (1984), VAX-11/780 (1977)
  - Not very efficient (ABP < 40%)

[Figure: the generic BPRED block takes the 32-bit PC and outputs T/N; BPRED ALWAYS NOT-TAKEN outputs the constant 0]

Branch condition prediction: static techniques
• STATIC TAKEN (always-taken) PREDICTION
  • Assumes that the branch is always TAKEN (T)
  + Simple to implement
  - 1 cycle branch penalty (the target instructions can be fetched only AFTER recognizing, in the decode stage, that the instruction is a branch)
  + Works well for loops (for an n-iteration loop, at most a fraction 1/n of the predictions is wrong)
  - Not very efficient (ABP < 70%)

[Figure: the generic BPRED block takes the 32-bit PC and outputs T/N; BPRED ALWAYS TAKEN outputs the constant 1]

Taken-branch Probability

[Figure: pie charts of taken vs. not-taken branches, benchmarks: SPEC2000. Total probability: 60% - 70% taken; branches with offset < 0: 90% taken; branches with offset > 0: 50% taken]

Static Prediction: Compiler based
• Requires that the branch instructions have a reserved bit that can be changed by the compiler in the following way:
  - This bit is set to 1 if TAKEN is considered the most probable outcome
  - This bit is set to 0 if NOT-TAKEN is considered the most probable outcome
• The compiler can decide the value of this bit according to the type of instruction or based on the result of profiling the program
  - This technique is static in the sense that the prediction is fixed for each execution of the corresponding branch instruction (even if the input data changes in a subsequent run)
  - It has been used in the Motorola 88110 and PowerPC 601 [Smith95]

Static Prediction: direction-based (BTFN)
• At compile time, the compiler calculates the OFFSET of the jump (= BTA - LC)
  - Note: the actual PC is not known at compile time, so the Location Counter LC is used
• If OFFSET > 0 then a special branch instruction is chosen
  - e.g., FB or Forward-Branch
  - This will be predicted not-taken at run time
• If OFFSET < 0 then another special branch instruction is chosen
  - e.g., BB or Backward-Branch
  - This will be predicted taken at run time
• This technique is also called Backward-Taken/Forward-Not_taken (BTFN)
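The BTFN rule and the accuracy metric ABP can be sketched together in a few lines of Python. The function names and the trace are hypothetical: the trace models a single backward loop branch that is taken 9 times and then falls through once.

```python
def btfn_predict(bia, bta):
    """Backward-Taken/Forward-Not_taken: predict taken iff the offset is negative."""
    offset = bta - bia
    return offset < 0


def accuracy(trace):
    """ABP = correctly predicted branches / total branches.

    `trace` is a list of (bia, bta, actually_taken) tuples.
    """
    correct = sum(btfn_predict(bia, bta) == taken for bia, bta, taken in trace)
    return correct / len(trace)


backward_prediction = btfn_predict(0x1020, 0x1000)  # loop-closing branch
forward_prediction = btfn_predict(0x100C, 0x374C)   # forward branch

# Assumed trace: a 10-iteration loop, the closing branch exits on the last pass.
loop_trace = [(0x1020, 0x1000, True)] * 9 + [(0x1020, 0x1000, False)]
loop_abp = accuracy(loop_trace)
```

On this trace only the loop exit is mispredicted, so the accuracy is 9 out of 10, consistent with the observation that backward branches are mostly taken.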

Static prediction comparison: always-taken, always-not_taken, direction-based

[Figure: bar chart of ABP (from 0 to 1) for the benchmarks dirent, doom, fmath, gcc, go, llong, lswlr, math, printf; three series: dirtaken, taken, nottaken. Example: benchmarks SPEC2000]

Dynamic BP: Bimodal Predictor
• The branches tend to repeat their behavior (T or N)
  • We can say that the branches are "biased" towards T or N -> Bimodal Predictor [Smith81]
• A table indexed by p bits of the BIA (or in tandem with the BTB)
• Holds the state of a saturation counter which records the predominant behavior (pattern) of the branch (T or N)
• The saturation counter has typically two bits

[Figure: p bits of the BIA (Branch Instruction Address) index the PHT (Pattern History Table); the selected j-bit state feeds the FSM, which predicts Taken or Not-taken. FSM = Finite State Machine (e.g., Moore or Mealy), with states 0/N, 1/N, 2/T, 3/T, taken/not-taken transitions and a marked start state]

The states are labeled from 0 to 3, and the prediction is taken if the state is >= 2.
A "Taken" branch (T) increments the value of the counter; a Not-taken branch (N) decrements it.

Updating the Status of the PHT – phase 2
• Once the actual outcome of the branch is available, the counter is incremented in case of Taken and decremented otherwise

[Figure: p bits of the BIA (Branch Instruction Address) index the PHT (Pattern History Table); the old j-bit state and the real T/N outcome enter the FSM (step 2), which writes the new j-bit state back into the PHT]
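The two phases (prediction, then update) can be sketched as follows. This is a minimal illustration with assumed parameters: the class name `BimodalPredictor`, p = 4 index bits, and a start state of 1 are choices made here, since the start state of the slide's FSM is not recoverable from the text.

```python
class BimodalPredictor:
    """Hypothetical PHT of 2-bit saturating counters indexed by p bits of the BIA."""

    def __init__(self, p=4, start_state=1):
        self.p = p
        self.pht = [start_state] * (1 << p)

    def _index(self, bia):
        # Drop the 2 byte-offset bits, then keep the p least significant bits.
        return (bia >> 2) & ((1 << self.p) - 1)

    def predict(self, bia):
        """Phase 1: predict taken if the counter state is >= 2."""
        return self.pht[self._index(bia)] >= 2

    def update(self, bia, taken):
        """Phase 2: increment on Taken, decrement on Not-taken, saturating at 0 and 3."""
        i = self._index(bia)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)


predictor = BimodalPredictor()
# Assumed trace: a loop branch taken 9 times then not taken, executed 3 times.
outcomes = ([True] * 9 + [False]) * 3
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x1020) != taken:
        mispredictions += 1
    predictor.update(0x1020, taken)
```

With this trace the counter is wrong once during warm-up and once at each loop exit; a single Not-taken outcome is not enough to flip the prediction for the next loop entry.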

DYNAMIC BP based on previous history

• The prediction is based on previous branch outcomes (T / N)
• Parameters
  • How many previous outcomes do we have to keep track of? -> Number of states k
  • Which prediction should be associated with a certain pattern? -> Algorithm A
• Branch History Shift Register (BHSR)
  • Each BTB entry may have an associated register which holds the k previous branch outcomes (Sj==1 -> T, Sj==0 -> N)
  • It works as a shift register: for every new outcome of the corresponding branch, this register has to be updated (in order to hold current information)

[Figure: BHSR, a SHIFT REGISTER which holds the latest k outcomes of a branch. Before the update: S1 S2 S3 … Sk-1 Sk; after the update: S2 S3 S4 … Sk Sk+1, with S1 shifted out]

Dynamic BP: prediction algorithm
• Prediction Algorithm
  • Starting from a certain value of the BHSR, the algorithm is responsible for generating the prediction (T or N)
  • The typical solution is to use a Finite State Machine (FSM), i.e., a Mealy or Moore FSM, that given a k-bit input produces a 1-bit output

[Figure: (1) PREDICTION: the k-bit BHSR feeds the FSM, which outputs the prediction, T (Taken) or N (Not-taken); (2) STATUS UPDATE: the FSM updates the BHSR]

History based BP
• The BTB is extended to associate a BHSR (Branch History Shift Register) to each entry (branch or set of branches)
• When the PC hits in the BTB, the BHSR bits are sent to the FSM

[Figure: the PC (a bits) accesses the I-cache and the BTB; each BTB entry holds the Branch Instruction Address (BIA) field (a bits), the Branch Target Address (BTA) field (a bits), which gives the speculative target address, and a k-bit BHSR; the Branch History Table (BHT = array of BHSRs) feeds the FSM, which predicts Taken or Not-taken]

FSM options: k=1
• With only 1 bit of history (k = 1), a possible FSM is the following
  • It remembers the last direction taken
  • If the prediction is correct, it remains in the current state
  • If the prediction is not correct, it changes state (and outcome)

[Figure: two-state FSM; states labeled History/Prediction (T/T and N/N), transitions labeled with the actual direction (T or N); the initial state is marked]

2-bit FSM
• With two history bits (k=2), this is a possible FSM
  • The prediction changes in case of two consecutive mispredictions
  • Otherwise it repeats the same prediction

[Figure: four-state FSM; states labeled History/Prediction (TT/T, NT/N, TN/T, NN/N), transitions labeled with the actual direction (T or N); the initial state is marked]
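To see the practical difference between 1 and 2 bits, the sketch below counts the mispredictions of both on a loop branch. Note the assumptions: the exact transitions of the slide's 2-bit FSM are not recoverable from the text, so a saturating counter is used in its place as one FSM with the same "two consecutive mispredictions" property; the function name, the initial state (strongly taken) and the trace are hypothetical.

```python
def count_mispredictions(bits, outcomes):
    """Simulate an n-bit saturating counter on a list of T/N outcomes (True = taken)."""
    top = (1 << bits) - 1
    threshold = (top + 1) // 2   # predict taken when counter >= threshold
    counter = top                # assumed initial state: strongly taken
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            misses += 1
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return misses


# Assumed trace: a 10-iteration loop (9 taken, 1 not taken) executed 3 times.
loop_trace = ([True] * 9 + [False]) * 3
one_bit_misses = count_mispredictions(1, loop_trace)
two_bit_misses = count_mispredictions(2, loop_trace)
```

With 1 bit, every loop exit also causes a misprediction at the next loop entry (two per execution after the first); with 2 bits only the exit itself is mispredicted.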

How many history bits? (Lee & A.Smith) [Lee84]
• 26 programs from six different types of workload (WL)
  - Three different machines (IBM 370, DEC PDP-11, CDC 6400)
  - On average, 67.6% of the branches are "taken"
  - Prediction based on the opcode: ABP = 55.2% - 79.8%
  - Prediction that uses 1 bit of history: ABP = 79.7% - 96.5%
  - Prediction that uses 2 bits of history: ABP = 83.4% - 97.5%
  - Prediction that uses 3 bits of history: ABP = 83.5% - 97.7%
  - Prediction that uses 4 bits of history: ABP = 83.7% - 98.1%
  - Prediction that uses 5 bits of history: ABP = 83.9% - 98.2%

-> 2 bits are sufficient for an effective prediction

• BTB: test with A=1…256, C=1…4KB, WL=IBM/CPL-mix
  • E.g., with a 4-way set-associative BTB (A=4), 128 sets (C=512B) -> hit rate H_BTB = 86.5%
  • A'BP = 93.8%, but among the latter, PCH = 4.2% change target (e.g., "case" statements)
• Combining ABP and H_BTB
  • ABP = (A'BP - PCH) * H_BTB = (0.938 - 0.042) * 0.865 -> about 78% are correctly predicted
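The combination of the two figures can be checked with a few lines of arithmetic. The variable names are hypothetical; the three input values are the ones quoted from [Lee84] in the slide.

```python
a_bp_on_hit = 0.938    # A'BP: direction accuracy for branches that hit in the BTB
p_change = 0.042       # PCH: fraction whose target changes (e.g., "case" statements)
h_btb = 0.865          # H_BTB: BTB hit rate

# A branch is fully predicted only if it hits in the BTB, its direction is
# right, and its stored target is still the correct one.
combined_abp = (a_bp_on_hit - p_change) * h_btb
```

The product evaluates to roughly 0.775, which the slide rounds to "about 78%".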

Type of Algorithm (Optimal FSM) [Nair95]
• Searching for an optimal algorithm for a 2-bit predictor
  • A 2-bit FSM can be built in 2^20 possible ways
  • The algorithms have been tested with the SPEC-89 suite on an IBM RS/6000
  • After eliminating the uninteresting cases, 5248 FSMs remain
• Conclusions
  1) Identification of the optimal FSM (that maximizes ABP)
     - The accuracy of these predictors varies from 87.1% to 97.2%
  2) Comparison with the "saturating counter" FSM
     - In three cases the optimal predictor coincides with the saturating counter
     - In the remaining cases, the optimal FSM is very close to the saturating counter

[Figure: saturating-counter FSM with states 0/N, 1/N, 2/T, 3/T, taken/not-taken transitions and a marked start state]

The states are labeled from 0 to 3, and the prediction is taken if the state is >= 2.
A "Taken" branch (T) increments the value of the counter; a Not-taken branch (N) decrements it.
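One way to arrive at the count of 2^20 possible 2-bit FSMs is sketched below. The counting model is an assumption made here for illustration (a Moore machine with 4 states, where each state has a fixed T/N prediction and one successor for each actual outcome); the slide itself does not spell out the derivation.

```python
states = 4      # a 2-bit predictor has 2^2 = 4 states
outcomes = 2    # each state observes an actual outcome: T or N

# Every (state, outcome) pair can point to any of the 4 states.
transition_tables = states ** (states * outcomes)   # 4^8

# Every state independently predicts T or N.
prediction_assignments = 2 ** states                 # 2^4

total_fsms = transition_tables * prediction_assignments
```

Under this model the product is 4^8 * 2^4 = 2^16 * 2^4 = 2^20, i.e., about one million machines, most of which are equivalent to others or uninteresting, which is consistent with the pruning to 5248 candidates.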

Optimal Algorithm for 6 benchmarks

Benchmark   Optimal   Counter   Optimal FSM
spice2g6    97.2      97.0      [FSM diagram]
doduc       94.3      94.3      Saturating Counter
gcc         89.1      89.1      Saturating Counter
espresso    89.1      89.1      Saturating Counter
li          87.1      86.8      [FSM diagram]
eqntott     87.9      87.2      [FSM diagram]

(Legend of the FSM diagrams: branch taken, branch not taken, start state, predict taken, predict not taken)

• The results are in accordance with other previous studies

Limits of the previous predictors
• The predictions are made by considering the history of a single branch
  • Is there influence from the other branches?
  • Experimentally, it is found that a given branch IS influenced by the preceding branches
• Previously, the dynamic context of the branch was not considered
  • In particular, the path before arriving at a given branch is relevant
  • The prediction algorithm may be adapted depending on the path where the execution comes from

-> More accurate predictions are obtained by taking into account the history of other branches and adapting the algorithm

“Two-level adaptive” Branch Prediction (1)
• Scheme introduced by Yeh and Patt [Yeh91][Yeh92]
  • On 9 SPEC-89 benchmarks (floating point (fp): doduc, fpppp, matrix300, spice2g6, tomcatv; integer (int): eqntott, espresso, gcc, li) with an M88110 simulator, the technique reaches on average ABP=97%, while other techniques reach only 94.4%
• The implementation consists of two sets of tables
  1) To record the history of the outcomes of a given jump (like the other schemes) -> BHT (Branch History Table): each branch has its BHSR
  2) To record the behavior on a given history pattern -> each branch has a PHT (Pattern History Table) that maintains (for each possible pattern) the current state of the FSM

“Two-level adaptive” Branch Prediction (2)
• Prediction scheme (phase 1)
  - A certain number p of bits of the address of the branch (BIA) is used to select the PHT related to the branch (or set of branches)
  - A number m of bits of the branch address (BIA) is used to select, within the BHT, the BHSR related to a branch (or set of branches); this BHSR serves to index the corresponding PHT, to get the j bits of the current state of the FSM
• The output logic of the FSM produces the prediction

[Figure: m bits of the Branch Instruction Address (BIA, a bits) select a BHSR in the Branch History Table (BHT, array of BHSRs: 2^m entries, k bits each); p bits of the BIA select one of the sets of the Pattern History Table (PHT: 2^p sets of 2^k entries, j bits each); the k-bit BHSR content (e.g., 11101) indexes the entry (00…00 to 11…11), whose j PHT bits (old state) go to the FSM output logic, which produces the prediction. p ≤ m]

If p == m then each BHSR has a corresponding PHT, which is indexed by the BHSR content

If p and/or m are == a, then the BHSR or PHT information exists for each branch

In general, the selection can be made not only from the address but also, for example, depending on the type of opcode -> instead of 2^m, there are s BHSRs

Page 128: High Performance Computer Architecturegiorgi/teaching/... · 13,14: Bus architecture API=Application Program Interface ... • MFLOPS/GFLOPS: MILLION/BILLION OF FLOATING POINT OPERATIONS

“Two-level adaptive” Branch Prediction (3)
• Predictor update (phase 2)
• Done when the branch is finally resolved (the target, and consequently the direction, is known with certainty)
• 1a) The BHSR is updated by shifting it left and inserting the outcome, T or N
• 1b) The PHT entry is updated with the new state computed by the FSM state logic (depending on the outcome, T or N)

[Figure: same organization as in phase 1 (BHT: array of 2^m BHSRs, k bits each; 2^p PHTs of 2^k entries, j bits each; p ≤ m). The branch direction (T or N) is shifted into the selected BHSR (1a); the FSM state logic takes the old j-bit state and the direction and writes the new state back into the PHT entry (1b).]
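The two phases (prediction, then update) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `TwoLevelPredictor` is hypothetical, the low-order BIA bits are used for both selections, and the FSM is assumed to be a j-bit saturating counter whose most significant bit gives the prediction.

```python
class TwoLevelPredictor:
    """Sketch of a two-level adaptive predictor: 2^m BHSRs (k bits), 2^p PHTs (2^k entries, j bits)."""

    def __init__(self, m, p, k, j=2):
        if p > m:
            raise ValueError("the scheme requires p <= m")
        self.m, self.p, self.k, self.j = m, p, k, j
        self.bht = [0] * (1 << m)                      # array of BHSRs, all histories start at 0
        weak_not_taken = (1 << (j - 1)) - 1            # assumed initial FSM state
        self.pht = [[weak_not_taken] * (1 << k) for _ in range(1 << p)]

    def _select(self, bia):
        # Assumption: the m and p selection bits are the least significant bits of the BIA.
        return bia & ((1 << self.m) - 1), bia & ((1 << self.p) - 1)

    def predict(self, bia):
        bhsr_idx, pht_idx = self._select(bia)
        state = self.pht[pht_idx][self.bht[bhsr_idx]]
        return state >= (1 << (self.j - 1))            # output logic: True means Taken

    def update(self, bia, taken):
        bhsr_idx, pht_idx = self._select(bia)
        history = self.bht[bhsr_idx]
        state = self.pht[pht_idx][history]
        # (1b) FSM state logic: saturating increment on T, decrement on N
        if taken:
            state = min(state + 1, (1 << self.j) - 1)
        else:
            state = max(state - 1, 0)
        self.pht[pht_idx][history] = state
        # (1a) shift the BHSR left and insert the outcome
        self.bht[bhsr_idx] = ((history << 1) | int(taken)) & ((1 << self.k) - 1)
```

With m = 0 and p = 0 the sketch behaves as a GAg predictor; after a short warm-up it tracks a strictly alternating T/N branch, because each history pattern maps to its own counter.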


“Two-level adaptive” Branch Prediction (4)
• This scheme is very general; a specific implementation chooses 1) m (or s), 2) p and 3) the algorithm

1) Implementation of the BHT (each BHSR always has k bits)
• G (Global, m=0) – one BHSR shared by all BIA addresses
• P (Per-address or individual, 0<m<=a) – each of the 2^m BHSRs is associated with the group of addresses identified by the m least significant bits of the BIA
• S (per-Set, using s instead of m) – each of the s BHSRs is associated with a set of branches

2) Implementation of the PHT
• g (Global, p=0) – a single PHT is used for all BIA addresses
• p (Per-address or individual, p=a) – each PHT is dedicated to a single BIA
• s (Shared, 0<p<a) – each PHT is dedicated to a set of BIA addresses

3) Implementation of the algorithm
• When the state of the FSM is updated dynamically, the algorithm is called Adaptive and is denoted by A

s = number of branch sets; it can be determined by the opcode of the instructions, by a class of branches (identified by the compiler), or by the least significant bits of the BIA


Possible implementations of 2-level predictors

Name   Description
GAg    Global Adaptive branch prediction using one global pattern history table
GAs    Global Adaptive branch prediction using per-set pattern history tables
GAp    Global Adaptive branch prediction using per-address pattern history tables
PAg    Per-address Adaptive branch prediction using one global pattern history table
PAs    Per-address Adaptive branch prediction using per-set pattern history tables
PAp    Per-address Adaptive branch prediction using per-address pattern history tables
SAg    Per-Set Adaptive branch prediction using one global pattern history table
SAs    Per-Set Adaptive branch prediction using per-set pattern history tables
SAp    Per-Set Adaptive branch prediction using per-address pattern history tables


2-level schemes with Global history

[Figure: GAg, GAs and GAp. A single k-bit BHSR indexes the PHT; GAs selects among the PHTs with p bits of the BIA, GAp with all a bits.]

• All branches use the same BHSR to hold the history
• g: all branches use the same PHT
• s: each PHT is shared by a set of branches
• p: each branch has an associated PHT


2-level schemes with Per-address history

[Figure: PAg, PAs and PAp. m bits of the BIA select one of the 2^m k-bit BHSRs, which indexes the PHT; PAs selects among the PHTs with p bits of the BIA, PAp with all a bits.]

• Each branch has its own history; there are 2^m BHSRs
• g: all branches use the same PHT
• s: each PHT is shared by a set of branches
• p: each branch has an associated PHT


2-level schemes with per-Set history

[Figure: SAg, SAs and SAp. s selects one of the k-bit BHSRs and the corresponding PHT group; within the group, SAs selects the PHT with p bits of the BIA, SAp with all a bits.]

• BHSRs are shared by a set of branches; there are s such sets
• g: the PHT selected by s is shared by all the branches that belong to the set selected by s
• s: within the PHT group selected by s, p selects a PHT, which is then shared by the branches associated with p
• p: the PHT group selected by s has a PHT for each branch


Cost of implementation

Scheme name       BHSR elements   PHT tables   Cost estimation (bits)
GAg(k)            1               1            k + 2^k * j
GAs(k, 2^p)       1               2^p          k + 2^p * 2^k * j
GAp(k)            1               2^a          k + 2^a * 2^k * j
PAg(k)            2^m             1            2^m * k + 2^k * j
PAs(k, 2^p)       2^m             2^p          2^m * k + 2^p * 2^k * j
PAp(k)            2^m             2^a          2^m * k + 2^a * 2^k * j
SAg(k)            s               s x 1        s * k + s * 2^k * j
SAs(k, s x 2^p)   s               s x 2^p      s * k + s * 2^p * 2^k * j
SAp(k)            s               s x 2^a      s * k + s * 2^a * 2^k * j

• The history length is k bits

Cf. [Yeh93-isca]

s = number of branch sets; it can be determined by the opcode of the instructions, by a class of branches (identified by the compiler), or by the least significant bits of the BIA
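The cost column follows a single pattern: (number of BHSRs) x k bits of history plus (number of PHTs) x 2^k entries of j bits. The sketch below encodes that pattern; the function names and the way the scheme name is decoded are hypothetical choices made for illustration.

```python
def predictor_cost_bits(bhsr_count, pht_count, k, j):
    """Storage in bits: bhsr_count histories of k bits plus pht_count tables of 2^k j-bit entries."""
    return bhsr_count * k + pht_count * (1 << k) * j


def scheme_cost_bits(scheme, k, j, a=0, m=0, p=0, s=1):
    """Cost of a scheme named like 'GAg', 'PAs', 'SAp' (first letter: BHT type, last letter: PHT type)."""
    bhsr_count = {"G": 1, "P": 1 << m, "S": s}[scheme[0]]
    phts_per_group = {"g": 1, "s": 1 << p, "p": 1 << a}[scheme[-1]]
    groups = s if scheme[0] == "S" else 1      # per-Set schemes replicate the PHTs s times
    return predictor_cost_bits(bhsr_count, groups * phts_per_group, k, j)
```

For example, `scheme_cost_bits("GAg", k=12, j=2)` gives 12 + 4096 x 2 = 8204 bits, and `scheme_cost_bits("PAg", k=10, j=3, m=10)` gives 1024 x 10 + 1024 x 3 = 13312 bits.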


Performance of Global history (GAx) scheme

[Figure: average prediction accuracy (0.88 to 1) versus p = log2(number of PHTs), from 0 to 10, for the GAx schemes; curves for the fp and int benchmarks with k = 4-, 8- and 12-bit BHSR.]

From: [Yeh93]. p=0: GAg; p!=0: GAs; p→∞: GAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv


Performance of Per-address history (PAx) scheme

[Figure, first plot: average prediction accuracy (0.88 to 1) versus p = log2(number of PHTs), from 0 to 10, for the PAx schemes; curves for the fp and int benchmarks with k = 4-, 8- and 12-bit BHSR.]

[Figure, second plot: average prediction accuracy (0.88 to 1) versus k = branch history length (bits), from 2 to 18, for the PAx schemes; curves for the fp and int benchmarks with 1 PHT (p=0), 16 PHTs (p=4) and 256 PHTs (p=8).]

From: [Yeh93]. p=0: PAg; p!=0: PAs; p→∞: PAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv


Performance of per-Set history (SAx) scheme

[Figure: average prediction accuracy (0.88 to 1) versus p = log2(number of PHTs in each set), from 0 to 10, for the SAx schemes; curves for the fp and int benchmarks with k = 4-, 8- and 12-bit BHSR.]

From: [Yeh93]. p=0: SAg; p!=0: SAs; p→∞: SAp

Benchmarks: SPEC89. INT: eqntott, gcc, espresso, li. FP: doduc, fpppp, matrix300, spice2g6, tomcatv


Predictor comparison (SPECint95, 8KB)

From: Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press 2008

“Hyb” = Hybrid predictor (McFarling, see next slides)


Predictor comparison (SPECint95, 64KB)

From: Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press 2008


Branch Folding
• Technique that reduces the "misprediction penalty"
• The idea is to "replace" the jump with the target instruction ("branch folding": the jump is folded away and disappears)
• The technique is used in the PowerPC-601 (1993), which analyzes the last 4 instructions in the (already fetched) instruction queue

STANDARD SITUATION: correct prediction, branch penalty = 0
B       F  D  X  M  W
IP         F  D  X  M  W
IP+1          F  D  X  M  W
IP+2             F  D  X  M  W

STANDARD SITUATION: wrong prediction, misprediction branch penalty = 2
B       F  D  X  M  W
IP         F  D  O  O  O
IP+1          F  O  O  O  O
IT               F  D  X  M  W
IT+1                F  D  X  M  W

BRANCH FOLDING APPLIED TO «JUMPS»: branch penalty = 0
J       F  S
IT         F  D  X  M  W
IT+1          F  D  X  M  W
IT+2             F  D  X  M  W

O = pipeline bubble
S = squashing: the J is abandoned and the fetch continues from the target address (BTA)
If the jump is unconditional, it is always Taken


Advanced Branch Folding [Kavi97]
• Store the Target INstruction (TIN) next to the Branch Target Address (BTA); the TIN is the instruction that would be fetched in case of misprediction
• In this way the penalty can be reduced to 1
• It can become -1 for the jumps!

ADVANCED BRANCH FOLDING APPLIED TO «BRANCH» (conditional branch): branch penalty = 1
B         F  D  X  M  W
IP           F  D  O  O  O
IP+1            F  S
TIN (IT)           D  X  M  W
IT+1               F  D  X  M  W

ADVANCED BRANCH FOLDING APPLIED TO «JUMP» (unconditional jump): branch penalty = -1
J         F  S
TIN (IT)     D  X  M  W
IT+1         F  D  X  M  W
IT+2            F  D  X  M  W

S = squashing


Correlation Predictors: gselect and gshare
• To generate the PHT index, Pan and (later) McFarling suggested using a combination of the address of the branch (BIA) and the global history (BHSR)
• The simplification is NOT to use a BHT with several elements
• gselect [Pan92]
- Some bits of the BIA are combined with bits of the global history
• gshare [McFarling93]
- The bits of the BIA are "mixed" (hashed) with those of the global history
- The "mixing" function is usually the XOR operation

Branch address (BIA)   Global history (BHSR)   PHT index gselect 4/4   PHT index gshare 8/8
00000000               00000001                00000001                00000001
00000000               00000000                00000000                00000000
11111111               00000000                11110000                11111111
11111111               10000000                11110000                01111111

gselect 4/4 index = (BIA[7:4], BHSR[3:0]); gshare 8/8 index = BIA[7:0] XOR BHSR[7:0]
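The two index functions in the table can be written directly as bit operations. The sketch below follows the formulas given under the table literally (BIA bits 7-4 concatenated with BHSR bits 3-0 for gselect 4/4, an 8-bit XOR for gshare 8/8); the function names are hypothetical.

```python
def gselect_4_4(bia, bhsr):
    """PHT index = (BIA[7:4], BHSR[3:0]): 4 address bits concatenated with 4 history bits."""
    return (((bia >> 4) & 0xF) << 4) | (bhsr & 0xF)


def gshare_8_8(bia, bhsr):
    """PHT index = BIA[7:0] XOR BHSR[7:0]."""
    return (bia ^ bhsr) & 0xFF


# The last two table rows: gselect maps both to the same index, gshare keeps them apart.
row3 = (gselect_4_4(0b11111111, 0b00000000), gshare_8_8(0b11111111, 0b00000000))
row4 = (gselect_4_4(0b11111111, 0b10000000), gshare_8_8(0b11111111, 0b10000000))
```

The last two rows show the point of the comparison: with the same 8-bit index, gselect cannot tell the two histories apart (both map to 11110000), whereas gshare produces two distinct indices.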


gselect [Pan92]
• m bits of the branch address (BIA) are juxtaposed (concatenated) with the k bits of the BHSR
• Very simple scheme: 1 BHSR and 1 small PHT

[Figure: BIA[m-1:0] and BHSR[k-1:0] are concatenated into the (k+m)-bit index (BHSR[k-1:0], BIA[m-1:0]) of a PHT with 2^(k+m) entries of j bits, which yields the prediction.]


gshare [McFarling93]
• m bits of the branch address (BIA) are XOR-ed with the BHSR
• Very simple scheme: 1 BHSR and 1 small PHT
• Used in the Alpha 21264

[Figure: BIA[m-1:0] XOR BHSR[k-1:0] forms the max{k, m}-bit index of a PHT with 2^max{k, m} entries of j bits, which yields the prediction. Usually k == m.]


Performance of gshare, gselect, GAg

[Figure: predictor accuracy (%), from 84 to 98, versus predictor size (bytes), from 32 to 64K, for gshare, gselect and global.]

From: [McFarling93]

“global” refers to the GAg predictor of Yeh and Patt

The benchmarks are the same 9 SPEC-89 programs used in [Yeh92] and [Yeh93-ics]


Competitive Predictors (“Tournament”)
• Some predictors work well with certain branch types
- The first proposal to use two competing predictors is by McFarling [McFarling93] (bimodal + gshare)
• Dynamically select between the predictions of the two predictors
- Use their history to select a predictor
• Example: Alpha 21264
• Bpred1=PAg, Bpred2=GAg. Total predictor size: 29K bits
• ABP = 97.4% (SPEC89 average), 99.9% (SPECfp95 average), 99% (SPECint95 average)

[Figure: Bpred 1 (with its BHT) and Bpred 2 produce two predictions from the BIA and the path history; the predictor selection logic drives the MUX that outputs the final prediction.]
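A minimal sketch of the selection logic, assuming a table of 2-bit saturating "choice" counters indexed by the BIA (McFarling's original arrangement; the Alpha 21264 indexes its choice table differently). The class names are hypothetical, and the two component predictors are deliberately trivial so that the example stays self-contained.

```python
class Static:
    """Trivial component predictor that always answers the same direction."""

    def __init__(self, taken):
        self.taken = taken

    def predict(self, bia):
        return self.taken

    def update(self, bia, taken):
        pass  # nothing to learn


class Tournament:
    """Chooses between two predictors with 2-bit saturating counters (>= 2 selects pred2)."""

    def __init__(self, pred1, pred2, entries=4096):
        self.pred1, self.pred2 = pred1, pred2
        self.entries = entries
        self.choice = [1] * entries            # start weakly in favour of pred1

    def predict(self, bia):
        use_second = self.choice[bia % self.entries] >= 2
        return (self.pred2 if use_second else self.pred1).predict(bia)

    def update(self, bia, taken):
        ok1 = self.pred1.predict(bia) == taken
        ok2 = self.pred2.predict(bia) == taken
        i = bia % self.entries
        # The counter moves only when exactly one of the two predictors was right.
        if ok2 and not ok1:
            self.choice[i] = min(self.choice[i] + 1, 3)
        elif ok1 and not ok2:
            self.choice[i] = max(self.choice[i] - 1, 0)
        self.pred1.update(bia, taken)
        self.pred2.update(bia, taken)
```

Any pair of objects exposing `predict(bia)` and `update(bia, taken)` can be plugged in as `pred1` and `pred2`.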


References
• [Smith95] J. E. Smith, S. Weiss, “PowerPC 601 and Alpha 21064: A Tale of Two RISCs”, IEEE Computer, June 1995, pp. 46-48.
• [Smith81] J. E. Smith, “A study of branch prediction strategies”, in Proc. of the 8th Annual Symposium on Computer Architecture, May 1981, pp. 135-148.
• [Lee84] J. K. F. Lee, A. J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design”, IEEE Computer, Jan. 1984, pp. 6-22.
• [Nair95] R. Nair, “Optimal 2-bit branch predictors”, IEEE Transactions on Computers, May 1995, pp. 698-702.
• [McFarling93] S. McFarling, “Combining branch predictors”, Technical Report TN-36, Digital Western Research Laboratory, June 1993.
• [Pan92] S. T. Pan, K. So, J. T. Rahmeh, “Improving the accuracy of dynamic branch prediction using branch correlation”, in Proc. of ASPLOS V, Boston, MA, Oct. 1992, pp. 76-84.
• [Yeh91] T. Yeh, Y. N. Patt, “Two-Level Adaptive Training Branch Prediction”, in Proc. of the International Symposium on Microarchitecture, Dec. 1991, pp. 51-61.
• [Yeh92] T. Yeh, Y. N. Patt, “Alternative implementations of two-level adaptive branch prediction”, in Proc. of the 19th Annual ISCA, May 1992, pp. 124-134.
• [Yeh93] T. Yeh, Y. N. Patt, “A comparison of dynamic branch predictors that use two levels of branch history”, in Proc. of the 20th Annual ISCA, May 1993, pp. 257-266.
• [Kavi97] K. M. Kavi, “Branch folding for conditional branches”, IEEE CS Technical Committee on Computer Architecture (TCCA) Newsletter, Dec. 1997, pp. 4-7.
• [Oklobdzija08] V. G. Oklobdzija, “The Computer Engineering Handbook - Digital Systems and Applications, 2nd Ed”, CRC Press, 2008.


Multiple predictions
• Branch instructions can be close to one another
• Higher performance can be achieved by fetching multiple instructions
- In that case we need to predict multiple targets in the same cycle
• In the next example we try to obtain two predictions
• We assume that we have a 2-level GAg predictor
- (GAg has the advantage of not needing the BIA)

[Figure: the global BHSR (k bits) indexes the PHT and an FSM produces the primary prediction; the k-1 least significant bits of the BHSR select two adjacent PHT entries, each with its own FSM, and a MUX driven by the primary prediction (0 or 1) picks the secondary prediction.]

With k bits we select an element of the PHT (primary prediction). With the k-1 least significant bits we select the two possible next elements of the PHT, and we use the primary prediction to choose between their two outcomes (secondary prediction)
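The double lookup can be sketched as follows. The sketch assumes a PHT of 2-bit saturating counters (counter >= 2 means Taken) and a history register that shifts left with the newest outcome in bit 0; the function name `predict_two` is hypothetical.

```python
def predict_two(pht, bhsr, k):
    """Return (primary, secondary) predictions from a GAg PHT of 2-bit counters."""
    primary = pht[bhsr & ((1 << k) - 1)] >= 2
    # After the primary branch the history will be (low k-1 bits, outcome):
    # read both candidate entries and let the primary prediction select one.
    low = bhsr & ((1 << (k - 1)) - 1)
    if_not_taken = pht[(low << 1) | 0] >= 2
    if_taken = pht[(low << 1) | 1] >= 2
    secondary = if_taken if primary else if_not_taken
    return primary, secondary
```

The secondary prediction equals what a second, sequential lookup would return if the primary prediction were shifted into the history first; here both candidate entries are read in the same step.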


Loop predictor
• Goal: predicting the outcome of the n-th branch in an n-iteration loop
• A history-based predictor would require a BHSR with k >= n
• The loop predictor consists of a table (similar to the BHT) whose elements have the fields shown below
• The limit field stores the number of iterations of a previous loop execution (n)
• The prediction field stores the corresponding outcome (T or N)
• The count field tracks the number of iterations of a new loop instance
• It works well only for loops (not for other situations), so other generic predictors are needed anyway

Entry fields: Count | Limit | Prediction (T/N)

[Figure: Count is incremented (+1) at each iteration, compared (=) with Limit and zeroed once ‘limit’ is reached; the comparison detects the n-th iteration (= ‘limit’) and, together with the Prediction field, produces the prediction.]

• Count: current loop count (starts from zero and goes up to ‘limit’)
• It produces sequences like TTTTTTN TTTTTTN ... or NNNNNNT NNNNNNT ... (period n)

Used in Pentium-M and Pentium-4 [Gochman03]
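A single table entry can be sketched as below. The slide does not say how `limit` is learned, so the sketch simply takes it as given; the class name `LoopEntry` and the convention that the stored prediction is the direction of the first n-1 outcomes are assumptions made for illustration.

```python
class LoopEntry:
    """One loop-predictor entry with the Count / Limit / Prediction fields."""

    def __init__(self, limit, prediction=True):
        self.count = 0                 # current loop count, starts from zero
        self.limit = limit             # n, assumed known from a previous loop execution
        self.prediction = prediction   # usual direction (True = T)

    def predict(self):
        # The n-th outcome of the loop is the opposite of the usual direction.
        at_limit = (self.count + 1 == self.limit)
        return (not self.prediction) if at_limit else self.prediction

    def update(self):
        self.count += 1                # incremented at each iteration
        if self.count == self.limit:
            self.count = 0             # zeroed once 'limit' is reached


entry = LoopEntry(limit=7)
sequence = ""
for _ in range(14):
    sequence += "T" if entry.predict() else "N"
    entry.update()
```

With `limit=7` the loop at the bottom builds the string `TTTTTTNTTTTTTN`, i.e. the first pattern shown on the slide.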


Trace Cache [Rotenberg96]
• It captures sequences of basic blocks (dynamic traces)
• It replaces the classical instruction cache
• It is indexed by the BTA (Branch Target Address)
• The elements are basic blocks that are assembled dynamically, while the processor executes the program instructions
• On a trace-cache hit
- A sequence of several basic blocks is fetched (i.e. containing several “taken branches”)
• Design simplifications
- The processor does not need to fetch several branch targets
- No need to predict the basic-block sequence!
- No need for a multi-ported access

[Figure: control-flow graph with basic blocks B1, B2, B3, B4; the executed path B1 B3 B4 goes into the trace cache.]

• When the code is executed the first time, the trace cache stores the basic blocks AND their sequence
• The next time the beginning of this sequence hits, there is no need to predict it: it is just loaded for execution

Used in the Pentium-4


Further References
[Alpha99] Compaq Computer Corporation, Alpha 21264 Microprocessor Hardware Reference Manual, 1999.
[Boggs04] Boggs D., et al. The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology. Intel® Technology Journal, Vol 08, Issue 01, February 18, 2004.
[Gochman03] Gochman S., et al. The Intel® Pentium® M Processor: Microarchitecture and Performance. Intel® Technology Journal, Vol 07, Issue 02, May 21, 2003.
[Hennessy02] Hennessy, J. L. and Patterson, D. A. Computer Architecture: a Quantitative Approach. 3rd Edition. Morgan Kaufmann Publishers Inc., 2002.
[Kaeli91] D. R. Kaeli and P. G. Emma. Branch history table prediction of moving target branches due to subroutine returns. In Proc. ISCA-18, pages 34–41, May 1991.
[Shen02] Shen J. P., Lipasti M. Modern Processor Design, McGraw Hill Higher Education; Beta Ed edition (November 1, 2002).
[Yeh93-ics] Yeh, T., Marr, D. T., and Patt, Y. N. 1993. Increasing the instruction fetch rate via multiple branch prediction and a branch address cache. In Proceedings of the 7th International Conference on Supercomputing (Tokyo, Japan, July 19 - 23, 1993).
[Uht95] Uht, A. K., Sindagi, V., Hall, K. Disjoint eager execution: an optimal form of speculative execution. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Dec. 1995).
[Rotenberg96] E. Rotenberg, S. Bennett, and J. E. Smith. Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. 29th International Symposium on Microarchitecture, Dec. 1996.


EXTRA SLIDES


Branch Misprediction Recovery (1)
• Dynamic branch prediction consists of two parts
- A front-end part that performs the speculation in the early pipeline stages
- A back-end part that performs the validation in the later pipeline stages
• Speculating on branches
• While fetching instructions from the predicted path, another branch may be encountered
- E.g. the predictor suggests “Taken” for branch b1
- The processor fetches b2 before b1 is resolved
• Solutions
- Wait until b1 is resolved before predicting b2? Wasted resources…
- Predict b2 even though b1 is not resolved? Recovery management becomes more complicated in case of misprediction

[Figure: binary tree of T/N outcomes for the branches b1, b2 and b3.]


Branch Misprediction Recovery (2)
• How to recover in this case
• We would like to recover from a situation with multiple mispredictions
• The more complex situation: first branch mispredicted, second one correct
• Example: we are speculating on 3 branches - b1, b2, b3
- The predictions are highlighted with a dashed line
- The instructions of every predicted path are resident in the processor
• Idea: every instruction on a given speculative path is assigned a TAG
- Every speculative path has its own tag (Tag1, Tag2, Tag3)

[Figure: T/N outcome tree for b1, b2, b3 with the predicted paths labelled (Tag 1), (Tag 2) and (Tag 3).]


Branch Misprediction Recovery (3)
• Branch validation
• When the branch is resolved (direction and target are known)
- CORRECT PREDICTION
  - The tag is removed and the instructions on that path become non-speculative
- WRONG PREDICTION (MISPREDICTION)
  - The wrong path is stopped and flushed from the pipeline
  - All subsequent speculative paths must be removed as well
  - The correct path is executed from its beginning by inserting it into the pipeline
• Example
- The second branch was mispredicted
- All instructions with Tag2 and Tag3 must be removed

[Figure: T/N outcome tree in which the paths labelled (Tag 2) and (Tag 3) are discarded; the note “Restart from here!” marks the correct path after the second branch.]
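The validation rule can be sketched on a simple list of in-flight instructions. The representation is an assumption made for illustration: each instruction is paired with an integer tag (or `None` once it is non-speculative), and tags are assumed to increase with the age of the speculative path, so that "subsequent" paths are those with a larger tag. The function name `resolve_branch` is hypothetical.

```python
def resolve_branch(window, tag, correct):
    """Apply the validation rule to a list of (instruction, tag) pairs.

    correct=True : the tag is removed, its instructions become non-speculative.
    correct=False: the path with this tag and all later speculative paths are flushed.
    """
    if correct:
        return [(ins, None if t == tag else t) for ins, t in window]
    return [(ins, t) for ins, t in window if t is None or t < tag]


window = [("i0", None), ("i1", 1), ("i2", 2), ("i3", 3)]
after_mispredict = resolve_branch(window, tag=2, correct=False)
after_confirm = resolve_branch(window, tag=1, correct=True)
```

In the slide's example the second branch is mispredicted, so `after_mispredict` keeps only the non-speculative instruction and the one carrying Tag1.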


Example - PowerPC 604 (3)
• BTAC hit
- Indicates the presence of a branch in the fetch queue
- The target address read from the BTAC is used in the next cycle
• In the second cycle the BHT is consulted
- The prediction must be “taken” if there was a BTAC hit in the previous cycle
• What if the two predictions do NOT agree?
- The BTAC prediction is discarded (it means that the BHT predicted “not-taken”)
- Fetch continues from the fall-through path
- The BHT prediction overrides the BTAC prediction
• After the branch is resolved, both the BTAC and the BHT must be updated
• Why are both needed?
- The BTAC is faster: if the prediction is right, no cycle is lost
- The BHT is more accurate: even though it arrives one cycle later, its prediction can still be useful


Example - PowerPC 604 (4)
• The PowerPC 604 is superscalar (it uses dynamic scheduling)
• The reservation stations can hold up to 4 branch instructions
• 2-bit tags are needed to manage the speculation
• It follows the scheme introduced above
• In particular, the resources occupied by speculative instructions must be freed in case of misprediction (e.g. the Reorder Buffer, a structure typical of superscalar processors)


Return address predictor
• Some branches change their target address within the same program
- Such branches are typically “indirect jumps”
- In particular, these include procedure returns
- In SPEC-89, procedure returns account for 85% of such jumps
• Predicting procedure-return instructions
- The outcome is easy to predict: always taken!
- The target is not easy to predict: the same procedure can be called from different points of a program
- Using the BTB to predict the target can lead to mispredictions
- A small stack that holds the return addresses has been proposed [Kaeli91]
- At call time, the return address is pushed onto this stack
- At return time, a pop from this stack is enough
- It works as a cache of the most recent return addresses
- If the stack is large enough, it predicts all the returns
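The push/pop behaviour can be sketched as a bounded stack. The class name, the default depth and the overflow policy (the oldest entry is dropped when the stack is full) are assumptions for illustration; the slide only specifies the push on call and the pop on return.

```python
class ReturnAddressStack:
    """Small stack that caches the most recent return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.depth:
            self.stack.pop(0)          # assumed policy: forget the oldest address
        self.stack.append(return_address)

    def predict_return(self):
        # Returns None when no prediction is available (empty stack).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
ras.on_call(0x104)
ras.on_call(0x208)
ras.on_call(0x30C)                     # overflows: 0x104 is lost
predicted = [ras.predict_return() for _ in range(3)]
```

With a depth of 2 and three nested calls, the two innermost returns are predicted and the outermost one is not, which illustrates why the stack must be deep enough to predict all the returns.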


Multiple predictions [Yeh93-ics]
• Idea: predict the following branches even if the previous ones are not resolved
• Perform 1 prediction per cycle
• While attempting the subsequent predictions, it can be useful to update the PHT and the BHT speculatively
• The technique works well if the accuracy of the first prediction is high
• 1st prediction – 96% accuracy =>
- 2nd prediction – 92.16% accuracy =>
- 4th prediction – 84.93% accuracy
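The figures above are consistent with multiplying the accuracies, i.e. with assuming that each prediction is independent and has the same accuracy as the first one. Under that assumption, the probability that the first n predictions are all correct is the first accuracy raised to the power n (the function name below is hypothetical):

```python
def nth_prediction_accuracy(first_accuracy, n):
    """Probability that n consecutive predictions are all correct, assuming independence."""
    return first_accuracy ** n


second = nth_prediction_accuracy(0.96, 2)   # about 0.9216
fourth = nth_prediction_accuracy(0.96, 4)   # about 0.8493
```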


Eager Execution [Uht95]

• In a group of 4 instructions, all 4 may be branches
- A 4-ported BTB is needed (similarly to Yeh's multiple predictor)

• Eager Execution
- Both the taken and the not-taken path are executed, without predictions
- Fetch proceeds from both the taken and the not-taken paths
- In Uht's work, fetch is limited to 6 levels of branching
- Wrong paths are discarded progressively as the branches are resolved
- Of course a lot of work is thrown away… but it is fast!

• Disjoint Eager Execution
- Here branch prediction is taken into account while fetching
- Instructions are fetched only from the predicted paths, up to 6 levels of branching
- If the path is wrong, the pipeline is restarted


ALPHA 21264

Branch Prediction case study


Alpha 21264 processor (Feb. 1998)
• 500-600 MHz, 15x10^6 transistors, 2.2V, 0.35μ CMOS
• 64-bit word
• 7-stage pipeline
• 4-way superscalar execution
- It can fetch and execute up to 4 instructions per cycle
- Execution is out-of-order
• According to Hennessy and Patterson it has the most sophisticated predictor implemented up to 2003
- It is based on the predictor introduced by McFarling in 1993

Note: the Alpha processor was produced by Digital Equipment Corp. (DEC). DEC was acquired by Compaq in 1998. Compaq was acquired by HP in 2002.


Alpha 21264 predictor [Alpha99]
• Tournament Predictor
• It dynamically chooses between two predictors
- The “local predictor” (left) – equivalent to a PAg scheme
- The “global predictor” (right) – equivalent to a GAg scheme
- The predictor is selected through a history of the local/global outcomes fed to the usual 2-bit saturating counter: in practice, another (“choice”) predictor

[Figure: the program counter indexes the local history table (1024 x 10), whose output indexes the local prediction table (1024 x 3); together they form the local predictor. The path history indexes the global prediction table (4096 x 2) and the choice prediction table (4096 x 2); the choice prediction drives the MUX that selects the final branch prediction.]


Alpha 21264: Local Predictor (PAg, m=10, k=10, j=3)

• Local History Table (LHT)
- Equivalent to the BHT of the PAg scheme (k=10, 1024-entry BHT (m=10))
- It keeps the last 10 outcomes for up to 1024 branches
- It is indexed by the address of the branch instruction (BIA)

• Local Prediction Table (LPT)
- Equivalent to the PHT of the PAg scheme (j=3, 1024 entries (k=10))
- It is indexed by the history element selected in the LHT
- The FSM is a 3-bit saturating counter

• The LHT and the LPT are updated after the branch is resolved

• It works well for sequences that alternate T and N


Alpha 21264: Global Predictor (GAg, k=12, j=2)

• 4096-entry table
  • Equivalent to the PHT of the GAg scheme (j=2; 4096-entry PHT, k=12)
  • Indexed by a 12-bit global history register

• Predictor
  • The FSM is a 2-bit saturating counter

• Works well for branches that are influenced by earlier branches
• Example:

if (x == 10) {      /* if this branch is taken... */
    …               /* ...and x is not changed here... */
}
if (x % 2 == 0)     /* ...this branch will be taken too */
    …

A global predictor typically learns and correctly predicts situations of this kind
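The global scheme can be sketched in Python as follows. This is an illustration under stated assumptions: the class name and the initial counter value are hypothetical, "taken" is equated with the condition being true, and the input is a short repeating sequence of x values chosen so that the result is deterministic.

```python
HISTORY_BITS = 12                    # k = 12
PHT_ENTRIES = 1 << HISTORY_BITS      # 4096 counters
COUNTER_MAX = 3                      # j = 2: 2-bit saturating counter


class GlobalPredictor:
    def __init__(self):
        self.history = 0                      # global history register, shared by all branches
        self.pht = [1] * PHT_ENTRIES          # counters, start weakly not-taken

    def predict(self):
        return self.pht[self.history] > COUNTER_MAX // 2

    def update(self, taken):
        counter = self.pht[self.history]
        self.pht[self.history] = min(counter + 1, COUNTER_MAX) if taken else max(counter - 1, 0)
        self.history = ((self.history << 1) | int(taken)) & (PHT_ENTRIES - 1)


def simulate(xs, rounds):
    """Run the two branches of the example for every x; collect hits for the second branch."""
    predictor = GlobalPredictor()
    second_branch_hits = []
    for _ in range(rounds):
        for x in xs:
            for is_second, taken in ((False, x == 10), (True, x % 2 == 0)):
                hit = predictor.predict() == taken
                predictor.update(taken)
                if is_second:
                    second_branch_hits.append(hit)
    return second_branch_hits


hits = simulate(xs=[10, 3, 4, 10, 7], rounds=50)
print("second-branch mispredictions in the last 20 rounds:", hits[-100:].count(False))
```

Because the outcome of the first branch is the most recent bit of the shared history, it takes part in selecting the counter that predicts the second branch.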


Alpha 21264 - Total predictor size
• Global predictor (k=12, j=2)
  • k + 2^k * j = 12 + 4096 x 2 =~ 8K bits
• Local predictor
  • Local History Table: 2^m * k = 1024 x 10 = 10K bits
  • Local Prediction Table: 2^k * j = 1024 x 3 = 3K bits
• Choice predictor
  • k + 2^k * j = 12 + 4096 x 2 =~ 8K bits
• Total
  • 29K bits
  • ~180,000 transistors
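The size budget can be checked with a few lines of Python. The helper name `table_bits` is hypothetical; the table dimensions are the ones listed on the slides (4096-entry global and choice tables, hence a 12-bit history), and the history register is counted for both the global and the choice predictor, as in the list above.

```python
def table_bits(index_bits, entry_bits):
    """Bits in a table with 2**index_bits entries of entry_bits each."""
    return (1 << index_bits) * entry_bits


global_bits = 12 + table_bits(12, 2)    # history register + 4096 x 2-bit counters
lht_bits = table_bits(10, 10)           # 1024 entries x 10 bits of history
lpt_bits = table_bits(10, 3)            # 1024 entries x 3-bit counters
choice_bits = 12 + table_bits(12, 2)    # counted the same way as the global predictor

total_bits = global_bits + lht_bits + lpt_bits + choice_bits
print(total_bits, "bits =", round(total_bits / 1024, 1), "Kbits")
```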


Use of the Global vs. the Local Predictor

Percentage of branches predicted by the local predictor: the number of predictions made by the local predictor, normalized to the total number of predictions (local + global)

Benchmark    Fraction of predictions by local predictor
nasa7        98%
matrix300    100%
tomcatv      94%
doduc        90%
spice        55%
fpppp        76%
gcc          72%
espresso     63%
eqntott      37%
li           69%

SPEC89 benchmarks, chart from [Hennessy02]


Predictor performance – ABP

Branch prediction accuracy

Benchmark    Profile-based    2-bit counter    Tournament
gcc          88%              70%              94%
espresso     86%              82%              96%
li           88%              77%              98%
fpppp        86%              82%              98%
doduc        95%              84%              97%
tomcatv      99%              99%              100%

SPEC89 benchmarks, chart from [Hennessy02]


Predictor performance: misprediction rate vs. size

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for the Local, Correlating and Tournament predictors. Benchmark suite: SPEC89. Chart from [Hennessy02]]


P6, i.e. PENTIUM PRO, II, III (Nov. 1995)
Branch Prediction case study


Branch Prediction in the P6 [Shen02]
• Branch resolution (taken/not_taken)
  • Performed in the JEU (Jump Execution Unit)
  • The BTB predicts the target as soon as the IFU (Instruction Fetch Unit) fetches the branch
  • All addresses are verified by the BAC (Branch Address Calculator) or by the JEU
• Branch Target Buffer (BTB)
  • Operates in the early stages of the pipeline
  • Starts from the IP (Instruction Pointer) address and produces a prediction of both the outcome and the target
  • The predicted target address is sent to the IFU for the fetch
• BTB update
  • The BTB is updated as soon as the JEU resolves the branch
  • This can be too late if the next branch arrives among the immediately following instructions
    - The BTB is therefore updated speculatively at prediction time


P6 - Branch Prediction Algorithm
• Based on the 2-level adaptive scheme [Yeh92]
  • First level: history of the branch outcomes
  • Second level: behavior of the branch for a given history pattern
  • Differences from [Yeh92]
    - There is a speculative copy of the BHT, which makes it possible to predict before the branch is resolved (and the history updated)
• For each branch…
  • The BTB keeps k bits of "real" history (called BHR, i.e. the BHT)
    - Taken/Not-taken for the last k branches
  • The BHT indexes a table of 2^k state elements (Pattern Table, PT, i.e. the PHT)
    - The associated FSM is the usual saturating counter
• The BTB uses a "semilocal" pattern table per set
  • Each entry has 4 bits of history
  • All the entries of a set use the same pattern table
• Speculative update of the BHR
  • A speculative copy of the BHR is updated with the current prediction
    - This copy is used if a branch arrives before the previous one has been resolved
  • The real BHR is updated with the actual outcome after the branch is resolved
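The interplay between the real and the speculative history can be sketched as below. This is a simplified Python model, not the P6 implementation: the class and method names are hypothetical, and the recovery step (restarting the speculative copy from the real history after a misprediction) is an assumption, since the slides do not describe it.

```python
HISTORY_BITS = 4                      # each entry has 4 bits of history
MASK = (1 << HISTORY_BITS) - 1


class BranchHistory:
    """Real and speculative history for one BTB entry (hypothetical model)."""

    def __init__(self):
        self.real = 0           # updated when the branch is resolved
        self.speculative = 0    # updated at prediction time

    def record_prediction(self, predicted_taken):
        self.speculative = ((self.speculative << 1) | int(predicted_taken)) & MASK

    def record_resolution(self, predicted_taken, actual_taken):
        self.real = ((self.real << 1) | int(actual_taken)) & MASK
        if predicted_taken != actual_taken:
            # Assumption: on a misprediction the speculative copy is
            # discarded and restarted from the real history.
            self.speculative = self.real


entry = BranchHistory()
entry.record_prediction(True)           # first branch predicted, not yet resolved
entry.record_prediction(True)           # second branch arrives before the first resolves
print(bin(entry.speculative), bin(entry.real))
entry.record_resolution(predicted_taken=True, actual_taken=True)
entry.record_resolution(predicted_taken=True, actual_taken=False)
print(bin(entry.speculative), bin(entry.real))
```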


Branch Prediction Algorithm - BTB
• If there is no hit in the BTB
  • A static prediction is used: BTFN = Backward Taken, Forward Not-taken
• Return stack
  • The BTB also keeps a 16-entry "return stack" [Kaeli91]
  • This helps predict the return address of functions
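Both mechanisms are simple enough to sketch in Python. The function and class names are hypothetical, and the overflow policy of the return stack (the oldest entry is dropped when more than 16 calls are nested) is an assumption, not something stated on the slide.

```python
from collections import deque


def btfn_predict(branch_address, target_address):
    """Static BTFN rule: backward branches are predicted taken, forward ones not taken."""
    return target_address < branch_address


class ReturnStack:
    """Fixed-depth return address stack; the oldest entry is lost on overflow (assumption)."""

    def __init__(self, depth=16):
        self.entries = deque(maxlen=depth)

    def on_call(self, return_address):
        self.entries.append(return_address)

    def on_return(self):
        # None means the stack is empty, so no prediction is available.
        return self.entries.pop() if self.entries else None


stack = ReturnStack()
stack.on_call(0x1004)        # outer call
stack.on_call(0x2008)        # nested call
print(hex(stack.on_return()), hex(stack.on_return()))
print(btfn_predict(0x3000, 0x2FF0), btfn_predict(0x3000, 0x3040))
```

Returns are predicted in last-in, first-out order, which matches how nested function calls unwind.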


PENTIUM-4 (Nov. 2000)
PENTIUM-M (Mar. 2003)
Branch Prediction case study


Pentium4 - Branch prediction [Boggs04]
• The predictor is 8 times larger than that of the P6 (4KB)
  • According to Intel, the most sophisticated prediction scheme as of 2007
  • The exact algorithm has not been disclosed
• The predictor works together with the Trace Cache
  • The Trace Cache replaces the instruction cache
• Components
  • Return Address Stack: 16 entries (see [Kaeli91])
  • Indirect Branch Predictor (see next slide)
  • Loop detector (see previous slides)
• It relies on two ways of predicting a branch (similar to the P6)
  • On a BTB miss, a static BTFN prediction is used
  • In the 90nm version the static prediction is improved (see later slides)
• Improvements introduced in the Pentium-M [Gochman03]
  • A combination of the following is used:
    - Prediction mechanisms already present in the Pentium-4
    - Indirect Branch Predictor
    - Loop detector


Indirect Branch Predictor
• Handles indirect branches, i.e. data-dependent ones
  • That is, the branch target is held in a register
  • They are very frequent in object-oriented programs (Java, C++)
• There are two cases
  • Indirect branches with 1 target (easier to predict)
  • Indirect branches with several targets (e.g. a "case" statement), in which the target depends on the program data
• The predictor distinguishes between these two cases
  • Data-independent
    - Only the PC is used to select the branch target
    - The target is stored in a table indexed by the PC
  • Data-dependent
    - The global outcome history is used to select the branch target
    - The target is stored in a table indexed by the global history
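The two-table idea can be sketched in Python as follows. Everything here is a hypothetical model: the class name, the use of dictionaries instead of fixed-size tables, the (PC, history) key for the data-dependent table, and the rule that a branch is treated as data-dependent once the prediction in use turns out wrong are all assumptions made for illustration.

```python
class IndirectPredictor:
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.history = 0         # global outcome history of conditional branches
        self.by_pc = {}          # data-independent targets, indexed by PC
        self.by_history = {}     # data-dependent targets, indexed by (PC, history)

    def record_outcome(self, taken):
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def predict(self, pc):
        key = (pc, self.history)
        if key in self.by_history:
            return self.by_history[key]
        return self.by_pc.get(pc)          # None when the branch is unknown

    def update(self, pc, actual_target):
        if pc not in self.by_pc:
            self.by_pc[pc] = actual_target
        elif self.predict(pc) != actual_target:
            # Assumption: a wrong prediction marks the branch as data-dependent.
            self.by_history[(pc, self.history)] = actual_target


predictor = IndirectPredictor(history_bits=2)
SWITCH_PC = 0x40
for _ in range(2):                                     # two training passes
    for outcomes, target in (((True, False), 0xA00), ((False, True), 0xB00)):
        for taken in outcomes:
            predictor.record_outcome(taken)
        predictor.update(SWITCH_PC, target)
print(hex(predictor.predict(SWITCH_PC)))
```

In this sketch a single-target branch never leaves the PC-indexed table, while a multi-target branch gradually gets one entry per history pattern.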


Software hints for branch prediction
• The Pentium4 lets software pass hints to the processor
• The branch prediction and trace formation hardware consults this information to improve performance
• ISA changes
  • Branch instructions must be modified to support the hints
  • Prefixes are added to conditional jumps
• The technique is used only when the trace is created
  • After the trace has been created, the software hints are no longer considered


Pentium4 – Static prediction (90nm versions)
• For backward branches:
  • If the branch offset is larger than a certain (empirically determined) value, the branch is unlikely to sit at the bottom of a loop, so it is not predicted Taken
  • Otherwise the (static) Taken prediction is used
• For forward branches:
  • Not-taken is still used (as in BTFN)
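The refined static rule can be written as a small Python function. The function name and the threshold value are placeholders: the slide only says that the real threshold was found empirically, and the Not-taken prediction for far backward branches is inferred from the wording above rather than stated explicitly.

```python
def static_predict(branch_address, target_address, loop_distance_threshold=256):
    """Return True for a static Taken prediction.

    loop_distance_threshold is a placeholder value, not Intel's actual threshold.
    """
    if target_address >= branch_address:
        return False                        # forward branch: Not-taken, as in BTFN
    distance = branch_address - target_address
    # Assumption: a far backward branch, unlikely to close a loop, is predicted Not-taken.
    return distance <= loop_distance_threshold


print(static_predict(0x1000, 0x0FF0))   # short backward branch
print(static_predict(0x1000, 0x0100))   # far backward branch
print(static_predict(0x1000, 0x1040))   # forward branch
```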


Misprediction rate in the Pentium4
• Comparison between two generations of Intel architectures (130nm vs 90nm)

SPECint_base2000    130nm    90nm
164.gzip            1.03     1.01
175.vpr             1.32     1.21
176.gcc             0.85     0.70
181.mcf             1.35     1.22
186.crafty          0.72     0.69
197.parser          1.06     0.87
252.eon             0.44     0.39
253.perlbmk         0.62     0.28
254.gap             0.33     0.24
255.vortex          0.08     0.09
256.bzip2           1.19     1.12
300.twolf           1.32     1.23

The table reports the number of mispredictions per 100 branch instructions for the Pentium-4 architecture at 130 nm and at 90 nm.

Data provided by Intel [Boggs04]


References
[Alpha99] Compaq Computer Corporation. Alpha 21264 Microprocessor Hardware Reference Manual, 1999.
[Boggs04] Boggs D., et al. The Microarchitecture of the Intel® Pentium® 4 Processor on 90nm Technology. Intel® Technology Journal, Vol. 08, Issue 01, February 18, 2004.
[Gochman03] Gochman S., et al. The Intel® Pentium® M Processor: Microarchitecture and Performance. Intel® Technology Journal, Vol. 07, Issue 02, May 21, 2003.
[Hennessy02] Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. 3rd Edition. Morgan Kaufmann Publishers Inc., 2002.
[Kaeli91] Kaeli, D. R. and Emma, P. G. Branch history table prediction of moving target branches due to subroutine returns. In Proc. ISCA-18, pages 34–41, May 1991.
[Shen02] Shen J. P., Lipasti M. Modern Processor Design. McGraw Hill Higher Education; Beta Ed edition (November 1, 2002).
[Yeh93-ics] Yeh, T., Marr, D. T., and Patt, Y. N. Increasing the instruction fetch rate via multiple branch prediction and a branch address cache. In Proceedings of the 7th International Conference on Supercomputing (Tokyo, Japan, July 19-23, 1993).
[Uht95] Uht, A. K., Sindagi, V., Hall, K. Disjoint eager execution: an optimal form of speculative execution. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Dec. 1995).
[Rotenberg96] Rotenberg, E., Bennett, S., and Smith, J. E. Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching. 29th International Symposium on Microarchitecture, Dec. 1996.


MORE ON BRANCH PREDICTION


CORE, CORE2, CORE-I7


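Example - PowerPC 604 (2)

[Figure: PowerPC 604 pipeline with branch prediction. Labeled elements: PC (with +4 increment), I-cache, Branch Target Address Cache (BTAC), Branch History Table (BHT), decode and decode buffer, dispatch and dispatch buffer, reservation stations (BRN, SFX, SFX, CFX, FPL, LS), execute, re-order buffer, commit. The BTAC and the BHT each provide a prediction (BTAC prediction, BHT prediction) and each receive an update (BTAC update, BHT update).]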


Introduction to Linux

Roberto Giorgi, Universita’ degli Studi di Siena


Objectives

• High-performance computers mostly rely on Linux
• Awareness of how to interact with Linux
  • Basic commands to interact with the shell and to use Linux productively
  • File-system structure
  • Protection, sharing
  • Advanced commands to control the machine


UNIX
• Before Linux there was UNIX…
• UNIX was born at Bell Labs (AT&T) in 1969 through the effort of researchers who needed modern tools to help them in their research projects
• UNIX was born as an Operating System featuring:
  • multiuser: can differentiate among several users
  • multitasking: can safely share machine resources (such as CPU, memory, storage, …)


LINUX

• 1991: Linus Torvalds, as a hobby, creates an OS for educational use with the same capabilities as classical UNIX, and makes it freely available on the Internet
• 1998: Linux becomes very popular and starts taking market share from Microsoft
• 2014: Linux is the de facto standard for high-performance computers (and also for embedded systems)
  • It is widely supported by big companies like IBM, Oracle, Intel and almost all other players in the computer world
• Several “distributions” (distros) are available that contain not only the OS but also many other applications, including the source code: the user is free to modify the software


The Shell

[Course map: Shell, File System, Information Protection, Control Commands, Combined Commands, Internet, WEB/DB Server]


User Authentication (Login)

• Every user has the illusion of having the whole machine under his/her complete control (virtual machine)

Login: jenny
Password:

Welcome to Unix!
$

• "Login: jenny": name of the user (username)
• "Password:": for security reasons the password is never shown
• "Welcome to Unix!": this message generally identifies the version of the operating system and may show announcements from the administrator of the machine
• "$": prompt of the command interpreter (or shell)


Trivial security measures

• When you type a wrong name or password, the login program shows the following message, but only after the user has entered *both* username and password

Login incorrect

• This message tells you that you did not correctly type:
  • either the username
  • or the password
  • or that both are invalid
• This message does NOT specify whether the error is in the username or in the password
  • This should discourage unauthorized users from trying names or passwords just to discover them and gain access


After the login

• After the login, you are talking to the command interpreter, also known as the shell
• The shell plays an important role in all interactions with the Linux operating system
• When you type a word (in response to the shell prompt), the shell interprets that word and starts the corresponding actions
• Such actions can be
  • The execution of a program
  • The production of an error message telling you that the word was not typed correctly (or does not make sense to the OS)


Logout

• To log out, press CTRL-D in response to the shell prompt
  • If CTRL-D does not work, try the exit or the logout command
  • The logout command is typically used by certain shell types (e.g., C shell), whilst other shells (Bourne shell, Korn shell, Bourne Again shell) use CTRL-D or exit
• If the terminal is inside a graphical user interface
  • The logout procedure, besides disconnecting the shell associated with the terminal, closes the window


Correcting errors

• Since the shell and many other utilities and tools do not interpret what the user has written until the RETURN key is pressed, it is possible to correct errors instead of sending a wrong command that has to be stopped later
• There are three ways to correct typing errors:
  • Delete one character at a time by pressing a specific key: BACKSPACE or CTRL-H
  • Delete the whole line by pressing the «kill key» CTRL-U (this clears the text typed after the shell prompt)
  • Interrupt the shell by pressing the «interrupt key» CTRL-C (this generates a new prompt)
• After pressing RETURN … it’s too late! You will then have to wait for the shell to finish processing the command


Interrupt the execution of a program

• To interrupt a running program you can press CTRL-C
  • This should be done only as a last resort, since it can leave the program in an indeterminate state (data is not saved)
• When CTRL-C is pressed, the OS sends an interrupt signal (SIGINT) to all the processes running in the foreground of that terminal (at the prompt, this is the shell itself)
• The executing program, however, can decide what to do when the interrupt signal is received
  • Some programs terminate immediately
  • Some programs decide to ignore the signal
• When the shell receives the interrupt signal, it prints a new prompt and waits again for user input
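How a program "decides what to do" with the interrupt signal can be shown with a short Python sketch. It assumes a Unix-like system and Python 3.8 or later (for `signal.raise_signal`); the handler name and the `received` list are hypothetical. Instead of waiting for a real CTRL-C, the program sends itself the same signal.

```python
import signal

received = []


def on_interrupt(signum, frame):
    """This program chooses to note the signal instead of terminating."""
    received.append(signum)


# Install the handler, remembering the previous one so it can be restored.
previous_handler = signal.signal(signal.SIGINT, on_interrupt)
try:
    signal.raise_signal(signal.SIGINT)      # the same signal that CTRL-C delivers
finally:
    signal.signal(signal.SIGINT, previous_handler)

print("interrupt signals handled:", len(received))
```

Without a handler, Python's default reaction to this signal is to raise `KeyboardInterrupt`, which normally ends the program.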


The Shell

• The shell is the native interface of Linux/Unix
  • Simple
  • Powerful
• The shell can appear «full screen» or as a window inside a graphical interface (terminal)
  • The shell interacts with the user through the «command prompt»
• Users can use the shell as a programming language that executes commands for the OS
  • Such commands can be typed on the terminal or read from a file (in which case the file is called a «script»)
• The classic shell of AT&T UNIX is the Bourne shell (sh)
  • Today the most popular is the Bourne Again shell (bash)


Shell Commands

• The shell gives you access to several commands to
  • list, copy, move, remove files: ls, cp, mv, rm
  • create, remove directories: mkdir, rmdir
  • change working directory: cd
• Some other commands are not shell commands, but can be invoked as if they were
  • Utilities, e.g., to search for files or for words in a file: find, grep
  • Other general programs (e.g., lynx, gpg, …)
  • Application programs: e.g., <myprogram>
• To get help on a command you can type man <command>


Predefined commands and Scripts

• Shell commands typically include predefined commands like cd or logout
• A command can also be a «personalized command», also known as a script
  • A script is a file that contains a sequence of commands
  • Scripts enable users to execute complex tasks as if they were a single command (e.g., installing software, …)
• Scripts are typically developed by system administrators or advanced users
  • backup procedures, exchanging files
  • controlling simulations, mining result data
  • …


Command Syntax

• This is the typical syntax of a command:
  command [arg1] [arg2] … [argn] <RETURN>
• One or more spaces or TABs must be present between the elements of a command line
• Arguments arg1, arg2, ... can be:
  • The name of a file (“filename”)
  • A string of text or numbers
  • Other objects on which the command acts
  • The square brackets […] mean that the argument is optional
  • The angle brackets <…> mean that the argument is mandatory
• Options are special arguments that start with «-»
  • Options modify the behaviour of a command, e.g.:
  • ls -l example <RETURN>
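The structure of a command line can be illustrated with Python's standard `shlex` module, which splits text on spaces and TABs in a shell-like way. The function name `parse_command_line` is hypothetical, and separating options from the other arguments is done here only for illustration: a real shell passes all the words to the program, which interprets the options itself.

```python
import shlex


def parse_command_line(line):
    """Split a command line into command name, options, and other arguments."""
    words = shlex.split(line)           # splits on spaces/TABs and honours quotes
    if not words:
        return None, [], []
    command, rest = words[0], words[1:]
    options = [w for w in rest if w.startswith("-")]
    arguments = [w for w in rest if not w.startswith("-")]
    return command, options, arguments


print(parse_command_line("ls  -l\texample"))
```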


Executing a command

1) The shell searches the directory list (specified by $PATH) for an executable file with the same name as the command
2) If the shell finds the executable file, it creates a new process
   • A Linux process is a program in execution
3) The shell passes the arguments to that process, including the options and the name of the command
4) While the command is executing, the shell waits, sleeping, until the generated process ends

• Later we will see some advanced shell features
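The four steps above can be sketched in Python. This is a simplified model, not how any particular shell is implemented: the function names are hypothetical, command names that contain a «/» (which shells do not look up in $PATH) are not handled, and 127 is the exit status that shells conventionally report when a command is not found.

```python
import os
import subprocess


def find_in_path(command, path=None):
    """Step 1: look for an executable file named 'command' in the PATH directories."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def run_command(argv):
    """Steps 2-4: create a process, pass it the arguments, wait for it to end."""
    executable = find_in_path(argv[0])
    if executable is None:
        print(f"{argv[0]}: command not found")
        return 127
    # argv[0] (the command name) is passed along with the options and arguments.
    completed = subprocess.run(argv, executable=executable)
    return completed.returncode


print(find_in_path("ls"))
```

A call such as `run_command(["ls", "-l"])` would then behave like typing `ls -l` at the prompt and return the exit status of the process.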


Editors

• There are several NON-GRAPHICAL editors, which are useful for simple interaction with the OS:
  • vi, VIsual editor (most popular)
  • joe, editor (a WordStar emulator)
  • emacs, editor: mostly liked by developers
  • vim, Vi IMproved (an improved version of vi)
• Several GUI (Graphical User Interface) editors
  • gedit, xemacs, kedit, kate, openoffice, ...
  • (not covered in this introduction, ...)
• The simplest to use is joe: “joe <filename>”
  • Write your text, then press CTRL-K CTRL-X to save and exit


File system

[Course map: Shell, File System, Information Protection, Control Commands, Combined Commands, Internet, WEB/DB Server]


Hierarchical Structure of the file system
• The structure of the Linux file system is hierarchical
• This allows users to organize files so that they can easily find what they are looking for
• This structure is called a “tree”
• The top node is called the “root” of the tree

[Figure: file-system tree with the ROOT at the top, DIRECTORIES as internal nodes and FILES as leaves]


Home Directory
• Each user «starts» with a directory called the home directory
• From a directory, users can create all the sub-directories that they wish
• In this way, users can expand the directory structure to match their own needs

[Figure: a HOME DIRECTORY containing FILES and a SUB-DIRECTORY]


Filename (1)

• Each file has a name (filename)
  • Linux can use up to 255 characters for a filename
• For better portability and usability, 14 characters are recommended
• To avoid confusion (although it is possible to use almost any character), it is also recommended to preferably use
  • Capital letters “A-Z”, small letters “a-z”, numbers “0-9”, underscore “_”, dots “.” and commas “,”
• The only exception to this naming scheme is the root directory, always indicated by the name “/”
• In the same directory we can’t have two files with the same filename


Filename (2)

• If we share files among users of generic Unix systems (e.g., non-Linux), it is recommended to avoid overly long names
  • They will be easier to type
  • Some Unix versions limit filenames to 14 characters
• The disadvantage of short filenames is that they are typically less descriptive than longer ones


Filename Extension

• A filename extension is the part of the filename that follows the last dot “.”
• Extensions help describe the content of the file
• Extensions can be used freely to make file contents easier to understand
• Examples:
  example.txt     a file containing text
  example.html    a file containing a WEB page
  example.c       a file in C language


Invisible Filenames

• A filename that starts with a dot «.» is called invisible, since the command ls normally does not show it
  • The command ls -a shows all files, including the invisible ones
• The configuration files of programs are usually invisible: .login, .cshrc, .logout, .profile
  • Configuration files provide the OS with user-specific information or user-specific application preferences
• Two special entries are also invisible («.» and «..»)
  .     is an alias of the current directory
  ..    is an alias of the parent directory

Note: “...” (3 dots) is a legal filename, but it is by no means a reserved name like . or ..
If you see a “...” filename, it may mean that some hacker is trying to trick you!
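The rule is easy to express in code. The Python sketch below mimics which names `ls` and `ls -a` would list from a given set of directory entries; the function name and the sample entries are made up for illustration.

```python
def listed_names(names, show_all=False):
    """Mimic which names 'ls' (show_all=False) or 'ls -a' (show_all=True) would list."""
    selected = names if show_all else [n for n in names if not n.startswith(".")]
    return sorted(selected)


entries = [".", "..", ".profile", ".login", "notes.txt", "..."]
print(listed_names(entries))
print(listed_names(entries, show_all=True))
```

Note that a file named "..." is hidden as well, simply because its name starts with a dot.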


Absolute Pathnames

• You can specify the pathname of a file by tracing a path from the root directory, through all the intermediate directories, down to the file you want
• All the filenames along the path are listed in sequence and separated by the «/» character
• The full pathname starts with the name of the root directory (again «/»)

[Figure: the path from / through usr to example, giving the pathname /usr/example]

This path is also called absolute since it locates a file uniquely in the file system


Home directory and Working directory

• When you connect to a Linux system, you always start in a working directory
• This initial working directory is called the home directory
  • The alias of the home directory is “~”
• Once you change directory with the cd command, you will be in a possibly different working directory
• The command pwd always prints the pathname of the current working directory


Relative Pathnames

• A relative pathname traces a path from the working directory to a given file
• Every pathname that does not start with «/» is a relative pathname

[Figure: directory tree in which / contains usr, usr contains example and A, and A contains f1]

Working directory    Pathname of “f1”    Pathname of “example”    Pathname type
-                    /usr/A/f1           /usr/example             absolute
/usr                 A/f1                example                  relative
/usr/A               f1                  ../example               relative
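The rows of the table can be checked with Python's standard `posixpath` module. The function name `resolve` is hypothetical; note that `normpath` simplifies «..» purely by text manipulation, without looking at the real file system.

```python
import posixpath


def resolve(working_directory, pathname):
    """Turn a pathname into an absolute one, relative to the given working directory."""
    if posixpath.isabs(pathname):
        return posixpath.normpath(pathname)
    return posixpath.normpath(posixpath.join(working_directory, pathname))


for cwd, path in (("/usr", "A/f1"), ("/usr/A", "f1"), ("/usr/A", "../example")):
    print(cwd, path, "->", resolve(cwd, path))
```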


Important Standard Directories (1)

• /  (root) is present in any Linux system
• /home  each user has a home directory, which is typically a subdirectory of the /home directory
• /bin and /usr/bin  contain Linux standard programs
• /etc  contains configuration and administration files for managing the system
  • /etc/passwd  contains the list of authorized users
• /usr  contains programs that are specific to a system installation
  • /usr/lib  contains standard libraries: all programs use «common» functions that are external to the program itself (e.g. the «printf» function to print a message); such functions are the same for any user and are available in these system libraries


Important Standard Directories (2)

• /var  files whose size varies during the life of the system, e.g.:
  • /var/log  system logs: files that record the events that happen in the system (connections, commands issued by the users, statistics on resource usage, …)
  • /var/spool  spooled files: files that are temporarily waiting in the printer queue, email queue, …
  • /var/lock  synchronization files: generated during the execution of a program and deleted once it ends
• /dev  all files that represent peripheral devices, like terminals, printers, disks, …
• /tmp  many programs use this directory for temporary files
• /export  a common directory for files that can be shared across systems

Patterns

• If we need to type several similar filenames, we can use wildcards to specify a pattern, e.g.:
  • All files that start with “report”: report*
• A pattern is specified by using special characters (called wildcards) such as “*”, “?”, “[]”
• When a wildcard appears in the command line, the shell expands it into a list of matching filenames of the working directory and passes that list to the program to be launched

The special character ?

• The question mark matches any single character in a filename
    $ lpr memo?
• The shell expands memo? into the list of filenames of the working directory whose name is «memo» followed by any single character
    $ ls memo?
    memo5 memo9 memoa memos

Note: this doesn’t match the filename «memo»

The special character *

• The star matches any number of characters (including, as a limit case, zero characters)
    $ ls memo*
• The star cannot match an initial «.», which marks an invisible (hidden) file
  • To match a filename which starts with a «.», you must explicitly type the «.» at the beginning, followed by the desired pattern

The special character []

• A pair of square brackets with some characters in between matches, at that position, any one of the characters listed between the brackets
• E.g.: memo[17a] matches
    memo1 memo7 memoa
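The three wildcards can be compared side by side. This is a minimal sketch in a scratch directory: the file names are hypothetical (chosen to resemble the examples above), and LC_ALL=C is set only to make the order of the expanded names predictable.

```shell
# Scratch directory with a few hypothetical file names (assumed for this example)
LC_ALL=C; export LC_ALL          # predictable ordering of the expanded names
work=$(mktemp -d)
cd "$work"
touch memo memo1 memo5 memo7 memoa memos memo10 .memo_hidden

question=$(echo memo?)           # exactly one character after "memo"
bracket=$(echo memo[17a])        # one character among 1, 7, a
set -- memo*                     # zero or more characters after "memo"
star_count=$#
set -- *                         # a star alone does not match the leading "."
visible_count=$#
hidden=$(echo .memo*)            # the leading "." is typed explicitly

echo "memo?     -> $question"
echo "memo[17a] -> $bracket"
echo "memo*     -> $star_count names"
echo "*         -> $visible_count names (hidden file excluded)"
echo ".memo*    -> $hidden"

cd /
rm -rf "$work"
```

Using echo instead of ls makes it visible that the expansion is done by the shell before the command starts.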

Information Protection

[Section map: Shell, File System, Information Protection, Control Commands, Combined Commands, Internet, WEB/DB Server]

Scheme of Access Permissions (1)

• The scheme of access permissions
  • Regulates the access to your files by other users; you may or may not want to share such files
  • Lets you keep your confidential files private
• E.g., you can decide that other users:
  • Can read and write a given file that you own
  • Can only read (but not write) a file that you own
  • Can only write (but not read) a file that you own
  • Cannot read or write a file that you own
• You can protect the whole content of a directory to prevent other (unwanted) users from reading it
• Warning: the system administrator has full access to any file (regardless of the permissions that the file owner has specified)

Scheme of Access Permissions (2)

• The scheme of access permissions is based on the following three concepts associated with EACH file:
  • User: the owner of the file (one, and only one, of the users)
  • Group of the file (one, and only one, of the system groups)
  • Mask of access permissions: specifies what is allowed and what is forbidden by means of these attributes:
    • (r) read permission
    • (w) write permission
    • (x) execution permission
• The mask specifies in compact notation (see next slides) the permissions for the User (u), for the Group (g), and for any Other (o) user (one who is neither the associated User nor a member of the associated Group) of the file

Access Permission MASK

• The access permission mask specifies these 3 attributes (r w x)
  • For the associated User (or owner)
  • For the associated Group (a list of users who may not own the file)
  • For any Other user (implicitly defined as any user who is not the owner and does not belong to the associated group)
• Hence, in total the mask has to specify 3x3=9 permissions!
• The owner and the administrator are the only two users that can change the access permissions
  • For the owner
  • For the associated group
  • For the others
• The access permission mask is then represented by a string like:
    rwx r-x r--
• The dash character « - » specifies that the corresponding permission is not enabled

Visualizing the access permissions

• The command “ls -l” shows all the above attributes of the file

    $ ls -l letter.txt check_spell
    -rw-r--r-- 1 alex pubs 3355 May 2 10:52 letter.txt
    -rwxr-xr-x 2 alex pubs 852 May 5 14:03 check_spell

• Fields, from left to right: file/dir flag, permissions for the owner User, permissions for the Group, permissions of the Other users, number of hard links to this file, owner User, associated Group, size, date, hour, filename
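The first ten characters of a long listing can be split into the file/dir flag and the three permission triples. The sketch below is only an illustration: the scratch file is hypothetical, and the «=» operator of chmod (which sets a triple exactly) is used here for brevity even though the slides only introduce «+» and «-».

```shell
# Scratch file with the same mask as letter.txt above (file name and location are assumptions)
work=$(mktemp -d)
f="$work/letter.txt"
touch "$f"
chmod u=rw,g=r,o=r "$f"                 # "=" sets each triple exactly

mask=$(ls -l "$f" | cut -c1-10)         # first ten characters of the long listing
type_flag=$(echo "$mask" | cut -c1)     # "-" for a file, "d" for a directory
user_perm=$(echo "$mask" | cut -c2-4)   # triple for the owner User
group_perm=$(echo "$mask" | cut -c5-7)  # triple for the Group
other_perm=$(echo "$mask" | cut -c8-10) # triple for the Other users
dir_flag=$(ls -ld "$work" | cut -c1)    # the scratch directory itself

echo "mask=$mask type=$type_flag user=$user_perm group=$group_perm other=$other_perm"
echo "flag of the directory: $dir_flag"
rm -rf "$work"
```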

Directory Access Permissions

• Access permissions in this case have a slightly different meaning:
  • r: permission to list the directory content (files, subdirs)
  • w: permission to add or remove content (files, subdirs)
  • x: permission to traverse that directory (to access subdirs)
• Moreover, there is a «tenth» attribute in the permission mask (the first character), which specifies whether this is
  • - a file
  • d a directory
• E.g.:
    $ ls -ld /home/alex/info
    drwxr-xr-x 3 alex pubs 512 May 2 10:52 /home/alex/info
  (the leading «d» marks a directory)

Link

• A link is an «alias» of a file that can be created at any point of the file system
  • It’s useful to avoid replicating information (e.g., big files)
  • It permits a personalized classification of the files that are in the directory hierarchy

Using ln to create a link

• The ln command creates an additional link to a file
  • It DOES NOT create another copy of the file!
  • It APPEARS like a copy of the file

    $ cd /home/U1
    $ ln -s /home/U2/b f1

[Figure: directory tree under /home with the user directories U1 and U2 and the files a, b, f1, f2; the new link /home/U1/f1 points to /home/U2/b]

• I can then refer to the same file by using /home/U2/b or /home/U1/f1

Checking if a file is a link

    $ cd /home/U1
    $ ln -s /home/U2/b f1
    $ ls -l f1
    lrwxrwxrwx 1 alex pubs 10 May 2 10:52 f1 -> /home/U2/b
  (the leading «l» marks a link)

• We can remove a link (like a file) by issuing the command rm
• This type of link is called a «soft link» (or symbolic link), to distinguish it from a «hard link», which is created by ln without the -s option; a hard link must reside in the same file system as the file it refers to, and hard links to directories are normally not allowed
• Removing a soft link never removes the pointed object; removing a hard link removes only that name, and the file content is deleted when its last hard link is removed. Soft links are usually preferred because they can cross file systems and are clearly visible in the output of ls -l
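The difference between the two kinds of link shows up when the original name is removed. This is a minimal sketch in a scratch directory; the directory layout mimics the U1/U2 example, while the name hard_b and the file content are hypothetical.

```shell
# Scratch copy of the U1/U2 layout (location, content and the name hard_b are assumptions)
work=$(mktemp -d)
mkdir "$work/U1" "$work/U2"
echo "shared data" > "$work/U2/b"

cd "$work/U1"
ln -s "$work/U2/b" f1               # soft (symbolic) link
ln "$work/U2/b" hard_b              # hard link: a second name for the same file

link_flag=$(ls -l f1 | cut -c1)     # "l" marks a soft link
via_soft=$(cat f1)                  # reading through the soft link

rm "$work/U2/b"                     # remove the original name
via_hard=$(cat hard_b)              # the content survives through the hard link
if [ -e f1 ]; then soft_target=present; else soft_target=missing; fi   # the soft link now dangles

echo "flag=$link_flag soft='$via_soft' hard='$via_hard' soft target is $soft_target"
cd /
rm -rf "$work"
```

After the rm, the soft link still exists as a directory entry but points to a name that is gone, whereas the hard link keeps the data alive.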

Changing the permissions

    $ chmod <update_mask> file1

• Changes the permissions associated with file1
• The <update_mask> has the following format: <who><operation><permission>
  • <who>: which of the three triples has to be changed (u = user, g = group, o = others)
  • <operation>: add (+) or remove (-) the permission
  • <permission>: the permission to be changed (r = read, w = write, x = execute)
• E.g., to enable the group to write file1:
    $ chmod g+w file1
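The effect of an update mask can be checked by looking at the mask before and after. A minimal sketch on a scratch file follows; the starting mask rw-r--r-- is an assumption chosen for the example.

```shell
# Scratch file1 with an assumed starting mask of rw-r--r--
work=$(mktemp -d)
f="$work/file1"
touch "$f"
chmod u=rw,g=r,o=r "$f"
before=$(ls -l "$f" | cut -c2-10)

chmod g+w "$f"                          # add write permission for the group
after_add=$(ls -l "$f" | cut -c2-10)

chmod o-r "$f"                          # remove read permission for the others
after_remove=$(ls -l "$f" | cut -c2-10)

echo "$before -> $after_add -> $after_remove"
rm -rf "$work"
```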

Control Commands

[Section map: Shell, File System, Information Protection, Control Commands, Combined Commands, Internet, WEB/DB Server]

LINUX commands (1)

• Linux provides a large variety of commands, targeted at the common needs of managing the system and at working with more complex applications (databases like MySQL or Oracle, Web, scientific applications, …)
• It’s not sufficient to start a program: to make sure that it did not generate any error during its execution, one has to check its exit code
• At the end of its execution, any command returns a numerical value (called exit code), which tells whether the program finished without errors (exit code = 0) or not (exit code ≠ 0)
• The same command may be implemented differently on different Linux versions or distributions: the online manual (“man”) provides help for the current implementation
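In the shell, the exit code of the last command is available in the special parameter $?. The sketch below uses grep on a scratch file (file name and content are assumptions) to show one successful and one unsuccessful run.

```shell
# Scratch file to search in (name and content are assumptions)
work=$(mktemp -d)
echo "hello" > "$work/notes.txt"

grep -q hello "$work/notes.txt"
found_status=$?                                         # 0: the pattern was found

missing_status=0
grep -q absent "$work/notes.txt" || missing_status=$?   # non-zero: no line matched

echo "exit code when found: $found_status, when not found: $missing_status"
rm -rf "$work"
```

The second call is written with «||» so that the non-zero exit code is captured even when the script runs in a mode that aborts on errors.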

LINUX commands (2)

• We can group commands in four main categories:
  • File and directory management, e.g.:
    • cp, copy a file
  • Text processing commands, e.g.:
    • grep, search for a given word (or pattern) in a file
  • System status and monitoring commands, e.g.:
    • du, summarize the disk usage
  • Software maintenance commands, e.g.:
    • tar, can be used for the backup and restore of files

File and Directory management (1)

• The commands for managing files enable the user to copy, find, rename, move, and remove files
• The commands for managing directories allow us to change the working directory and to remove, create, and list the content of directories

File and Directory management (2)

rm      Remove a file or directory (with -r)
cp      Copy files
mv      Move or rename files and directories

cd      Change working directory
pwd     Print the pathname of the working directory
rmdir   Remove a directory
mkdir   Create a directory
ls      Show the directory content
find    Search for a file

The ls command

• The ls command provides information about the directory content

    ls [-ACFRabcdefglmnopqrstux] [name]

• For each directory that is specified in name, ls lists the content of that directory; for each file that is specified in name, ls repeats its name and the associated information (if any)
• By default, the output is in alphabetic order
• If no argument is specified, then the content of the working directory is listed

The cd command

• The cd command serves to change the working directory

    cd [directory]

• If a directory is specified as argument, then that directory becomes the working directory; if no argument is provided after cd, then the effect is to «move» the working point to the HOME DIRECTORY
• In order to execute successfully, the user who issues the cd command must have the «traverse permission» (x) on all the directories that appear in the pathname of the target directory, e.g.:
    cd /home/pippo
    cd /usr/local/bin

The pwd command

• The pwd command prints the name of the working directory

    $ pwd
    /home/marco/d1

The cp command

• The cp command generates copies of files

    cp <file1> <file2>

  copies file1 into file2 (even if the two files are already identical)

    cp <files> <directory>

  copies one or more files into an existing directory (a common error: if the directory doesn’t exist and a single file is given, cp creates instead a file named «directory»)
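The common error mentioned above is easy to reproduce. In this minimal sketch all names are hypothetical: backup is an existing directory, while bakcup is a deliberate misspelling that does not exist.

```shell
# Scratch area; "backup" exists, "bakcup" is a deliberate typo (all names are assumptions)
work=$(mktemp -d)
cd "$work"
echo "data" > file1
mkdir backup

cp file1 backup      # the directory exists: the copy becomes backup/file1
cp file1 bakcup      # no such directory: cp creates a regular file named bakcup

if [ -f backup/file1 ]; then copied_into_dir=yes; else copied_into_dir=no; fi
if [ -f bakcup ]; then stray_file=yes; else stray_file=no; fi
if [ -d bakcup ]; then stray_is_dir=yes; else stray_is_dir=no; fi

echo "copy inside backup: $copied_into_dir, stray file bakcup: $stray_file"
cd /
rm -rf "$work"
```

Writing the target with a trailing slash (for example backup/) is one way to make cp fail instead of silently creating a file when the directory is missing.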

The rm command

• The rm command removes files. It has the following syntax:

    rm [-fri] <filename>

• rm removes all the files that are indicated as command line arguments
• The option -i produces an interactive execution: the user is asked to answer yes/no for each file that is specified; it’s very useful when the filename is a pattern
• The option -f, on the contrary, forces the removal of the files without any further questions (use with caution!)
• The option -r recursively removes files and directories in the path that has been specified as filename

The mv command

• The mv command moves or renames files or directories
    mv <file1> <file2>
    mv <directory1> <directory2>
• mv renames file1 to file2 (or directory1 to directory2)
• If file2 already exists, it will be removed before file1 takes its place
    mv <file> <directory>
• In this latter situation the command moves file into directory

The mkdir command

• The mkdir command creates a directory
    mkdir <dirname>
• Note: when a directory is created, it always and automatically contains at least two elements (the directory “.”, alias of the directory itself, and the directory “..”, alias of the parent directory)

The rmdir command

• The rmdir command removes a directory
    rmdir <directory>
• To have a successful completion
  • The directory must be empty and must not be «in use» by another user (a different user may have set that directory as working directory)
  • The user who issues the command must have the permission to remove the directory («w» permission on its parent directory)
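mkdir and rmdir can be tried together in a scratch area. The sketch below (directory and file names are hypothetical) lists the two automatic entries of a new directory and shows that rmdir refuses a directory that still contains a file.

```shell
# Scratch area; d1 and d1/f are hypothetical names
work=$(mktemp -d)
cd "$work"
mkdir d1
entries=$(ls -a d1 | tr '\n' ' ')            # a new directory holds only "." and ".."

touch d1/f
nonempty_status=0
rmdir d1 2>/dev/null || nonempty_status=$?   # fails: d1 still contains f

rm d1/f
empty_status=0
rmdir d1 || empty_status=$?                  # succeeds once d1 is empty
if [ -e d1 ]; then still_there=yes; else still_there=no; fi

echo "entries: $entries| rmdir non-empty: $nonempty_status | rmdir empty: $empty_status"
cd /
rm -rf "$work"
```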

The find command

• The find command is useful to find files within the filesystem and to apply specific actions to each file found (the simplest one being just to print its filename)

    find <pathname-list> [options]

• E.g.: finding files named example (-name example) in the current directory and in its sub-directories; once found, print the pathname of each located file (-print)

    find . -name example -print

• Note: this is the simplest form of the find command. More complex actions can be specified in order to process the files found.
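The example can be run against a small scratch tree. In this sketch the directory layout is an assumption: two files named example sit at different depths, next to one file with another name.

```shell
# Scratch tree (layout is an assumption): two files named "example", one named "other"
work=$(mktemp -d)
mkdir -p "$work/a/b"
touch "$work/example" "$work/a/b/example" "$work/a/other"

cd "$work"
found=$(find . -name example -print | sort)      # one pathname per line, relative to "."
count=$(find . -name example -print | wc -l)

echo "$found"
cd /
rm -rf "$work"
```

Because the search starts from «.», the printed pathnames are relative to the working directory.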

Commands to check the status of the system

• Checking the system status is the first step to verify its health
  • who: list the names of the connected users
  • du: summarize the disk usage
  • ps: report process statistics
  • kill: terminate a process or send a signal to it
  • stty: set or get terminal options
  • id: print the id and groups of the current user
  • date: show or set the system date and time
  • mail: standard program to send/receive emails

The who command

• The who command lists the users connected to the system

    who [-uATHldtasqbrp] [file]

• who may list usernames, terminal line, login time, how long the user has been inactive, and the name of the last command issued by that user
• The arguments “am i” or “am I” show information about the current user
• With the appropriate options, who can list logins, logouts, reboots, system clock changes, etc.

The ps command (1)

• The ps command reports process statistics
    ps [options]
• ps (without options) prints information about the processes associated with the current terminal
  • In practice, the process status can change while ps itself is running; what is shown is a snapshot of the system during the command execution
  • Some data could be stale, since the related process could have already terminated

The ps command (2)

• This command prints the following data:
  • PID (process ID): a number between 1 and 32768
  • TT (terminal ID): a number which identifies the terminal
  • STAT (state): a string that specifies the process state
  • TIME: CPU time (minutes and seconds) used by the process so far
  • COMMAND: name of the command that is associated with this process (issued at the shell prompt); the option -w shows the whole command line
• “ID” in general denotes a numerical identifier, that is, a unique number that identifies a given object in the system

    $ ps
    PID   TT STAT TIME COMMAND
    24059 11 S    0:05 -sh (sh)
    24259 11 R    0:02 ps

• R is running, S is sleeping in an interruptible wait, D is waiting in uninterruptible disk sleep, Z is zombie, T is traced or stopped (on a signal), and W is paging

The kill command

• The kill command sends a signal to a process, possibly killing it

    kill [-signal] <processid>

• Upon reception of the signal the process may terminate (unless it is programmed to ignore such a signal or to handle it differently)
• A process can be terminated either by its owner or by the system administrator
• If the argument is a number preceded by the «-» character, then the corresponding signal is sent
    kill -9 <pid>    this signal can’t be ignored by the process
    kill -0 <pid>    is a non-kill: it only checks whether the process exists
    kill -15 <pid>   is a «polite kill»: the process is asked to terminate
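The signals listed above can be tried on a harmless process. In this sketch a background sleep is an assumed stand-in for a real long-running program, and the shell parameter $! supplies its PID; the exit status reported by wait for a signalled process is assumed to follow the usual sh/bash convention of being larger than 128.

```shell
sleep 30 &                 # harmless long-running process (assumed stand-in for a real job)
pid=$!                     # PID of the most recent background command

# Signal 0 is not delivered: it only checks that the process exists
if kill -0 "$pid" 2>/dev/null; then alive_before=yes; else alive_before=no; fi

kill -15 "$pid"            # polite termination request
term_status=0
wait "$pid" 2>/dev/null || term_status=$?    # collect the exit status of the terminated process

if kill -0 "$pid" 2>/dev/null; then alive_after=yes; else alive_after=no; fi

echo "before: $alive_before, wait status: $term_status, after: $alive_after"
```

If a process ignores signal 15, signal 9 remains as the last resort, at the price of giving the process no chance to clean up.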

The stty command

• The stty command sets or gets the terminal options (in particular with relation to the «standard input», i.e., the keyboard)

    stty [-a] [-g] [options]

• stty (without arguments) prints the main current terminal settings
• stty -a shows ALL current terminal settings

The du command

• The du command summarizes the disk space used (by default printed in kilobytes) by a given file or directory

    du [-afrsu] [name]

• du (without arguments) prints the disk usage of the current working directory
• Disk usage should be monitored: once the disk space is exhausted, the system can lock up

Combined Commands

[Section map: Shell, File System, Information Protection, Control Commands, Combined Commands, Internet, WEB/DB Server]

Standard input and Standard output

• Standard input = input of a command/process
• Standard output = output of a command/process
• The shell directs the standard output of a command to a file which represents the output of your terminal (or window)
  • e.g. /dev/tty0
• In the same way, the shell gets the standard input from the «keyboard» device

[Figure: a command with its input, output and error streams]

Redirecting Standard Output

• The output redirection character (>) instructs the shell to redirect the output of a command into a file instead of the standard output
    command [arguments] > filename
  • command is any executable program and filename is the name of an ordinary file to which the shell redirects the output
  • Redirect output with caution: if the file already exists, its content will be overwritten

[Figure: the output stream of the command goes to a File; input and error are unchanged]

Redirecting the Standard Input

• The input redirection character (<) instructs the shell to take the input of the command from the specified file instead of the standard input
    command [arguments] < filename

[Figure: the input stream of the command comes from a File; output and error are unchanged]

Appending Standard Output to a File

• The append redirection characters (>>) allow us to add the output produced by the given command to an existing file, without modifying any pre-existing information in that file
• This also provides a convenient way to concatenate two files into one (by using the cat command, which dumps the content of the listed files)
    $ cat pear >> fruit_list
    $ cat orange >> fruit_list
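The three redirections (>, >> and <) can be compared on the fruit_list example. This is a minimal sketch in a scratch directory; the contents of the files pear and orange, and the extra word apple, are assumptions.

```shell
# Scratch directory; the contents of pear and orange are assumptions
work=$(mktemp -d)
cd "$work"
echo "pear" > pear
echo "orange" > orange

cat pear > fruit_list            # ">" creates the file (or overwrites it)
cat orange >> fruit_list         # ">>" appends, keeping what is already there
appended=$(cat fruit_list)
line_count=$(wc -l < fruit_list) # "<" feeds the file to wc as its standard input

echo "apple" > fruit_list        # ">" again: the previous content is lost
overwritten=$(cat fruit_list)

echo "after append: $line_count lines; after overwrite: $overwritten"
cd /
rm -rf "$work"
```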

Pipelining commands

• In Linux, a pipeline is a sequence of one or more commands separated by a vertical bar (the | key); the idea is that the bar represents a «pipe» between the commands
  • The standard output of the command before the pipe «flows» into the standard input of the command after the pipe
• All the commands are launched as separate processes, and each one can send its output to the next command of the pipeline as soon as it is ready; control returns to the shell once the last command has finished
• Pipes reduce the need for intermediate files in complex processing

    command_a [arguments] | command_b [arguments]

Pipe syntax

    command_a [arguments] | command_b [arguments]
• This command uses a pipe to generate the same result as the following command lines:
    command_a [arguments] > temp
    command_b [arguments] < temp
    rm temp
• Other examples:
    $ ls | lpr
    $ ls | more
    $ gpg --sign msg.txt | mail [email protected]
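The equivalence between a pipe and a temporary file can be checked directly. In this sketch, sort and head are assumed stand-ins for command_a and command_b, and the input file fruits is hypothetical.

```shell
# Scratch directory; "fruits" is a hypothetical input file
work=$(mktemp -d)
cd "$work"
printf 'pear\napple\norange\n' > fruits

# With a pipe: the output of sort flows straight into head
with_pipe=$(sort fruits | head -n 1)

# Without a pipe: the same result through an intermediate file
sort fruits > temp
with_temp=$(head -n 1 < temp)
rm temp

echo "pipe: $with_pipe, temporary file: $with_temp"
cd /
rm -rf "$work"
```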

Filters

• A filter is a command which processes a stream of incoming data in order to produce a stream of output data
• Filters are ideally suited for pipelines

    $ who | sort | lpr

• This example shows the power of the shell combined with the versatility of the Linux utilities

The utility tee

• The utility tee, used in a pipeline, allows us to «snoop» into a file the content that is flowing from pipe to pipe

    $ who | tee who.out | grep chas
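The effect of tee is easier to see with a fixed input. In this sketch printf stands in for who, and the three login lines are made-up sample data.

```shell
# Scratch directory; the three "login" lines are made-up sample data standing in for who
work=$(mktemp -d)
cd "$work"

matched=$(printf 'alex tty1\nchas tty2\njenny tty3\n' | tee who.out | grep chas)
saved_lines=$(wc -l < who.out)     # tee stored the complete stream in who.out

echo "grep saw: $matched; who.out holds $saved_lines lines"
cd /
rm -rf "$work"
```

grep receives the whole stream and keeps one line, while who.out retains all three.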

Launching background commands (1)

• When you launch a command, the shell waits until it finishes before returning another prompt
  • In this case we say that the command runs in foreground
• It’s also possible to execute a command in background
  • In this way, you do not need to wait for the shell prompt, as it is returned immediately and you can launch another command
  • Typically a command is launched in background when we know that it will take a long time to produce its output and that it doesn’t require interaction
  • In other words: the terminal becomes immediately available to perform further work

Launching background commands (2)

• To launch a background command, we add the ampersand (&) character at the end of the command line (before pressing RETURN)
• The shell will print the process identification number (PID), which identifies the process just launched in background; then the shell immediately returns a new prompt (while the background process continues to run)

    $ ls -l | lpr &
    31725
    $
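Inside a script, the PID of a background command is not printed but is available in the shell parameter $!, and wait blocks until that process has finished. The sketch below uses a one-second sleep followed by a file write as an assumed stand-in for a slow job; the file name result is hypothetical.

```shell
# Scratch directory; the "slow job" is simulated with sleep (an assumption)
work=$(mktemp -d)

( sleep 1; echo "done" > "$work/result" ) &   # launched in background
bg_pid=$!                                     # its PID, as the interactive shell would print it

echo "background job $bg_pid started, the shell goes on immediately"

wait "$bg_pid"                                # block until the background job has finished
result=$(cat "$work/result")

echo "job finished with result: $result"
rm -rf "$work"
```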

Launching background commands (3)

• If a background process writes something to the standard output and you did not redirect it, that output will be mixed with what you are doing afterwards on the same terminal!
• If a background process requests something from the standard input and you did not redirect it, the shell automatically supplies an empty string as standard input for the background command (at least this is the behavior of the Bourne Shell; the C-Shell instead stops the background process until input is available)

Launching background commands (4)

• You cannot use CTRL-C to stop a background process: you must communicate with the background process through signals, i.e., through the «kill» command
• If you wish to interrupt all background processes, you can use the following command (note: there is no dash character before the number)

    $ kill 0

Launching background commands (5)

• As an alternative, the user can selectively use the kill command to stop some specific process
  • To retrieve the PIDs of the processes that are running in background, you can use the ps command
• You can also move the current foreground process to background (e.g., if we realize that it is taking a long time to complete): press CTRL-Z to suspend it, then type the command bg to let it continue in background
• To bring that process back to foreground you can type the command fg

Where to find more information

• http://www.linux.org (Linux)
• http://www.redhat.com (RedHat/Fedora)
• http://www.tldp.org/LDP/abs/html/ (Advanced Bash-Scripting Guide)
• Mark G. Sobell, A Practical Guide to Fedora and Red Hat Enterprise Linux, Sixth Edition, Prentice Hall, 2011

http://www.dii.unisi.it/~giorgi/teaching/hpca2

High Performance Computer Architecture

Superscalar Processors (first part)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ05-SL di 281

Superscalar Processors

• Introductory Concepts
• Superscalar Processors
  • Instruction Fetching
  • Dispatch, Issue
  • Register renaming
  • Speculative execution
  • Memory operation
• Case studies
  • MIPS R10000
  • DEC 21164
  • AMD K5
  • Intel P6

Superscalar Processors

• Performance limitations of pipelining:
  • Latch overheads & clock skew
  • “Atomic” instruction issue logic
    - (Flynn bottleneck: CPI ≥ 1)
• Improving performance by increasing the "width" of the processor: Instruction Level Parallelism (ILP)

[Figure: pipeline timing diagram over clock cycles 1-9; instructions i to i+7 each go through the stages IF, ID, EX, MEM, WB]
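A back-of-the-envelope calculation shows why widening the processor helps. The sketch below is an idealised model, not taken from the slides: it assumes no stalls or hazards, a 5-stage pipeline and 8 instructions as in the diagram, and an issue width of 2 chosen only for illustration.

```shell
# Idealised model: no stalls, no hazards, every stage takes one cycle (assumptions)
stages=5        # IF ID EX MEM WB, as in the diagram
n_instr=8       # instructions i .. i+7
width=2         # assumed issue width for the superscalar case

# Scalar pipeline: the first instruction needs `stages` cycles, each further one adds a cycle
scalar_cycles=$(( stages + n_instr - 1 ))

# Superscalar pipeline: up to `width` instructions enter per cycle (groups rounded up)
groups=$(( (n_instr + width - 1) / width ))
super_cycles=$(( stages + groups - 1 ))

echo "scalar: $scalar_cycles cycles, ${width}-wide: $super_cycles cycles"
```

For long instruction streams the scalar count approaches one cycle per instruction (CPI close to 1), while the idealised 2-wide count approaches half a cycle per instruction; real programs fall short of this because of dependencies and branches.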

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ05-SL di 283
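The benefit of a wider pipeline can be estimated with a back-of-the-envelope count. The sketch below is only an idealized model (no hazards, no stalls, perfect fetch); the function name and its parameters are illustrative and do not come from the slides.

```python
import math

def ideal_cycles(n_instr, stages=5, width=1):
    """Cycles needed by an ideal pipeline that starts `width` instructions per cycle.

    The first group needs `stages` cycles to drain; every further group
    finishes one cycle after the previous one.
    """
    groups = math.ceil(n_instr / width)
    return stages + groups - 1

# Eight instructions on the 5-stage pipeline (IF ID EX MEM WB)
scalar = ideal_cycles(8, stages=5, width=1)   # single issue
two_wide = ideal_cycles(8, stages=5, width=2) # two instructions per cycle
```

With a single-issue pipeline at most one instruction completes per cycle, so the steady-state CPI cannot drop below 1; only a width greater than 1 removes that bound in this model.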

Superscalar Processors

• Multiple issue of instructions in each clock cycle → superscalar
  • Often implies dynamic issue
  • But a "static issue" solution exists too (Alpha 21064)

• Superscalar evolution with time
  • One integer + one floating-point
  • Any two instructions
  • Any four instructions
  • Any n instructions?
    (n appears to have a limit at about 3-6)

The Big Picture

[Figure: block diagram with the labels: static program; instr. fetch & branch predict.; instr. dispatch; window of execution; instr. issue; instr. execution; instr. reorder & commit]

Characteristics of Superscalar Processors

• High performance instruction fetching
  • Fetching several instructions per cycle (multiple fetch)
  • Branch/Jump Prediction
• Dispatch and dynamic dependency resolution
  • Eliminating false dependencies (anti-, output)
  • Correct management of true dependencies, e.g. using Tomasulo-like tags
• Parallel and out-of-order instruction issue
• Speculative execution
• More resources that operate in parallel
  • +functional units, +paths, +register ports
• High performance memory systems
• Methods to manage precise exceptions

Generic Superscalar Processor

[Figure: block diagram with the labels: instr. cache; pre-decode; instr. buffer; decode, rename, & dispatch; instruction buffers (integer/address and floating pt.); functional units; functional units and data cache; memory interface; integer register file; floating pt. register file; re-order and commit]

Multiple Instruction Fetch (1)

• Goal: collect X instructions per cycle from the instruction cache (I$)
• Problem: due to, e.g., a branch, the X instructions can start at any point of the cache block
  - Cannot be solved simply by doubling the block size (the initial instruction can be at the end of the block)
  - Needs to be done by fetching two blocks AND aligning them properly

[Figure: the I$ (sets 0..1023) is split into two banks, even blocks (bank 0, sets 0..511) and odd blocks (bank 1, sets 0..511), with a +1 incrementer on the set address; each bank delivers X instructions, for 2X instructions in total. Example: direct-mapped cache with 1024 sets and 4-instruction blocks]

Multiple Instruction Fetch (2)

• Once the 2X instructions are fetched, align them properly

[Figure: the fetch address is split into the set address (e.g., 2 in the 1st case or 3 in the 2nd case) and the instruction within the set (0, 1, 2 or 3). In both cases the eight fetched instructions 0 1 2 3 4 5 6 7 enter an 8x4 align switch. 1st case (sets 2 and 3): the output is 0 1 2 3 or 1 2 3 4 or 2 3 4 5 or 3 4 5 6. 2nd case (sets 3 and 4): the output starts in the second group of four]

The 8x4 switch sends one of the 8 quartets (***) to the output by exploiting:
  - the two least significant address bits (**), i.e. the instruction within the block
  - bit 0 (*) of the set address

(***) Simplified scheme: the two bank outputs 0123 and 4567 feed the switch, which selects one of
  0123 1234 2345 3456 4567 5670 6701 7012
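The two-bank fetch and the 8x4 align switch can be sketched in a few lines. This is a simplified software model under stated assumptions, not the hardware of the slides: instructions are represented by their global index, the cache has 1024 sets of 4 instructions (as in the example), and the names make_icache and fetch_aligned are hypothetical.

```python
X = 4            # instructions per block (as in the slide example)
NUM_SETS = 1024  # sets of the direct-mapped I$

def make_icache():
    """Each set holds one block; an instruction is named by its global index."""
    return [[s * X + i for i in range(X)] for s in range(NUM_SETS)]

def fetch_aligned(icache, set_addr, offset):
    """Return X consecutive instructions starting at (set_addr, offset)."""
    bank0 = icache[0::2]  # even blocks
    bank1 = icache[1::2]  # odd blocks
    next_set = (set_addr + 1) % NUM_SETS  # the "+1" incrementer (wraps around)
    if set_addr % 2 == 0:
        even_set, odd_set = set_addr, next_set
    else:
        even_set, odd_set = next_set, set_addr
    # The switch sees bank 0 on inputs 0..3 and bank 1 on inputs 4..7
    lanes = bank0[even_set // 2] + bank1[odd_set // 2]
    # Bit 0 of the set address picks the starting bank,
    # the two low address bits pick the starting instruction
    start = X * (set_addr & 1) + offset
    return [lanes[(start + k) % (2 * X)] for k in range(X)]

icache = make_icache()
case1 = fetch_aligned(icache, 2, 1)  # starts in an even block
case2 = fetch_aligned(icache, 3, 2)  # starts in an odd block, wraps to bank 0
```

The wrap-around index `(start + k) % 8` is what produces the quartets 5670, 6701 and 7012 of the simplified scheme when the fetch starts in the odd bank.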

Branch/Jump Prediction

• Eliminates several control hazards
• Goal: no interruptions in the instruction flow
• Multiple predictions per cycle may be necessary
• Example

Static program:
  loop: r3 <- mem(r4+r2)
        r7 <- mem(r5+r2)
        r7 <- r7 * r3
        r1 <- r1 - 1
        mem(r6+r2) <- r7
        r2 <- r2 + 8
        P <- loop; r1!=0

Dynamic stream:
        r3 <- mem(r4+r2)
        r7 <- mem(r5+r2)
        r7 <- r7 * r3
        r1 <- r1 - 1
        mem(r6+r2) <- r7
        r2 <- r2 + 8
        P <- loop; r1!=0
        r3 <- mem(r4+r2)
        r7 <- mem(r5+r2)
        r7 <- r7 * r3
        r1 <- r1 - 1
        mem(r6+r2) <- r7
        r2 <- r2 + 8
        P <- loop; r1!=0
        ……

Instruction Window

• Can be realized:
  A) as a queue (FIFO)
  B) as a small memory
  - Trade-off between complexity and flexibility
• Moreover, it can be:
  C) partitioned according to the various types of functional units
     - In a manner similar to Tomasulo's reservation stations
  D) organized as a unified set of reservation stations

[Figure: three example organizations, labelled A+D, A+C and B+C; each receives instructions from instruction dispatch and feeds the integer, floating point and load/store units (or all units, for a unified window)]

In the following we consider the most general scheme, B+D

Dispatch and Issue

• Consider a generic implementation scheme
• At any time, the data can be:
  • in the registers
  • in the reservation stations
  • in the functional units
  • in the Re-Order Buffer (ROB)
• Renaming is used to keep track of produced values
  • Rx = logical registers, Px = physical registers (normally #Px > #Rx)

[Figure: generic scheme with registers, rename registers, reservation stations, functional units and reorder buffer]

Tomasulo implemented in this model

• #Physical registers == #logical registers (P == R)
• Reservation Stations
  • Partitioned according to the functional units
  • The access can be to any of them (random access)
    - Except in the case of load/store instructions
• No reorder buffer → imprecise interrupts

[Figure: registers (#Px == #Rx), reservation stations, functional units]

"Modern" Renaming

• Instruction Window (IW) instead of Tomasulo's reservation stations
  • Other names: "Instruction Buffer" (IB), "Instruction Queue" (IQ)
  • Only control info is stored in the IW
  • The ROB manages precise exceptions and (safe) branch recovery
• There are at least two renaming methods:
  1) Use locations in the ROB to store renamed values
     - Instead of storing them in the reservation stations
  2) Rename in the Register File
     - In this case we have: #physical registers > #logical registers
• First, we will consider method 2)

MIPS R10000 method

• #Physical registers > #logical registers (P > R)
• Data moves between registers and functional units only
• The ROB is used to manage the control (reservation bits)

[Figure: registers with register reservation bits, rename registers, reservation stations, functional units, reorder buffer]

Register Renaming (P registers > R registers)

• Key structures: Register Map + Free Pool
• Register Map (RM)
  • Specifies the association between each R register and a P register
  • The physical register number takes the place of the "tag"
  • The mapping changes over time
  • Converts the stream of instructions into a "single assignment" form (a register is written once and then read)
  • Avoids the false dependencies
• Free Pool (FP)
  • Holds the list of available P registers
• Operation: for each instruction
  1) For the source Rs: read the Ps corresponding to each Rs in the RM
  2) For the destination R: get a new P from the FP and associate it with the R
  3) Update the RM to reflect the new mapping
• Example: 24P and 8R (see next slides)
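The three renaming steps can be sketched in software. The model below is only an illustration: instructions are reduced to a destination plus a list of sources, the function name rename is hypothetical, and the initial Register Map and Free Pool are the ones used in the example of the next slides.

```python
from collections import deque

def rename(instrs, reg_map, free_pool):
    """Rename (dest, sources) pairs; dest is None for stores and branches."""
    renamed = []
    for dest, srcs in instrs:
        p_srcs = [reg_map[r] for r in srcs]  # step 1: read the sources in the RM
        p_dest = None
        if dest is not None:
            p_dest = free_pool.popleft()     # step 2: get a new P from the FP
            reg_map[dest] = p_dest           # step 3: update the RM
        renamed.append((p_dest, p_srcs))
    return renamed

# One iteration of the example loop
loop_body = [
    ("r3", ["r4", "r2"]),        # r3 <- mem(r4+r2)
    ("r7", ["r5", "r2"]),        # r7 <- mem(r5+r2)
    ("r7", ["r7", "r3"]),        # r7 <- r7 * r3
    ("r1", ["r1"]),              # r1 <- r1 - 1
    (None, ["r6", "r2", "r7"]),  # mem(r6+r2) <- r7
    ("r2", ["r2"]),              # r2 <- r2 + 8
    (None, ["r1"]),              # P <- loop; r1!=0
]
reg_map = {"r1": "p3", "r2": "p4", "r3": "p6", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p5"}
free_pool = deque(f"p{n}" for n in range(8, 25))  # p8 .. p24

renamed = rename(loop_body, reg_map, free_pool)
```

Note that the two writes to r7 receive different physical registers (p9 and p10), which is how the output dependence between them disappears.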

Example (cycle 0)

Fetched Stream (the loop body, fetched three times):
  r3 <- mem(r4+r2)
  r7 <- mem(r5+r2)
  r7 <- r7 * r3
  r1 <- r1 - 1
  mem(r6+r2) <- r7
  r2 <- r2 + 8
  P <- loop; r1!=0

Register Map: r1 → p3, r2 → p4, r3 → p6, r4 → p1, r5 → p2, r6 → p7, r7 → p5

Renamed Stream: (empty)

Free Pool: p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 0…)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p3, r2 → p4, r3 → p6, r4 → p1, r5 → p2, r6 → p7, r7 → p5

Renamed Stream (step 1: the sources r4 and r2 are looked up in the Register Map):
  r3 <- mem(p1+p4)

Free Pool: p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 0…)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p3, r2 → p4, r3 → p6, r4 → p1, r5 → p2, r6 → p7, r7 → p5

Renamed Stream (step 2: the destination r3 gets p8, the first register in the Free Pool):
  p8 <- mem(p1+p4)

Free Pool: p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 0…)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map (step 3: r3 now maps to p8): r1 → p3, r2 → p4, r3 → p8, r4 → p1, r5 → p2, r6 → p7, r7 → p5

Renamed Stream:
  p8 <- mem(p1+p4)

Free Pool: p9, p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 0…)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p3, r2 → p4, r3 → p8, r4 → p1, r5 → p2, r6 → p7, r7 → p9

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)

Free Pool: p10, p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 0…)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p3, r2 → p4, r3 → p8, r4 → p1, r5 → p2, r6 → p7, r7 → p10

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8

Free Pool: p11, p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 1)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p11, r2 → p4, r3 → p8, r4 → p1, r5 → p2, r6 → p7, r7 → p10

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1

Free Pool: p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 2)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p11, r2 → p12, r3 → p8, r4 → p1, r5 → p2, r6 → p7, r7 → p10

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0

Free Pool: p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 3)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p16, r2 → p12, r3 → p13, r4 → p1, r5 → p2, r6 → p7, r7 → p15

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0
  p13 <- mem(p1+p12)
  p14 <- mem(p2+p12)
  p15 <- p14 * p13
  p16 <- p11 - 1

Free Pool: p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 4)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p16, r2 → p17, r3 → p13, r4 → p1, r5 → p2, r6 → p7, r7 → p15

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0
  p13 <- mem(p1+p12)
  p14 <- mem(p2+p12)
  p15 <- p14 * p13
  p16 <- p11 - 1
  mem(p7+p12) <- p15
  p17 <- p12 + 8
  P <- loop; p16!=0

Free Pool: p18, p19, p20, p21, p22, p23, p24

Example (cycle 5)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p21, r2 → p17, r3 → p18, r4 → p1, r5 → p2, r6 → p7, r7 → p20

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0
  p13 <- mem(p1+p12)
  p14 <- mem(p2+p12)
  p15 <- p14 * p13
  p16 <- p11 - 1
  mem(p7+p12) <- p15
  p17 <- p12 + 8
  P <- loop; p16!=0
  p18 <- mem(p1+p17)
  p19 <- mem(p2+p17)
  p20 <- p19 * p18
  p21 <- p16 - 1

Free Pool: p22, p23, p24

Example (cycle 6)

Fetched Stream: unchanged (the seven-instruction loop body, fetched three times)

Register Map: r1 → p21, r2 → p22, r3 → p18, r4 → p1, r5 → p2, r6 → p7, r7 → p20

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0
  p13 <- mem(p1+p12)
  p14 <- mem(p2+p12)
  p15 <- p14 * p13
  p16 <- p11 - 1
  mem(p7+p12) <- p15
  p17 <- p12 + 8
  P <- loop; p16!=0
  p18 <- mem(p1+p17)
  p19 <- mem(p2+p17)
  p20 <- p19 * p18
  p21 <- p16 - 1
  mem(p7+p17) <- p20
  p22 <- p17 + 8
  P <- loop; p21!=0

Free Pool: p23, p24

http://www.dii.unisi.it/~giorgi/teaching/hpca2
High Performance Computer Architecture

Superscalar Processors (second part)

Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ06 - slide 1 of 95

Superscalar Pipeline

[Figure: superscalar pipeline with the labels: I-cache access; DECODE + RENAME; DISPATCH; Window; ISSUE; Regs; functional units (two Integer units, Mult. 1 to Mult. 4, Address Add followed by D-Cache 1 and D-Cache 2, with LQ and SQ); COMPLETE or WRITE-BACK]

Superscalar Processing

• 5 main stages
  • Rename
  • Dispatch
  • Issue
  • Complete or write-back
  • Commit or retire

[Figure: superscalar pipeline with the labels: I-cache access; Decode; RENAME; DISPATCH; Window; ISSUE; Regs; functional units (two Integer units, Mult. 1 to Mult. 4, Address Add followed by D-Cache 1 and D-Cache 2, with LQ and SQ); WRITE-BACK]

Superscalar Processing: RENAME

• This is done while the instructions are still in order
• Registers are renamed using the Register Map
• The ROB slots are reserved now (to be discussed later)

[Figure: superscalar pipeline diagram]

Superscalar Processing: DISPATCH

• Check availability of Instruction Window slots
  - If there are none, there is a stall due to a structural hazard
  - If there are enough free ones, dispatch
• Copy the renamed instructions into the IW
• Copy the register reservation status bits into the IW

[Figure: superscalar pipeline diagram]

Fields of an IW-SLOT

Field    Width         Meaning
Busy     1 bit         The element is not available
Op       e.g. 6 bits   Opcode
Pi       6 bits        Destination register designator associated with this instruction at renaming time
Pj, Pk   6 bits each   Source register designators associated with this instruction at renaming time
Qj, Qk   1 bit each    Tags for source register reservation (if not zero, the instruction is waiting for the corresponding Pj or Pk)
I        16 bits       Immediate value (if specified by the instruction)
ROB#     e.g. 8 bits   Index of the ROB entry associated with this instruction at renaming time

IW-SLOT layout: Busy | Op | Pi | Pj | Pk | Qj | Qk | I | ROB#

Note: a window element contains only control information - no data (except 1 immediate value)

Superscalar Processing: ISSUE

• Wake-up:
  • Occurs when the tags (Qj, Qk) are both zero (the instruction has all the necessary data, as in Tomasulo)
• Select:
  • The issue logic maps the requested resources to those available
  • The selected instructions read their values from the physical registers and send them to the functional units for execution

[Figure: superscalar pipeline diagram]

Note: the tag matching logic is not represented in the figure; only the data paths are represented. Tag matching is performed by sending tags to the Instruction Window and Registers.
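Wake-up and select can be illustrated with a small software model. Everything below is a sketch under assumptions that are not in the slides: the class IWSlot keeps only a few of the slot fields, the opcode string doubles as the functional unit type, selection is oldest-first, and the number of free units per type is passed in as a dictionary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IWSlot:
    op: str             # opcode; here it also names the functional unit type
    pi: Optional[int]   # destination physical register (None if no destination)
    pj: Optional[int]   # first source physical register
    pk: Optional[int]   # second source physical register
    qj: int = 0         # 1 = still waiting for pj
    qk: int = 0         # 1 = still waiting for pk
    busy: bool = True   # slot in use (not yet issued)

def wake_up(window, produced_p):
    """Tag matching: clear Qj/Qk of every slot waiting for produced_p."""
    for slot in window:
        if slot.busy and slot.pj == produced_p:
            slot.qj = 0
        if slot.busy and slot.pk == produced_p:
            slot.qk = 0

def select(window, free_units):
    """Issue ready slots, oldest first, while a unit of the right type is free."""
    issued = []
    for slot in window:
        ready = slot.busy and slot.qj == 0 and slot.qk == 0
        if ready and free_units.get(slot.op, 0) > 0:
            free_units[slot.op] -= 1
            slot.busy = False
            issued.append(slot)
    return issued

# First four renamed instructions of the example loop
window = [
    IWSlot("load", 8, 1, 4),               # p8  <- mem(p1+p4)
    IWSlot("load", 9, 2, 4),               # p9  <- mem(p2+p4)
    IWSlot("mul", 10, 9, 8, qj=1, qk=1),   # p10 <- p9 * p8 (waits for both loads)
    IWSlot("int", 11, 3, None),            # p11 <- p3 - 1
]
first = select(window, {"load": 1, "mul": 1, "int": 2})
wake_up(window, 8)                          # first load completes
second = select(window, {"load": 1, "mul": 1, "int": 2})
wake_up(window, 9)                          # second load completes
third = select(window, {"load": 1, "mul": 1, "int": 2})
```

In this run the subtraction issues ahead of the second load and of the multiply, which is the out-of-order issue the slide describes; the multiply leaves the window only after both of its tags have been cleared.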

Superscalar Processing: COMPLETE

• Write-back of the results into the registers
• The Pj / Pk fields of the IW indicate which elements will receive the results too
• The Qj / Qk are zeroed according to the received values
• If there has been an exception, it is annotated in the ROB (using an "Exc flag")
• Finally, the ROB entry is marked as completed (but not deleted yet)

[Figure: superscalar pipeline diagram]

Fields of a ROB-SLOT

Field    Width          Meaning
Busy     1 bit          The element is not available
PC       e.g. 32 bits   Program Counter of the instruction
Ri       6 bits         Logical destination register associated with this slot at renaming time
Pi,old   6 bits         Previous Pi in the Register Map when Ri was renamed
St       1 bit          Indicates that this instruction was a STORE
Exc      (not shown)    Indicates that this instruction got an exception
Cplt     1 bit          Flag to indicate that the instruction has completed

ROB-SLOT layout: Busy | PC | Ri | Pi,old | St | Exc | Cplt

Note: The ROB is managed as a CIRCULAR QUEUE

Superscalar Processing: COMMIT

• New phase that handles the reordering of instructions
• Any instruction that:
  - i) is at the top of the ROB
  - ii) AND is marked as COMPLETED
  is then declared COMMITTED
• More than one instruction can commit in the same cycle (e.g., 4)
• If there was no exception
  • The committed ROB entries can be freed
  • The produced value in the Pi can be copied into the Ri
  • The Ri can be remapped to its "old Pi" in the RM
• However, if there has been an exception, we must step back!
  • The entry marked with the Exc flag has to be eliminated
  • And the same happens to all the instructions that follow it (not yet committed) in the ROB
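The in-order commit rule can be sketched as follows. This is a minimal model, assuming a ROB represented as a double-ended queue with the oldest entry on the left; ROBEntry keeps only the fields needed here and the function name commit is hypothetical. The register updates performed at commit are left out.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ROBEntry:
    pc: int
    cplt: bool = False  # Cplt flag: the instruction has completed
    exc: bool = False   # Exc flag: the instruction raised an exception

def commit(rob, width=4):
    """Retire up to `width` completed, exception-free entries from the ROB head."""
    committed = []
    while rob and len(committed) < width:
        head = rob[0]
        if not head.cplt or head.exc:
            break  # in-order commit: a blocked head blocks everything behind it
        committed.append(rob.popleft())
    return committed

rob = deque([ROBEntry(pc=0, cplt=True),
             ROBEntry(pc=4),               # still executing
             ROBEntry(pc=8, cplt=True)])   # completed out of order
first = commit(rob)    # only the entry at pc=0 can leave
rob[0].cplt = True     # the instruction at pc=4 completes
second = commit(rob)   # now pc=4 and pc=8 commit together
```

The entry at pc=8 completes early but has to wait in the ROB, which is what restores program order.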

Managing an exception

• Activate the exception bit in the Exception Register
• Stop issue and commit
• Entry by entry: use the old-Pi value in the ROB to restore the Register Map
  • The original situation, BEFORE the time when the instruction that caused the exception was dispatched, is restored
• The EXCEPTION HANDLER is then launched
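The entry-by-entry restore can be sketched by walking the ROB from the youngest entry back to the faulting one. The code is an illustrative model: the function name recover is hypothetical, and returning the squashed physical registers to the Free Pool is an added assumption that the slides do not spell out. The data reproduce the first loop iteration of the renaming example, with an exception assumed on the multiply.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    ri: Optional[str]      # logical destination (None for stores and branches)
    pi_old: Optional[str]  # mapping of ri before this instruction was renamed
    exc: bool = False

def recover(rob, reg_map, free_pool):
    """Undo every rename from the youngest entry down to the oldest faulting one."""
    fault = next((i for i, e in enumerate(rob) if e.exc), None)
    if fault is None:
        return
    while len(rob) > fault:
        entry = rob.pop()  # youngest entry first
        if entry.ri is not None:
            # Walking youngest-first, the current mapping of ri is the register
            # this entry allocated: give it back (assumption, see lead-in)
            free_pool.appendleft(reg_map[entry.ri])
            reg_map[entry.ri] = entry.pi_old

rob = deque([
    ROBEntry("r3", "p6"),            # p8  <- mem(p1+p4)
    ROBEntry("r7", "p5"),            # p9  <- mem(p2+p4)
    ROBEntry("r7", "p9", exc=True),  # p10 <- p9 * p8   (assumed to fault)
    ROBEntry("r1", "p3"),            # p11 <- p3 - 1
    ROBEntry(None, None),            # mem(p7+p4) <- p10
    ROBEntry("r2", "p4"),            # p12 <- p4 + 8
    ROBEntry(None, None),            # P <- loop; p11!=0
])
reg_map = {"r1": "p11", "r2": "p12", "r3": "p8", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p10"}
free_pool = deque(f"p{n}" for n in range(13, 25))

recover(rob, reg_map, free_pool)
```

After the call the Register Map is the one that was in force just before the multiply was dispatched, and the two loads ahead of it are still in the ROB, free to commit.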

Superscalar Pipeline

[Figure: complete superscalar pipeline with the labels: I-cache access; DECODE + RENAME (Register Map, Free Pool); DISPATCH; Instruction Window; ISSUE; Physical regs; functional units (two Integer units, Mult. 1 to Mult. 4, Address Add followed by D-Cache 1 and D-Cache 2, with LQ and SQ); COMPLETE or WRITE-BACK; Re-Order Buffer; COMMIT; (logical) Regs]

Sizes: R: 8, P: 24, IW: 32, ROB: 36, LQ: 3, SQ: 3

Example (cycle 0)

Renamed Stream:
  p8 <- mem(p1+p4)
  p9 <- mem(p2+p4)
  p10 <- p9 * p8
  p11 <- p3 - 1
  mem(p7+p4) <- p10
  p12 <- p4 + 8
  P <- loop; p11!=0
  p13 <- mem(p1+p12)
  p14 <- mem(p2+p12)
  p15 <- p14 * p13
  p16 <- p11 - 1
  mem(p7+p12) <- p15
  p17 <- p12 + 8
  P <- loop; p16!=0
  p18 <- mem(p1+p17)
  p19 <- mem(p2+p17)
  p20 <- p19 * p18
  p21 <- p16 - 1
  mem(p7+p17) <- p20
  p22 <- p17 + 8
  P <- loop; p21!=0

[Figure: superscalar pipeline snapshot at cycle 0]

track R7: P5

Example (cycle 1)

Renamed Stream: unchanged (the 21 renamed instructions of the three loop iterations)

[Figure: superscalar pipeline snapshot at cycle 1]

track R7: P5; P9, P10

Example (cycle 2)

Renamed Stream: unchanged (the 21 renamed instructions of the three loop iterations)

[Figure: superscalar pipeline snapshot at cycle 2]

track R7: P5; P9, P10

Example (cycle 3)

Renamed Stream: unchanged (the 21 renamed instructions of the three loop iterations)

[Figure: superscalar pipeline snapshot at cycle 3]

track R7: P5; P9, P10; P14, P15

Legend: * = in flight; instruction states: Issued, Completed, eXecution, Committed

Example (cycle 4)

Renamed Stream: unchanged (the 21 renamed instructions of the three loop iterations)

[Figure: superscalar pipeline snapshot at cycle 4]

track R7: P5; P9, P10; P14, P15

Legend: * = in flight; instruction states: Issued, Completed, eXecution, Committed

Example (cycle 5)

[Pipeline diagram, cycle 5. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P5, P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 6)

[Pipeline diagram, cycle 6. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P5, P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 7)

[Pipeline diagram, cycle 7. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 8)

[Pipeline diagram, cycle 8. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 9)

[Pipeline diagram, cycle 9. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 10)

[Pipeline diagram, cycle 10. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 11)

[Pipeline diagram, cycle 11. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P9, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 12)

[Pipeline diagram, cycle 12. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P10, P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 13)

[Pipeline diagram, cycle 13. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P15, P14, P20, P19, P10. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 14)

[Pipeline diagram, cycle 14. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P15, P14, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 15)

[Pipeline diagram, cycle 15. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P15, P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 16)

[Pipeline diagram, cycle 16. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P20, P19, P15. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 17)

[Pipeline diagram, cycle 17. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P20, P19. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 18)

[Pipeline diagram, cycle 18. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P20. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 19)

[Pipeline diagram, cycle 19. Stage labels: I-cache access, Decode, Regs, Dispatch, Issue, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: P20. Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example (cycle 20)

[Pipeline diagram, cycle 20. Stage labels: I-cache access, Decode, Regs, Window, LQ, SQ, Integer, Mult. 1-4, Address Add, D-Cache 1-2, Write-back. track R7: (none). Legend: * == in flight; Issued, Completed, eXecution, Committed.]

Example -- Summary

Renamed Stream         dispatch  issue  complete
p8  <- mem(p1+p4)          2       3       6
p9  <- mem(p2+p4)          2       4       7
p10 <- p9 * p8             2       7      12
p11 <- p3 - 1              2       3       4
mem(p7+p4) <- p10          3       5      14
p12 <- p4 + 8              3       4       5
P <- loop; p11!=0          3       4       5
p13 <- mem(p1+p12)         4       6       9
p14 <- mem(p2+p12)         4       7      10
p15 <- p14 * p13           4      10      15
p16 <- p11 - 1             4       5       6
mem(p7+p12) <- p15         5       8      17
p17 <- p12 + 8             5       6       7
P <- loop; p16!=0          5       6       7
p18 <- mem(p1+p17)         6       9      12
p19 <- mem(p2+p17)         6      10      13
p20 <- p19 * p18           6      13      18
p21 <- p16 - 1             6       7       8
mem(p7+p17) <- p20         7      11      20
p22 <- p17 + 8             7       8       9
P <- loop; p21!=0          7       8       9

Performance: 1 iteration every 3 cycles

Focusing on one iteration:

IPC_limit = N_INST / Cycles = 7 / 3 ≈ 2.3
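The steady-state figure can be checked directly from the "complete" column of the summary table. The following is a minimal sketch, not part of the original slides: the names COMPLETE, iteration_interval and ipc_limit are chosen here for illustration, and the only data used are the completion cycles listed above.

```python
# Completion cycles copied from the summary table: one row per loop
# iteration, seven instructions per iteration, in program order.
COMPLETE = [
    [6, 7, 12, 4, 14, 5, 5],     # first iteration
    [9, 10, 15, 6, 17, 7, 7],    # second iteration
    [12, 13, 18, 8, 20, 9, 9],   # third iteration
]

def iteration_interval(complete):
    """Cycles between the last completion of consecutive iterations."""
    ends = [max(row) for row in complete]
    gaps = {later - earlier for earlier, later in zip(ends, ends[1:])}
    if len(gaps) != 1:
        raise ValueError("iterations do not finish at a constant interval")
    return gaps.pop()

def ipc_limit(complete):
    """Instructions per iteration divided by cycles per iteration."""
    return len(complete[0]) / iteration_interval(complete)

interval = iteration_interval(COMPLETE)
ipc = ipc_limit(COMPLETE)
print(f"1 iteration every {interval} cycles, IPC limit = {ipc:.1f}")
```

The last completion of each iteration is the store (cycles 14, 17, 20), which is why the interval is 3 cycles even though most instructions finish much earlier.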

Superscalar Processing: ROB

• ROB operation
  • Manage instruction termination (in-order commit)
  • Free the physical registers that are no longer needed
  • Recover the state to support precise exceptions
  • Recover safely from a mispredicted branch

• ROB entry fields
  • PC      Program Counter
  • Ri      Number of the destination (logical) register
  • Pi,old  Number of the previous mapping of Ri in the RM (Register Map)
  • St      Store flag (the instruction performs a store)
  • Exc     Exception flag (the instruction caused an exception)
  • Cplt    Complete flag (the instruction has completed)

• The ROB is managed as a circular buffer
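The entry fields and the circular-buffer management listed above can be sketched as follows. This is an illustrative model only: the class and method names (RobEntry, ReorderBuffer, dispatch, complete, commit) are assumptions made here, and the sketch treats commit and the release of the ROB slot as a single in-order step, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RobEntry:
    pc: int
    ri: Optional[str]        # destination logical register (None for store/branch)
    pi_old: Optional[str]    # previous physical mapping of ri
    st: bool = False         # store flag
    exc: bool = False        # exception flag
    cplt: bool = False       # complete flag

class ReorderBuffer:
    def __init__(self, size: int):
        self.slots: List[Optional[RobEntry]] = [None] * size
        self.head = 0    # oldest entry (next to commit)
        self.tail = 0    # next free slot
        self.count = 0

    def dispatch(self, entry: RobEntry) -> int:
        """Allocate the next slot in program order; return its index."""
        if self.count == len(self.slots):
            raise RuntimeError("ROB full: dispatch must stall")
        idx = self.tail
        self.slots[idx] = entry
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return idx

    def complete(self, idx: int) -> None:
        self.slots[idx].cplt = True

    def commit(self) -> List[RobEntry]:
        """Retire entries from the head while they are complete and fault-free."""
        retired = []
        while self.count:
            entry = self.slots[self.head]
            if not entry.cplt or entry.exc:
                break
            retired.append(entry)
            self.slots[self.head] = None
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        return retired

# First four instructions of the example (PC 0, 4, 8, C).
rob = ReorderBuffer(4)
ids = [rob.dispatch(RobEntry(pc, ri, old))
       for pc, ri, old in [(0x0, "r3", "p6"), (0x4, "r7", "p5"),
                           (0x8, "r7", "p9"), (0xC, "r1", "p3")]]
rob.complete(ids[3])                 # PC C finishes first ...
blocked = rob.commit()               # ... but cannot retire ahead of PC 0
rob.complete(ids[0])
retired = rob.commit()               # only the head entry (PC 0) retires
reused = rob.dispatch(RobEntry(0x10, None, None, st=True))  # wraps to slot 0
```

Keeping head and tail as indices modulo the buffer size is what makes the structure circular: the slot released by the oldest instruction is the next one handed out at dispatch.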

Memory Operations

• We need separate Issue Buffers for Loads and Stores
  • The Effective Addresses (EAs) must be calculated IN-ORDER to avoid memory-based hazards
  • The load/store issue is therefore FIFO

• As "usual":
  • The store is queued and waits for the data to be written to the EA

• For Stores, the St flag is set to 1 in the ROB

• Advanced solutions allow a load to bypass a store (before the store has completed its EA calculation)
  • In that case the ROB can be used to step back if a conflict is detected
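The conflict that the last bullet refers to can be illustrated with a small check. This is a sketch under stated assumptions, not the mechanism of any specific processor: MemOp, bypass_conflicts and the addresses used below are hypothetical, and each memory operation is assumed to record the cycle in which its EA was computed.

```python
from typing import List, NamedTuple

class MemOp(NamedTuple):
    rob: int        # ROB entry number, increasing in program order
    kind: str       # "ld" or "st"
    ea: int         # effective address
    ea_cycle: int   # cycle in which the EA was computed

def bypass_conflicts(ops: List[MemOp]) -> List[int]:
    """ROB numbers of loads that ran ahead of an older store to the same EA."""
    conflicts = []
    for i, op in enumerate(ops):
        if op.kind != "ld":
            continue
        for older in ops[:i]:
            same_address = older.kind == "st" and older.ea == op.ea
            if same_address and older.ea_cycle > op.ea_cycle:
                conflicts.append(op.rob)   # the load read stale data
                break
    return conflicts

# In-order EA calculation: every older store resolves first, so no conflict.
in_order = [MemOp(5, "st", 0x100, 5), MemOp(8, "ld", 0x100, 6)]
# Load 15 bypassed store 12, which later turned out to use the same address.
bypassed = [MemOp(12, "st", 0x108, 8), MemOp(15, "ld", 0x108, 7)]

ok = bypass_conflicts(in_order)
bad = bypass_conflicts(bypassed)
```

A non-empty result names the oldest point from which the ROB would have to step back; with strictly FIFO EA calculation the list is always empty, which is the cost-free but slower option described first.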

Example (cycle 2): Dispatch of instr. 0, 4, 8, C

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ld  p8   p1   p4   0   0   -   1
ld  p9   p2   p4   0   0   -   2
ml  p10  p9   p8   1   1   -   3
sb  p11  p3   -    0   0   1   4

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0   -    0
2      4   r7  p5      0   -    0
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    0

Register MAP
r1  p11  (was p3)
r2  p4
r3  p8   (was p6)
r4  p1
r5  p2
r6  p7
r7  p10  (was p9, p5)

Free Pool: p12, p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

PC     Renamed Stream
0      p8  <- mem(p1+p4)
4      p9  <- mem(p2+p4)
8      p10 <- p9 * p8
C      p11 <- p3 - 1
10     mem(p7+p4) <- p10
14     p12 <- p4 + 8
18     P <- loop; p11!=0
0''    p13 <- mem(p1+p12)
4''    p14 <- mem(p2+p12)
8''    p15 <- p14 * p13
C''    p16 <- p11 - 1
10''   mem(p7+p12) <- p15
14''   p17 <- p12 + 8
18''   P <- loop; p16!=0
0'''   p18 <- mem(p1+p17)
4'''   p19 <- mem(p2+p17)
8'''   p20 <- p19 * p18
C'''   p21 <- p16 - 1
10'''  mem(p7+p17) <- p20
14'''  p22 <- p17 + 8
18'''  P <- loop; p21!=0

Example (cycle 3)

Dispatch 10, 14, 18; Issue 0, C

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ld  p8   p1   p4   0   0   -   1
ld  p9   p2   p4   0   0   -   2
ml  p10  p9   p8   1   1   -   3
sb  p11  p3   -    0   0   1   4
st  p10  p7   p4   0   0   -   5
ad  p12  p4   -    0   0   8   6
br  -    p11  -    1   0   0   7

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0   -    0
2      4   r7  p5      0   -    0
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    0
5      10  -   -       1   -    0
6      14  r2  p4      0   -    0
7      18  -   -       0   -    0

Register MAP
r1  p11
r2  p12  (was p4)
r3  p8
r4  p1
r5  p2
r6  p7
r7  p10

Free Pool: p13, p14, p15, p16, p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 4)

Dispatch 0'', 4'', 8'', C''; Issue 4, 14, 18; Complete C

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ld  p9   p2   p4   0   0   -   2
ml  p10  p9   p8   1   1   -   3
st  p10  p7   p4   0   0   -   5
ad  p12  p4   -    0   0   8   6
br  -    p11  -    0   0   0   7
ld  p13  p1   p12  0   1   -   8
ld  p14  p2   p12  0   1   -   9
ml  p15  p14  p13  1   1   -   10
sb  p16  p11  -    0   0   1   11

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0   -    0
2      4   r7  p5      0   -    0
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    0
7      18  -   -       0   -    0
8      0   r3  p8      0   -    0
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    0

Register MAP
r1  p16  (was p11)
r2  p12
r3  p13  (was p8)
r4  p1
r5  p2
r6  p7
r7  p15  (was p14, p10)

Free Pool: p17, p18, p19, p20, p21, p22, p23, p24

Example (cycle 5)

Dispatch 10'', 14'', 18''; Issue 10, C''; Complete 14, 18; Commit C

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p10  p9   p8   1   1   -   3
st  p10  p7   p4   0   0   -   5
ld  p13  p1   p12  0   0   -   8
ld  p14  p2   p12  0   0   -   9
ml  p15  p14  p13  1   1   -   10
sb  p16  p11  -    0   0   1   11
st  p15  p7   p12  0   0   -   12
ad  p17  p12  -    0   0   8   13
br  -    p16  -    0   0   0   14

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0   -    0
2      4   r7  p5      0   -    0
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    0
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    0
12     10  -   -       1   -    0
13     14  r2  p12     0   -    0
14     18  -   -       0   -    0

Register MAP
r1  p16
r2  p17  (was p12)
r3  p13
r4  p1
r5  p2
r6  p7
r7  p15

Free Pool: p18, p19, p20, p21, p22, p23, p24

Getting back the Physical Registers (into the Free Pool)

• Ignoring exceptions, a physical register can be freed once its last read has completed

• However, to support precise exceptions, a physical register can be freed only after the corresponding logical register has been updated, i.e. after the instruction that replaced the mapping has committed

• Therefore Pi,old identifies the physical register that can be freed once its ROB entry has reached commit
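The rule above can be traced on the first loop iteration of the example. The sketch below is illustrative: the Renamer class and its method names are assumptions made here, and the initial Register Map and Free Pool (p8 to p24) are inferred from the "was" values and the Free Pool shown on the cycle 2 slide.

```python
from collections import deque

class Renamer:
    def __init__(self, reg_map, free_pool):
        self.reg_map = dict(reg_map)        # logical -> physical register
        self.free_pool = deque(free_pool)   # physical registers available

    def rename_dest(self, ri):
        """Map ri to a fresh physical register; return (pi_new, pi_old)."""
        pi_old = self.reg_map[ri]
        pi_new = self.free_pool.popleft()
        self.reg_map[ri] = pi_new
        return pi_new, pi_old               # pi_old is stored in the ROB entry

    def commit(self, pi_old):
        """At commit the previous mapping is no longer needed: free it."""
        if pi_old is not None:
            self.free_pool.append(pi_old)

renamer = Renamer(
    {"r1": "p3", "r2": "p4", "r3": "p6", "r4": "p1",
     "r5": "p2", "r6": "p7", "r7": "p5"},
    [f"p{i}" for i in range(8, 25)],
)

# Destination registers of the first iteration (PC 0, 4, 8, C, 14);
# the store (PC 10) and the branch (PC 18) write no register.
renamed = [renamer.rename_dest(ri) for ri in ["r3", "r7", "r7", "r1", "r2"]]

# When the ROB entry of PC 0 commits, its Pi,old (p6) returns to the pool.
renamer.commit(renamed[0][1])
```

After the five renames the map holds r1 p11, r2 p12, r3 p8, r7 p10 and the pool starts at p13, which matches the state shown on the cycle 3 slide; p6 re-enters the pool only at commit, never at the moment r3 is remapped.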

Example (cycle 6)

Dispatch 0''', 4''', 8''', C'''; Issue 0'', 14'', 18''; Complete 0, C''; Commit 14, 18

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p10  p9   p8   1   0   -   3
ld  p13  p1   p12  0   0   -   8
ld  p14  p2   p12  0   0   -   9
ml  p15  p14  p13  1   1   -   10
st  p15  p7   p12  0   0   -   12
ad  p17  p12  -    0   0   8   13
br  -    p16  -    0   0   0   14
ld  p18  p1   p17  0   1   -   15
ld  p19  p2   p17  0   1   -   16
ml  p20  p19  p18  1   1   -   17
sb  p21  p16  -    0   0   1   18

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
1      0   r3  p6      0   -    1
2      4   r7  p5      0   -    0
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    0
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    0
14     18  -   -       0   -    0
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    0

Register MAP
r1  p21  (was p16)
r2  p17
r3  p18  (was p13)
r4  p1
r5  p2
r6  p7
r7  p20  (was p19, p15)

Free Pool: p22, p23, p24

Example (cycle 7)

Dispatch 10''', 14''', 18'''; Issue 8, 4'', C'''; Complete 4, 14'', 18''; Commit 0, C''; ROB free 0

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p10  p9   p8   0   0   -   3
ld  p14  p2   p12  0   0   -   9
ml  p15  p14  p13  1   1   -   10
st  p15  p7   p12  0   0   -   12
ld  p18  p1   p17  0   0   -   15
ld  p19  p2   p17  0   0   -   16
ml  p20  p19  p18  1   1   -   17
sb  p21  p16  -    0   0   1   18
st  p20  p7   p17  0   0   -   19
ad  p22  p17  -    0   0   8   20
br  -    p21  -    1   0   0   21

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
2      4   r7  p5      0   -    1
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    0
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    0
19     10  -   -       1   -    0
20     14  r2  p17     0   -    0
21     18  -   -       0   -    0

Register MAP
r1  p21
r2  p22  (was p17)
r3  p18
r4  p1
r5  p2
r6  p7
r7  p20

Free Pool: p23, p24, p6

Example (cycle 8)

Issue 10'', 14''', 18'''; Complete C'''; Commit 4, 14'', 18''; ROB free 4

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p15  p14  p13  1   1   -   10
st  p15  p7   p12  0   0   -   12
ld  p18  p1   p17  0   0   -   15
ld  p19  p2   p17  0   0   -   16
ml  p20  p19  p18  1   1   -   17
st  p20  p7   p17  0   0   -   19
ad  p22  p17  -    0   0   8   20
br  -    p21  -    0   0   0   21

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    0
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    0
20     14  r2  p17     0   -    0
21     18  -   -       0   -    0

Register MAP
r1  p21
r2  p22
r3  p18
r4  p1
r5  p2
r6  p7
r7  p20

Free Pool: p23, p24, p6, p5

Example (cycle 9)

Issue 0'''; Complete 0'', 14''', 18'''; Commit C'''

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p15  p14  p13  1   0   -   10
ld  p18  p1   p17  0   0   -   15
ld  p19  p2   p17  0   0   -   16
ml  p20  p19  p18  1   1   -   17
st  p20  p7   p17  0   0   -   19

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    1
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    1
9      4   r7  p10     0   -    0
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    0
20     14  r2  p17     0   -    1
21     18  -   -       0   -    1

Example (cycle 10)

Issue 8'', 4'''; Complete 4''; Commit 0'', 14''', 18'''

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p15  p14  p13  0   0   -   10
ld  p19  p2   p17  0   0   -   16
ml  p20  p19  p18  1   1   -   17
st  p20  p7   p17  0   0   -   19

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    1
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    1
9      4   r7  p10     0   -    1
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    0
20     14  r2  p17     0   -    1
21     18  -   -       0   -    1

Example (cycle 11)

Issue 10'''; Commit 4''

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p20  p19  p18  1   1   -   17
st  p20  p7   p17  0   0   -   19

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   -    0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    1
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    1
9      4   r7  p10     0   -    1
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    0
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    0
20     14  r2  p17     0   -    1
21     18  -   -       0   -    1

Example (cycle 12)

Complete 8, 0'''; OVERFLOW 8

WINDOW
Op  Pi   Pj   Pk   Qj  Qk  I   ROB#
ml  p20  p19  p18  1   0   -   17

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   ovf  1
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    1
9      4   r7  p10     0   -    1
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    1
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    0
20     14  r2  p17     0   -    1
21     18  -   -       0   -    1

Exception management (1)

ROB
Entry  PC  Ri  Pi,old  St  Exc  Cplt
3      8   r7  p9      0   ovf  0
4      C   r1  p3      0   -    1
5      10  -   -       1   -    0
6      14  r2  p4      0   -    1
7      18  -   -       0   -    1
8      0   r3  p8      0   -    1
9      4   r7  p10     0   -    1
10     8   r7  p14     0   -    0
11     C   r1  p11     0   -    1
12     10  -   -       1   -    0
13     14  r2  p12     0   -    1
14     18  -   -       0   -    1
15     0   r3  p13     0   -    1
16     4   r7  p15     0   -    0
17     8   r7  p19     0   -    0
18     C   r1  p16     0   -    1
19     10  -   -       1   -    1
20     14  r2  p17     0   -    1
21     18  -   -       0   -    1

Recover Register State

Register MAP
r1  p21
r2  p22
r3  p18
r4  p1
r5  p2
r6  p7
r7  p20

Free Pool: p23, p24, p6, p5

All the changes to the registers must be UNDONE, starting from the most recent instruction and going back through the ROB

Exception management (2)

ROB: contents unchanged from Exception management (1) (entries 3-21, Exc = ovf on entry 3)

Recover Register State

Register MAP
r1  p21
r2  p17
r3  p18
r4  p1
r5  p2
r6  p7
r7  p20

Free Pool: p23, p24, p6, p5, p22


Exception management (3)

ROB contents: same entries 3-21 as in Exception management (1)

Recover Register State

RegisterMAP: r1 p3, r2 p4, r3 p8, r4 p1, r5 p2, r6 p7, r7 p9

Free Pool: p23, p24, p6, p5, p22, p21, p20, p19, p18, p17, p16, p15, p14, p13, p12, p11, p10

trap PC = 8
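The walk-back shown in Exception management (1)-(3) can be sketched in a few lines of Python. This is only an illustrative model under simple assumptions: the function name `recover`, the tuple layout `(entry, PC, Ri, Pi_old)` and the list-based free pool are choices made here, not part of the slides. The data is the ROB, RegisterMAP and Free Pool of slide (1).

```python
def recover(rob, reg_map, free_pool, faulting_entry):
    """Undo register renames from the ROB tail back to the faulting entry (inclusive)."""
    reg_map, free_pool = dict(reg_map), list(free_pool)
    trap_pc = None
    for entry, pc, ri, pi_old in reversed(rob):   # most recent instruction first
        if entry < faulting_entry:
            break                                 # older instructions are left alone
        if ri is not None:                        # stores and branches rename nothing
            free_pool.append(reg_map[ri])         # release the newer physical register
            reg_map[ri] = pi_old                  # restore the previous mapping
        trap_pc = pc
    return reg_map, free_pool, trap_pc

# ROB entries 3..21 from the slide: (entry, PC, Ri, Pi_old)
rob = [
    (3, "8", "r7", "p9"), (4, "C", "r1", "p3"), (5, "10", None, None),
    (6, "14", "r2", "p4"), (7, "18", None, None), (8, "0", "r3", "p8"),
    (9, "4", "r7", "p10"), (10, "8", "r7", "p14"), (11, "C", "r1", "p11"),
    (12, "10", None, None), (13, "14", "r2", "p12"), (14, "18", None, None),
    (15, "0", "r3", "p13"), (16, "4", "r7", "p15"), (17, "8", "r7", "p19"),
    (18, "C", "r1", "p16"), (19, "10", None, None), (20, "14", "r2", "p17"),
    (21, "18", None, None),
]
reg_map = {"r1": "p21", "r2": "p22", "r3": "p18", "r4": "p1",
           "r5": "p2", "r6": "p7", "r7": "p20"}
free_pool = ["p23", "p24", "p6", "p5"]

new_map, new_free, trap_pc = recover(rob, reg_map, free_pool, faulting_entry=3)
print(new_map)
print(new_free)
print("trap PC =", trap_pc)
```

Undoing entry 20 alone gives the intermediate state of slide (2); running the walk down to entry 3 gives the RegisterMAP, Free Pool and trap PC of slide (3).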


Speculative Execution

• With branch prediction, the instructions after a predicted branch are executed speculatively
• Correct prediction: no loss of performance
• Wrong prediction: the speculative instructions must be eliminated
• The Reorder Buffer can be used to manage such elimination
• We add "speculative bits" (Spec field) to each ROB entry
• An instruction cannot commit if its speculative bits are set
  - If the speculation is correct, we clear the speculative bits
  - If the speculation is wrong, we do not commit the speculative instructions that are in the ROB
    - They are eliminated and we use an old-RM (a saved copy of the Register Map) to restore the mapping
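A minimal Python sketch of the mechanism described above, assuming at most one unresolved prediction at a time. The class `SpeculativeRob` and its method names are hypothetical, and the return of squashed physical registers to the free pool is not modeled. The data reproduces the second prediction of the example that follows (snapshot RegisterMAP 2, entries tagged Spec = 2).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class RobEntry:
    pc: str
    dest: Optional[str]   # logical destination, None for stores and branches
    spec: int = 0         # 0 = not speculative, n = depends on prediction n

class SpeculativeRob:
    def __init__(self, reg_map: Dict[str, str]):
        self.reg_map = dict(reg_map)
        self.rob: List[RobEntry] = []
        self.snapshots: Dict[int, Dict[str, str]] = {}
        self.active_tag = 0

    def predict(self, tag: int) -> None:
        """Branch predicted: save the current map (the old-RM) under this tag."""
        self.snapshots[tag] = dict(self.reg_map)
        self.active_tag = tag

    def dispatch(self, pc: str, dest: Optional[str] = None, phys: Optional[str] = None) -> None:
        if dest is not None:
            self.reg_map[dest] = phys
        self.rob.append(RobEntry(pc, dest, self.active_tag))

    def can_commit(self, entry: RobEntry) -> bool:
        return entry.spec == 0

    def resolve(self, tag: int, correct: bool) -> None:
        snapshot = self.snapshots.pop(tag)
        if correct:
            for e in self.rob:                  # clear the speculative bits
                if e.spec == tag:
                    e.spec = 0
        else:
            self.rob = [e for e in self.rob if e.spec != tag]  # eliminate entries
            self.reg_map = snapshot                            # restore the mapping
        self.active_tag = 0

cpu = SpeculativeRob({"r1": "p16", "r2": "p17", "r3": "p13", "r4": "p1",
                      "r5": "p2", "r6": "p7", "r7": "p15"})
cpu.predict(2)                                   # Predict TAKEN, snapshot = RegisterMAP 2
for pc, dest, phys in [("0", "r3", "p18"), ("4", "r7", "p19"), ("8", "r7", "p20"),
                       ("C", "r1", "p21"), ("10", None, None),
                       ("14", "r2", "p22"), ("18", None, None)]:
    cpu.dispatch(pc, dest, phys)
speculative_map = dict(cpu.reg_map)
committable = [cpu.can_commit(e) for e in cpu.rob]
cpu.resolve(2, correct=False)                    # wrong prediction
print(speculative_map, cpu.reg_map, cpu.rob)
```

Saving a whole copy of the map per prediction trades storage for a one-step restore, in contrast to the entry-by-entry walk-back used for exceptions.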


Example+speculation (cycle 2)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
1      0   r3  p6      0        0     0
2      4   r7  p5      0        0     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        0     0

RegisterMAP: r1 p11, r2 p4, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

Renamed Stream
p8  <- mem(p1+p4)
p9  <- mem(p2+p4)
p10 <- p9 * p8
p11 <- p3 - 1
mem(p7+p4) <- p10
p12 <- p4 + 8
P <- loop; p11!=0
p13 <- mem(p1+p12)
p14 <- mem(p2+p12)
p15 <- p14 * p13
p16 <- p11 - 1
mem(p7+p12) <- p15
p17 <- p12 + 8
P <- loop; p16!=0
p18 <- mem(p1+p17)
p19 <- mem(p2+p17)
p20 <- p19 * p18
p21 <- p16 - 1
mem(p7+p17) <- p20
p22 <- p17 + 8
P <- loop; p21!=0


Example+speculation (cycle 3)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
1      0   r3  p6      0        0     0
2      4   r7  p5      0        0     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        0     0
5      10  -   -       1        0     0
6      14  r2  p4      0        0     0
7      18  -   -       0        0     0

RegisterMAP: r1 p11, r2 p12, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

RegisterMAP 1 (snapshot): r1 p11, r2 p12, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

Predict TAKEN


Example+speculation (cycle 4)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
1      0   r3  p6      0        0     0
2      4   r7  p5      0        0     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        0     0
5      10  -   -       1        0     0
6      14  r2  p4      0        0     0
7      18  -   -       0        0     0
8      1C  r3  p8      0        0     1
9      20  r7  p10     0        0     1
10     24  r7  p14     0        0     1
11     28  r1  p11     0        0     1

RegisterMAP: r1 p16, r2 p12, r3 p13, r4 p1, r5 p2, r6 p7, r7 p15

RegisterMAP 1: r1 p11, r2 p12, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

Setting the speculative bits


Example+speculation (cycle 5)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
1      0   r3  p6      0        0     0
2      4   r7  p5      0        0     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        1     0
5      10  -   -       1        0     0
6      14  r2  p4      0        1     0
7      18  -   -       0        1     0
8      0   r3  p8      0        0     1
9      4   r7  p10     0        0     1
10     8   r7  p14     0        0     1
11     C   r1  p11     0        0     1
12     10  -   -       1        0     1
13     14  r2  p12     0        0     1
14     18  -   -       0        0     1

RegisterMAP: r1 p16, r2 p17, r3 p13, r4 p1, r5 p2, r6 p7, r7 p15

RegisterMAP 1: r1 p11, r2 p12, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

Correct prediction: reset the speculative bits


Example+speculation (cycle 6)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
1      0   r3  p6      0        1     0
2      4   r7  p5      0        0     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        1     0
5      10  -   -       1        0     0
6      14  r2  p4      0        1     0
7      18  -   -       0        1     0
8      0   r3  p8      0        0     0
9      4   r7  p10     0        0     0
10     8   r7  p14     0        0     0
11     C   r1  p11     0        1     0
12     10  -   -       1        0     0
13     14  r2  p12     0        0     0
14     18  -   -       0        0     0
15     0   r3  p13     0        0     2
16     4   r7  p15     0        0     2
17     8   r7  p19     0        0     2
18     C   r1  p16     0        0     2

RegisterMAP: r1 p21, r2 p17, r3 p18, r4 p1, r5 p2, r6 p7, r7 p20

RegisterMAP 1: r1 p11, r2 p12, r3 p8, r4 p1, r5 p2, r6 p7, r7 p10

RegisterMAP 2 (snapshot of RM0 at cycle 5): r1 p16, r2 p17, r3 p13, r4 p1, r5 p2, r6 p7, r7 p15

Predict TAKEN


Example+speculation (cycle 7)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
2      4   r7  p5      0        1     0
3      8   r7  p9      0        0     0
4      C   r1  p3      0        1     0
5      10  -   -       1        1     0
6      14  r2  p4      0        1     0
7      18  -   -       0        1     0
8      0   r3  p8      0        0     0
9      4   r7  p10     0        0     0
10     8   r7  p14     0        0     0
11     C   r1  p11     0        1     0
12     10  -   -       1        0     0
13     14  r2  p12     0        1     0
14     18  -   -       0        1     0
15     0   r3  p13     0        0     2
16     4   r7  p15     0        0     2
17     8   r7  p19     0        0     2
18     C   r1  p16     0        0     2
19     10  -   -       1        0     2
20     14  r2  p17     0        0     2
21     18  -   -       0        0     2

RegisterMAP: r1 p21, r2 p22, r3 p18, r4 p1, r5 p2, r6 p7, r7 p20

RegisterMAP 2: r1 p16, r2 p17, r3 p13, r4 p1, r5 p2, r6 p7, r7 p15

Wrong prediction (assumption): restore the mapping

The entries with speculative bits==2 are eliminated from the ROB


Example+speculation (cycle 8)

Entry  PC  Ri  Pi,old  St  Exc  Cplt  Spec
3      8   r7  p9      0        0     0
4      C   r1  p3      0        1     0
5      10  -   -       1        1     0
6      14  r2  p4      0        1     0
7      18  -   -       0        1     0
8      0   r3  p8      0        0     0
9      4   r7  p10     0        0     0
10     8   r7  p14     0        0     0
11     C   r1  p11     0        1     0
12     10  -   -       1        0     0
13     14  r2  p12     0        1     0
14     18  -   -       0        1     0

RegisterMAP: r1 p16, r2 p17, r3 p13, r4 p1, r5 p2, r6 p7, r7 p15

The speculative instructions are also eliminated from the pipeline


Renaming in the ROB

#logical registers == #physical registers
• The ROB also contains the "renamed" values
• The ROB commits in-order
• The registers provide the values only at dispatch time
• The values may also come from the FUs or the ROB itself
• Similar to the method used in Intel P6 architecture (i.e., Pentium Pro, II, III) and PowerPC 604

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]


Register Alias Table (RAT)

• Used during the renaming (at dispatch time)
• The RAT contains (for each register)
  - Register flag (1 if the most recent value is in the Register File)
  - ROB flag (1 if the most recent value is in the ROB)
  - ROB# (if the value is in the ROB, it points to the entry)
• If both Register flag and ROB flag are ZERO the value is yet to be produced
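The three cases encoded by the two flags can be illustrated with a small Python sketch. The names `RatSlot`, `locate` and `pack` are hypothetical, and the bit order used by `pack` (Rf, then ROBf, then a 6-bit ROB#) is an assumption made for the example.

```python
from dataclasses import dataclass

@dataclass
class RatSlot:
    rf: int = 1       # Register flag: most recent value is in the Register File
    robf: int = 0     # ROB flag: most recent value is in the ROB
    rob_id: int = 0   # ROB#: entry that holds (or will hold) the value

def locate(slot: RatSlot):
    """Tell where the most recent value of a logical register can be found."""
    if slot.rf:
        return ("regfile", None)
    if slot.robf:
        return ("rob", slot.rob_id)
    return ("pending", slot.rob_id)   # both flags zero: value yet to be produced

def pack(slot: RatSlot) -> int:
    """Pack the slot into 8 bits: Rf | ROBf | ROB# (6 bits). Layout is assumed."""
    return (slot.rf << 7) | (slot.robf << 6) | (slot.rob_id & 0x3F)

print(locate(RatSlot(1, 0, 0)), locate(RatSlot(0, 1, 5)), locate(RatSlot(0, 0, 9)))
print(bin(pack(RatSlot(0, 1, 5))))
```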

RAT-SLOT: Rf (1 bit) | ROBf (1 bit) | ROB# (e.g. 6 bits)

Issue Buffer

• Similar to Tomasulo's Reservation Stations
• Contains
  - Busy: currently busy (not available)
  - Op: opcode
  - Qj, Qk: tag fields (indicate the associated ROB slots)
  - Vj, Vk: values of the operands (sources)
  - ROB#: associated ROB element that holds the result

IB-SLOT: Busy (1 bit) | Op (e.g. 8 bits) | Qj (1 bit) | Qk (1 bit) | Vj (16 bits) | Vk (16 bits) | ROB# (e.g. 6 bits)

ROB Renaming

• 4 major steps
  • Dispatch
  • Issue
  • Complete (writeback)
  • Commit (retire)

[Figure: the reorder buffer as a queue]
  - when dispatched: reserve entry at tail
  - when execution finished: place data in entry
  - when needed: bypass to other instructions
  - when complete (commit): remove from head

ROB Renaming: Dispatch

• Check for slot in issue window
  • If not available, stall due to structural hazard
  • If window available, assign window slot and ROB entry
• Copy Opcode and ROB# to issue window slot
• For each source operand, consult RAT
  • If value in register, copy value to issue window slot
  • If value in ROB, copy value to issue window slot
  • Else, copy ROB# to issue window slot Q field
• For destination register
  - Place ROB# into RAT
  - Clear register and ROB flags in RAT
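The dispatch steps above can be sketched as follows. This is a simplified model, not the P6 implementation: the ROB is a growing Python list (so ROB# is just an index and never wraps), the RAT maps each register to a tuple `(Rf, ROBf, ROB#)`, and the function name `dispatch` and the dictionary field names are hypothetical.

```python
def dispatch(op, dest, srcs, rat, regfile, rob, window, window_size=4):
    """Dispatch one instruction; returns its ROB# or None on a structural stall."""
    if len(window) >= window_size:
        return None                                  # no free issue-window slot
    rob_id = len(rob)                                # reserve the entry at the tail
    rob.append({"dest": dest, "value": None, "done": False})
    slot = {"op": op, "rob": rob_id, "V": [], "Q": []}
    for r in srcs:                                   # consult the RAT per source
        rf, robf, producer = rat[r]
        if rf:                                       # value in the register file
            slot["V"].append(regfile[r]); slot["Q"].append(None)
        elif robf:                                   # value already in the ROB
            slot["V"].append(rob[producer]["value"]); slot["Q"].append(None)
        else:                                        # not produced yet: keep the tag
            slot["V"].append(None); slot["Q"].append(producer)
    if dest is not None:
        rat[dest] = (0, 0, rob_id)                   # clear both flags, store ROB#
    window.append(slot)
    return rob_id

regfile = {"r1": 10, "r2": 20, "r3": 0}
rat = {r: (1, 0, 0) for r in regfile}                # everything starts in the RF
rob, window = [], []
a = dispatch("add", "r3", ["r1", "r2"], rat, regfile, rob, window)
b = dispatch("mul", "r1", ["r3", "r2"], rat, regfile, rob, window)
print(a, b, window[1])
```

The second instruction reads `r3`, which the first one has not produced yet, so its slot carries the tag 0 in the Q field instead of a value.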

ROB Renaming Dispatch

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]

ROB Renaming: Issue

• Wakeup:
  • When all values are ready in window slot (instruction has all required data)
• Select:
  • Issue logic arbitrates requested and available resources
  • Selected instructions read register values and issue to functional units for execution
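A minimal sketch of the two phases, assuming an oldest-first arbitration policy (the slide only says that the issue logic arbitrates, so the policy is an assumption). The names `wakeup`, `select` and the `unit` field are hypothetical.

```python
def wakeup(window):
    """Slots whose operands are all present, i.e. no pending tag in any Q field."""
    return [s for s in window if all(q is None for q in s["Q"])]

def select(ready, free_units):
    """Grant functional units to ready slots, oldest (lowest ROB#) first."""
    free = dict(free_units)
    issued = []
    for s in sorted(ready, key=lambda s: s["rob"]):
        if free.get(s["unit"], 0) > 0:
            free[s["unit"]] -= 1
            issued.append(s["rob"])
    return issued

window = [
    {"rob": 0, "unit": "alu", "Q": [None, None]},
    {"rob": 1, "unit": "alu", "Q": [0, None]},      # still waits for ROB# 0
    {"rob": 2, "unit": "alu", "Q": [None, None]},
    {"rob": 3, "unit": "mem", "Q": [None]},
    {"rob": 4, "unit": "alu", "Q": [None, None]},   # ready, but no ALU left
]
ready = wakeup(window)
issued = select(ready, {"alu": 2, "mem": 1})
print([s["rob"] for s in ready], issued)
```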

ROB Renaming Issue

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]

ROB Renaming: Completion

• Write result into ROB entry
• Update RAT; set ROB flag if ROB entry matches ROB# in RAT
  • I.e. this is the most recent version
• Window source entries monitor ROB#
  • All matching entries capture result data
• Place any exceptions in ROB
• Mark ROB entry complete

ROB Renaming Completion

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]

ROB Renaming: Commit

• When instruction at head of ROB is complete
• If no exception
  • Remove instruction from ROB
  • Copy value to register file
  • Update RAT: if ROB entry matches ROB# in RAT
    - set register flag
• If exception
  • Set exception bit in trap register
  • Stop issue and commit
  • Register file holds correct state
  • Vector to trap handler
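The completion and commit steps can be sketched together. As before this is an illustrative model under assumptions: the RAT holds tuples `(Rf, ROBf, ROB#)`, the ROB is a list indexed by ROB#, removal from the ROB is modeled by advancing a head index, and the function names are hypothetical.

```python
def complete(rob_id, value, rob, rat, window, exception=None):
    """Writeback: store the result, update the RAT, wake up waiting operands."""
    entry = rob[rob_id]
    entry["value"], entry["exc"], entry["done"] = value, exception, True
    dest = entry["dest"]
    if dest is not None and rat[dest] == (0, 0, rob_id):
        rat[dest] = (0, 1, rob_id)            # most recent version is now in the ROB
    for slot in window:                       # window entries monitor the ROB#
        for i, q in enumerate(slot["Q"]):
            if q == rob_id:
                slot["V"][i], slot["Q"][i] = value, None

def commit(head, rob, rat, regfile):
    """Try to retire the entry at the head; returns (new_head, trap)."""
    entry = rob[head]
    if not entry["done"]:
        return head, None                     # head not complete: nothing retires
    if entry["exc"] is not None:
        return head, entry["exc"]             # stop; the register file is precise
    dest = entry["dest"]
    if dest is not None:
        regfile[dest] = entry["value"]        # copy value to the register file
        if rat[dest] == (0, 1, head):
            rat[dest] = (1, 0, head)          # set register flag
    return head + 1, None

regfile = {"r1": 10, "r2": 20, "r3": 0}
rat = {"r1": (0, 0, 1), "r2": (1, 0, 0), "r3": (0, 0, 0)}
rob = [{"dest": "r3", "value": None, "done": False},
       {"dest": "r1", "value": None, "done": False}]
window = [{"rob": 1, "V": [None, 20], "Q": [0, None]}]

complete(0, 30, rob, rat, window)
head, trap = commit(0, rob, rat, regfile)
complete(1, None, rob, rat, window, exception="ovf")
head2, trap2 = commit(head, rob, rat, regfile)
print(head, trap, head2, trap2, regfile)
```

The faulting instruction never reaches the register file, so `r1` keeps its old value when the trap is raised.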

ROB Renaming Commit

[Figure: Registers, RAT, Issue Window, Functional Units, reorder buffer]

Multi-Ported Caches

• For superscalar memory bandwidth
• True multi-ported
  - (higher cost, slower access)
• Interleaved
  - (control complexity)
• Multi-access per cycle
  - (slower clock)
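The control decision behind the interleaved option can be sketched in Python. The 32-byte line size, the two-bank split on even/odd line addresses and the function names are assumptions made for the example.

```python
def bank_of(address, line_bytes=32, banks=2):
    """Bank chosen by the low-order bits of the line address (even/odd for 2 banks)."""
    return (address // line_bytes) % banks

def accesses_served(addr_pipe1, addr_pipe2, line_bytes=32, banks=2):
    """Both pipes are served in one cycle only if they hit different banks."""
    b1 = bank_of(addr_pipe1, line_bytes, banks)
    b2 = bank_of(addr_pipe2, line_bytes, banks)
    return 2 if b1 != b2 else 1   # on a conflict one pipe waits for the next cycle

print(accesses_served(0x00, 0x20))   # even line and odd line
print(accesses_served(0x00, 0x40))   # two even lines
```

A bank conflict serializes the two accesses, which is the source of the control complexity noted for this organization.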

[Figure: three cache organizations, each serving pipe 1 and pipe 2 address/data: a multi-ported Cache; two RAM banks for Even Addresses and Odd Addresses; a RAM behind a mux driven by a 1/2 cycle clock]

SUPERSCALAR CASE STUDIES


Case Study: MIPS R10000

[Figure: MIPS R10000 block diagram. Labels: PC; ITLB (8 entry); Instr. Cache (32 KB, 2-way); Pre-decode; Resume Cache; BHT (512x2); Decode; Map; Dispatch; Map Table; Active List; Integer, Addr. and FP buffers (16 each); Ld/Sto Queue; Integer Registers (32 logical, 64 physical); FP Registers (32 logical, 64 physical); ALU1; ALU2; shift; mul/div; Adder; FP Add; FP Mult; div/sqrt; DTLB (64 entries); Data Cache (32 KB, 2-way); L2 Cache Interface (128)]

MIPS R10000, cont

• 4-way superscalar (dispatch and commit)
  • 1 memory
  • 2 ALU
  • 2 FP
• Register renaming (32 → 64)
• Reorder buffer (active list)
• 4-deep branch speculation
• Resume cache: keeps instructions on non-predicted path for fast recovery
• 2-way 32 KB set assoc. instr. and data caches


Case Study: DEC 21164

[Figure: DEC 21164 block diagram. Labels: PC; ITLB (48 entry); Instr. Cache (8 KB, direct); BHT (2K x 2); Instruction Buffers (2 x 4); Decode; Dispatch; Integer Registers (32 logical); FP Registers (32 logical); ALU1; ALU2; shift; branch; FP Add/Div.; FP Mult; DTLB (48 entry); Data Cache (8 KB, direct); L2 Cache (96 KB, 3-way); 128-bit paths]

DEC 21164, cont

• 4-way superscalar
  • 2 memory
  • 2 ALU
  • 2 FP
• Issue in-order
• Interrupts RSR+bypasses; FP Imprecise
• 1-deep branch speculation
• Direct-mapped 8 KB instr. and data caches
• On-chip L2 Cache


Case Study: AMD K5

[Figure: AMD K5 block diagram. Labels: PC; Instr. Cache (16 KB); Predecode; BHT (1K x 1); ROP Decode; Dispatch; Registers (40 words); Reorder Buffer; ALU0/shift; ALU1; Ld/Sto 0; Ld/Sto 1; FPU; Branch; Data Cache (8 KB); TLB (128 entry)]

AMD K5, cont

• 4-way superscalar
  • 2 memory
  • 2 ALU
  • 1 FP
  • 1 Branch
• Use ROB renaming
  • Reservation Stations hold control only, not data
  • Data read from ROB + register file at issue time
• 16 entry reorder buffer
• 1-deep branch speculation
• 16 KB instr. and 8 KB data cache
• CISC instructions converted to RISC Ops (ROPs)

[Figure: Registers, reservation stations, functional units, reorder buffer]

CISC to ROP conversion

• ROPs are like microcode
• Predecode helps identify instruction boundaries
• CISC instructions placed into Byte Queue
• Byte Queue instructions broken into 4 ROPs per cycle

[Figure: Instruction Cache, Byte Queue, Parse/Duplicate, four ROP Convert units, MROM]

Byte Queue:
add EAX,[EBP+d8]   (2 ROPs)
cmp EAX,imm32      (1 ROP)
push ECX           (2 ROPs)

ROPs:
load temp,[EBP+8]
add EAX,temp
cmp EAX,imm32
sub ESP,4
str [ESP],ECX
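The conversion in the figure can be mimicked with a lookup table and a 4-wide grouping step. This is only a toy model: the table `ROP_TABLE` contains just the three instructions of the slide, the function name `convert` is hypothetical, and letting an instruction's ROPs spill into the next cycle is an assumption about how the 4-per-cycle limit is applied.

```python
# Expansion of each CISC instruction into ROPs, as listed on the slide.
ROP_TABLE = {
    "add EAX,[EBP+d8]": ["load temp,[EBP+8]", "add EAX,temp"],
    "cmp EAX,imm32":    ["cmp EAX,imm32"],
    "push ECX":         ["sub ESP,4", "str [ESP],ECX"],
}

def convert(byte_queue, width=4):
    """Expand the queued instructions and group the ROPs `width` per cycle."""
    rops = [rop for insn in byte_queue for rop in ROP_TABLE[insn]]
    return [rops[i:i + width] for i in range(0, len(rops), width)]

cycles = convert(["add EAX,[EBP+d8]", "cmp EAX,imm32", "push ECX"])
for n, group in enumerate(cycles, start=1):
    print("cycle", n, group)
```

With five ROPs and four converters, the last ROP of `push ECX` falls into a second cycle in this model.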


Case Study: Intel P6

[Figure: Intel P6 block diagram. Labels: PC; Instr. Cache (8 KB); BTB (512 x 4); instruction byte queue; ROP Decode; Dispatch; Registers; Reorder Buffer; Reservation Stations (20); ALU0; ALU1; FPU; Load Addr; Store Addr; Store Data; Memory Reorder Buffer; Data Cache (8 KB); TLB (64 entry)]

Intel P6, cont

• Uses ROPs (µOps) similar to K5
• 20 unified reservation stations
• ROB renaming similar to K5
• Up to 3 simple instr. decoded per cycle; only 1 complex
• Up to 5 ROPs issue per cycle
• Up to 3 ROPs retire per cycle
• Max 1 load and 1 store per cycle
• Hit under miss L1 cache
• CPU and L2 cache on MCM
• 512 entry, 4-way set assoc. BTB


Intel P6 Pipeline Timing

• 12 stage pipeline
• Requires good branch prediction

[Figure: pipeline timing. Stage labels: BTB Access, I-Cache Access, Decode/ROP Gen., Reg. Rename, Dispatch, Issue, Execute, Retire, Addr. Add, D-Cache Access]

Deep Pipelines

• Another important trend

P5 (5 stages): Pref, Dec 1, Dec 2, Exec, Wrt Bck
P6 (12 stages): F1, F2, D1, D2, D3, Rn, ROB, Rd/Sch, Disp, Ex, Ret1, Ret2
Willamette (20 stages): IP1, IP2, TC1, TC2, Dr, Al, Rn, Q, S1, S2, S3, Dp1, Dp2, R1, R2, Ex, Flgs, Br, Dr, WB

Source: "A Wire-Delay Scalable Microprocessor Architecture for High Performance Systems," S.W. Keckler, Doug Burger, C.R. Moore, R. Nagarajan, K. Sankaralingam, V. Agarwal, M.S. Hrishikesh, N. Ranganathan, and P. Shivakumar. International Solid-State Circuits Conference (ISSCC), pp. 1068-1069, February, 2003.


Intel Core-i7 (Nehalem Architecture)


Intel Core-i7-2 (Sandy-Bridge Architecture)


Intel Core-i7-4 (Haswell Architecture)


Intel Superscalar Comparison


ARM Cortex-M

• Why it is important:

ARM Cortex-M7


AMD K10 (years 2007-2012)


By appaloosa - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3535462


AMD 15h “Bulldozer” (years 2011-present)


By Shigeru23 - Made by uploader (ref:[1], [2], [3]), CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=17130257


AMD Ryzen (2017-present)


RISC-V (2011-present) – Open Source Chip


COTSON TUTORIAL

Roberto Giorgi, University of Siena, October

COTSon Outline

• COTSon Concepts
• Overview of features
• Comparison with other evaluation approaches
• Simulation configuration
• Virtualizer overview (SimNow)
• Running COTSon Examples
• Case study: analyzing CJPEG
• Advanced features: defining a Region Of Interest

COTSON CONCEPTS


COTSon Working Spaces

• COTSon creates two working spaces:
  – HOST SPACE
  – GUEST SPACE
• The HOST SPACE is where all models and the simulation-control run
  – E.g., a cache model
• The GUEST SPACE is where a Virtualized Platform runs
  – E.g., a system with Linux/Windows/Android
• The two spaces communicate through a well defined interface
  – The GUEST can be instructed to send functional information to the HOST, transparently to any GUEST functionality and GUEST software
  – The HOST can control the simulation, define events and their timing, and feed back the guest with the appropriate timing info, as if it was interacting with a real component

Functional Directed Simulation

• Functional Quantum: …s or …s of Sim seconds (Simulated time)
• About …s of Host seconds (Simulation Time or Wall Clock Time)

[Figure: timeline of Wall Clock Time against Simulated Time, alternating Functional Mode (GUEST) and Timing Mode (HOST)]
  - Functional Mode (GUEST): free running for …s, passing …s of events
  - Timing Mode (HOST): calculating actual timing (e.g., …s)
  - Ask for a …s QUANTUM of events; CPI=…
  - Simulated Time: e.g. … slowdown

Classical Evaluation Methodology

• Purposes of the common, shared platform
  – Evaluate the research performed by each partner
  – Transfer the respective knowledge to the other partners

[Figure: APP INPUT (fibx.c, mmx.c) → COTSon EVALUATION PLATFORM → APP OUTPUT and APPS PERFORMANCE METRICS (plot of speedup vs. # cores: 1, 2, 3, 4, avg; series mmul and fib)]

OVERVIEW OF FEATURES


COTSon Overview

COMPARISON WITH OTHER APPROACHES


Comparison among different approaches for doing research related to multiple-node evaluation (information revised from data of the RAMP project)

Criterion                 SMP               Cluster           FPGA              Emulator            Simulator
Scalability (1K cores)    C                 A                 A                 A                   A
Cost (1K cores)           F (€…M)           C                 B (€…-…M)         A+ (€…M)            A+ (€…M)
Power/Space (kW, racks)   D (…kW, …racks)   D (…kW, …racks)   A (…kW, …racks)   A+ (…kW, …racks)    A+ (…kW, …racks)
Observability             D                 C                 A+                A+                  A+
Reproducibility           B                 D                 A+                A+                  A+
Reconfigurability         D                 C                 A+                A+                  A+
Credibility               A+                A+                B+/A-             F/D                 C
Development time          B                 B                 C                 A+                  A+
Performance (clock)       A (…GHz)          A (…GHz)          C (…GHz)          B (≈… of original)  C (1/… to 1/… of SMP)
x86-64 ISA                A+                A+                F                 A+                  A+
Modifiable                F                 F                 B                 A                   A
GPA                       D                 D                 B+/A-             B                   A

Fu tio al-Di e ted app oa h• A ti i g si ulato a also use diffe e t app oa hes

depe di g o the elatio ship et ee the fu tio al odel f a d the ti i g odel t [Maue ]: i

fu tio al-first o tra e-dri e , the f is u fi st a d sepa atel a d the t is u late o i a o pletel de oupled fashio the all f is u efo e the t is u ; ii ti i g dire ted o e e utio d i e , the f a d t a e losel oupled o de oupli g ; iii ti i g-first , the t

d i es the f , oth a e o pletel de oupled, ut the fu tio has to e he ked late o a d e e tuall u do e; i fu tio al-dire ted , the f d i es the t , oth a e o pletel de oupled, the fu tio is al a s the ight o e ut e eed a ti i g feed a k f o t to o e t the

ti i g. [A gollo ].

Roberto Giorgi, University of Siena, of 8611

FUNCTIONAL/TIMING SIMULATION

• Functional-First (Trace-driven): Functional Simulator, then Timing Simulator
  + Fast
  - No timing feedback
• Timing and Functional Simulator Integrated (SimOS)
  - Complex, no reuse, very slow
• Timing-Directed (Exec-driven): Functional Simulator (no timing, complete function) and Timing Simulator (complete timing, no? function)
  + Timing feedback
  - Tight coupling
  - Very slow
• Timing-First (Multifacet): Functional Simulator (no timing, complete function) and Timing Simulator (complete timing, partial function)
  + Timing feedback
  + Using existing simulators
  + Software development advantages
  - Slow
• Complete function, partial timing: speed

Source: Multifacet Project (www.cs.wisc.edu/multifacet), [Mauer-sigmetrics-Full_System Timing_First Simulation]

COTSon: FUNCTIONAL-DIRECTED

• A variant of functional-first
• Adds timing feedback at coarse granularity (…s – …s of instructions)
• Applications see an approximation of time
• May miss some fine-grain timing interaction
• Compatible with fast caching emulators and samplers

OTHER RECENT x86 SIMULATORS

[Figure: classification of recent simulators, with the labels "Multi-node" and "Timing-directed/integrated". Adapted from: Heirman120401-ISPASS Tutorial, The SNIPER multi-core simulator]


SIMULATION CONFIGURATION


COTSon simulation configuration

The COTSon simulation infrastructure is controlled by setting all the relevant information about the simulation and the target system configuration in an input configuration file.

COTSon uses the Lua scripting language to manage this configuration file:

• The first section of the file describes the global options;
• The second section of the file describes the SimNow configuration;
• The third section of the file describes the target system configuration.

COTSon simulation configuration: a first example

A simple example of a COTSon configuration file comes with the simulator installation:

1. Let's assume the COTSon installation directory is:

    ./cotson

2. Let's move to the examples directory:

    cotson@cotson1$ cd ./cotson/trunk/src/examples

3. Open the simple CPU simulation example:

    cotson@cotson1$ gedit one_cpu_simple.in

COTSon simulation configuration: global options section

    options = {
        max_nanos = "…M",
        sampler = {
            type = "simple",
            quantum = "…k"
        },
        heartbeat = {
            type = "file_last",
            logfile = "one_cpu_simple.log"
        },
    }

• max_nanos: sets the duration of the entire simulation (here … million nanoseconds).

• sampler: timing is calculated for all events; HOST/GUEST synchronization takes place every …k ns (… us), i.e. every quantum.
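The relation between the simulation length and the quantum can be checked with a little arithmetic. The sketch below is only an illustration: it assumes that the "k" and "M" suffixes mean thousands and millions of nanoseconds, as the annotations above suggest, and the helper names `to_nanos` and `sync_points` as well as the sample values are hypothetical, not taken from one_cpu_simple.in.

```sh
#!/bin/sh
# Convert a duration such as "100k" or "10M" to plain nanoseconds.
# Assumption: k = thousands, M = millions of nanoseconds.
to_nanos() {
    case "$1" in
        *k) echo $(( ${1%k} * 1000 )) ;;
        *M) echo $(( ${1%M} * 1000000 )) ;;
        *)  echo "$1" ;;
    esac
}

# Rough number of host/guest synchronization points in one run:
# total simulated time divided by the quantum.
sync_points() {
    echo $(( $(to_nanos "$1") / $(to_nanos "$2") ))
}

# Hypothetical values: a 10M ns run with a 100k ns quantum.
sync_points 10M 100k    # prints 100
```

A smaller quantum gives more frequent synchronization, at the price of more synchronization points for the same simulated time.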

• heartbeat: sets the simulator to store the simulation statistics periodically. The period between two stores is called the heartbeat.

• logfile: the simulation statistics are recorded in a log file called one_cpu_simple.log.

COTSon simulation configuration: SimNow configuration section

    simnow.commands = function()
        use_bsd('./cotson/trunk/data/1p.bsd')
        use_hdd('./cotson/trunk/data/karmic64.img')
        set_journal()
        send_keyboard('gcc -O… -… -… /home/user/test.i')
    end

• The SimNow commands are grouped in a function.

• use_bsd: specifies the location on the host computer where the SimNow configuration (BSD) resides. The 1p.bsd file sets up a single-CPU machine.

• use_hdd: specifies the location on the host computer where the hard-disk image resides. The karmic64.img file is a small hard disk with the Ubuntu Karmic-64 Linux image installed.

• set_journal: enables the journaling of the file system (see the previous action, use_hdd).

• send_keyboard: allows the user to run a command inside the OS of the simulated machine (i.e., Ubuntu Karmic-64 Linux). In this case the hard-disk image contains a test.i file that is compiled with gcc inside the simulated system.

COTSon simulation configuration: target system configuration section

    function build()
        i = 0
        while i < disks() do
            disk = get_disk(i)
            disk:timer{ name = 'disk'..i, type = "simple_disk" }
            i = i + 1
        end
        i = 0
        while i < nics() do
            nic = get_nic(i)
            nic:timer{ name = 'nic'..i, type = "simple_nic" }
            i = i + 1
        end
        …

• The configuration of the simulated machine is described within a function called build().

• First, the function checks the number of disks attached to the system. For each discovered disk, the function sets the proper timer.

• Similarly to the case of the disks, the function checks the number of network interfaces (NICs) attached to the system. For each discovered NIC, the function sets the proper timer.

COTSon simulation configuration: target system configuration section (continued)

        if cpus() ~= 1 then
            error("This experiment only wants to handle 1 cpu")
        end
        cpu = get_cpu(0)
        cpu:timer{ name = 'cpu0', type = "timer…" }
        mem = Memory{ name = "main", latency = … }
        l2 = Cache{
            name = "l2cache", size = "…kB",
            line_size = …, latency = …, num_sets = …, next = mem,
            write_policy = "WB", write_allocate = "true"
        }
        …
        cpu:instruction_cache(ic)
        cpu:data_cache(dc)
        cpu:instruction_tlb(it)
        cpu:data_tlb(dt)
    end

• The function gets the number of CPUs installed in the target system. If the number is different from 1, the simulation is stopped. CPU numbering starts from 0.

• The CPU with ID 0 is selected, and then the timer for the CPU is set.
• A memory device is created, representing the external DRAM of the target system. For the memory it is possible to specify the main features, such as the latency.

• All the components of the target system are specified in a hierarchical fashion. In this case, the second-level cache is instantiated. For each component the main features are configured (e.g., latency, size, number of sets).

• Once all the components are instantiated, they are connected to the CPU by calling a set of functions.

VIRTUALIZER OVERVIEW: SIMNOW

SimNow overview

SimNow is a virtualizer that is used in COTSon as the functional simulator. It has been developed by AMD. It allows the user to configure a full-system target architecture by changing the various components (e.g., CPU type, number of CPUs, main memory size and organization, etc.).

The main features of SimNow are:

• Several CPU models are available: Turion, Athlon, Opteron, …;
• Dynamic translation of instructions: the input instruction stream (target ISA) is translated into a C-like equivalent code that is then compiled for the native machine;
• Deterministic execution.

SimNow overview (continued)

The configuration of the functionally simulated machine is stored in a configuration file called BSD (BroadSword Document). The file contains all the devices, the device attributes, the connection points AND a snapshot of the current point of simulation. The SimNow distribution comes with several BSD configurations.

SimNow is controlled by a shell, which controls the simulation, discovers devices, instantiates devices and interfaces with devices.

Here a device is a shared library that interfaces with the shell environment.

From a user's point of view, the simulation environment is composed of two main windows (both are hidden during a COTSon simulation):

• SimNow Command Window (for the sake of simplicity we can ignore this);
• SimNow User Interface Window.

SimNow User Interface Window

• The Video Output Area: displays the guest screen as on the display of a real system (Linux, Windows, Android)

SimNow Parameters

Which devices are configured?

• Disk images: represent physical drives such as hard disks (.hdd image) or CD drives (.iso image). For hard disks, DiskTool can be used to create an empty device on which an OS can be installed;
• BIOS: is treated as a memory device, and can be loaded with a BIOS initialization ROM;
• DRAM: the device represents a DIMM module that can be configured in terms of memory type, densities, rank, etc. through the SPD interface;
• CPU: can be configured in terms of CPU family, stepping, etc.

SimNow parameters are visualized in a graphical way, which makes it easier to analyze and/or change the configuration of the devices (view → show devices).

[Figure: AMD SimNow Virtualizer Setup]

SimNow Xtools

The guest OS in SimNow can communicate with the external world through Xtools. Xtools are run inside the simulated machine:

• xget <source_path> <dest_path> to transfer a file from the host to the guest;
• xput <source_path> <dest_path> to transfer a file from the guest to the host;

And also (less used):

• simcontrol to control the simulation;
• time;
• echo.

SimNow additional instruments

SimNow allows the user to debug the target system by means of a set of different instruments:

• SimNow internal Debugger;
• SimNow debugging via Serial Port;
• Analyzer Interface (to build callback functions in the host); COTSon "AbAeterno", the main COTSon simulation loop, is one of them;
• Monitor Interface (to build callback functions in the host with even more guest state information: more accurate but slower).

SimNow internal Debugger

A device that connects to a CPU model, with the following capabilities:

• Set breakpoints;
• View/Alter GP registers and MSRs;
• View/Alter linear or physical memory;
• Perform I/O or PCI configuration cycles.

[Figure: SimNow debugger]

SimNow debugging via serial ports

The serial I/O port can be interfaced using:

• A named pipe:
  - ~/.simnow/com…/simnow_in for writing
  - ~/.simnow/com…/simnow_out for reading
• A serial host port (requires supervisor privileges)

The pipe is opened by launching the following command in the SimNow Command Window:

    Serial:….SetCommPort pipe
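The named-pipe mechanism itself can be tried outside SimNow with an ordinary FIFO. The following is a minimal sketch under stated assumptions: it uses a temporary directory and a hypothetical pipe name (`serial_in`) instead of the real SimNow paths, and the function name `pipe_roundtrip` is made up for the illustration.

```sh
#!/bin/sh
# Send one line through a FIFO and read it back, mimicking a host
# program that writes to one end while another side reads the other.
pipe_roundtrip() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/serial_in" || return 1
    # Writer side: blocks until a reader opens the pipe, so run it in background.
    printf '%s\n' "$1" > "$dir/serial_in" &
    # Reader side: receives the line written above.
    IFS= read -r line < "$dir/serial_in"
    wait
    rm -rf "$dir"
    printf '%s\n' "$line"
}

pipe_roundtrip 'hello guest'    # prints: hello guest
```

With SimNow, one end of each pipe is held by the simulator, so a host program would only open the `_in` pipe for writing and the `_out` pipe for reading.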

SimNow Analyzer interface

Analyzers are callback functions written in C. The analyzers are dynamically linked to the simulator in order to gather statistics.

See the analyzer SDK documentation for more detail.


RUNNING COTSON EXAMPLES


Experimental session

Let's start with a full functional simulation:

• No timing models are used;
• Only the functional behavior is reproduced;
• A simple machine composed of 1 CPU and … GB of main memory, running the Karmic-64 Linux distribution, is simulated.

    one_node_script = 'functional'
    display = os.getenv("DISPLAY")

    simnow.commands = function()
        use_bsd('1p.bsd')
        use_hdd('karmic64.img')
        set_journal()
    end

Experimental session

Let's start with a full functional simulation (cont'd):

• Assume we are in the COTSon installation directory:

    ./cotson/trunk/

• Move to the examples directory:

    cotson@cotson1$ cd ./src/examples

• Run the simulation:

    cotson@cotson1$ make run_functional

The command for running an example is make run_<COTSon configuration>, i.e. make run_ followed by the name of the Lua script that contains the COTSon configuration.

Experimental session

Exercise: Try to run the one_cpu_simple example.

• What is the difference between the two configurations?
• Is the output different from the previous example?

Functional simulation: an active SimNow window is opened, where the MIPS of the simulated machine can be read. No additional information is provided.

Timing simulation: no active SimNow window appears. The output of the simulation reports the performance of the simulated architecture in terms of Instructions Per Cycle (IPC).

The IPC is available only when the timing simulation is enabled, since computing it requires a detailed model of the target system.
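As a reminder of what the metric means, IPC is the number of executed instructions divided by the number of elapsed cycles. The helper below is a small sketch of that division; the function name `ipc` and the sample counter values are hypothetical and do not come from a real COTSon run.

```sh
#!/bin/sh
# IPC = executed instructions / elapsed cycles.
# $1 = instruction count, $2 = cycle count.
ipc() {
    awk -v instr="$1" -v cycles="$2" 'BEGIN {
        if (cycles <= 0) exit 1          # no cycles: IPC is undefined
        printf "%.3f\n", instr / cycles
    }'
}

# Hypothetical counters: 1.5 million instructions in 1 million cycles.
ipc 1500000 1000000    # prints 1.500
```

A purely functional run has no cycle count to divide by, which is why only the timing simulation can report this value.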

Experimental session: execution traces

An execution trace is the result of recording all the executed instructions along with the memory accesses performed by the target system.

The trace is generally stored in a file, for later analysis. The execution trace is an important instrument to analyze the behavior of both the target system and the executed application.

The COTSon simulation infrastructure allows the user to extract an execution trace, along with the main performance parameters (e.g., cache misses, number of load/store instructions, etc.).

Experimental session

Consider the following simple while-loop application:

    #include <stdio.h>

    int main(void)
    {
        int vec[10] = { …, …, …, …, …, …, …, …, …, … };
        int sum = 0;
        int i = 0;

        while (i < 10)
        {
            sum = sum + vec[i];
            i++;
        }
        return 0;
    }

Experimental session

Let's modify the one_cpu_simple example in order to extract an execution trace:

1. Open the one_cpu_simple.in Lua file:

    cotson@cotson1 $ cd COTSON/cotson-6…/trunk/src/examples
    cotson@cotson1 $ gedit one_cpu_simple.in

2. Modify the SimNow commands section in order to load and execute the compiled while-loop test program:

    send_keyboard('xget /homelocal/cotson/COTSON/while-loop /home/user/while-loop')
    send_keyboard('chmod +x while-loop; ./while-loop')

3. Add the trace logger to the CPU configuration:

    cpu = get_cpu(0)
    cpu:timer{ name = 'cpu0',
               type = "trace_stats",
               trace_file = "/homelocal/cotson/COTSON/one_cpu_trace.log" }

Experimental session

Exercise: Analyze the execution trace

• How many branches are in the trace?
• How many cache misses occurred during the execution?

Exercise: Analyze the execution trace (continued)

• How many branches and jumps are there in the trace?
• How many instructions have been executed?

Branch and jump instructions appear in the trace as instructions whose Op-Code has the form J….
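Counting by hand quickly becomes tedious, so a small filter helps. The sketch below assumes, hypothetically, that the trace holds one instruction per line and that the mnemonic appears as a whitespace-separated field; the real layout of one_cpu_trace.log may differ, and the function names are made up for the illustration.

```sh
#!/bin/sh
# Count trace lines that contain a mnemonic starting with J (JMP, JNE, ...).
# Assumption: one instruction per line, mnemonic is a separate field.
count_jumps() {
    awk '{
        for (f = 1; f <= NF; f++)
            if ($f ~ /^[Jj][A-Za-z]+$/) { n++; break }
    }
    END { print n + 0 }' "$1"
}

# Count all non-empty trace lines, i.e. executed instructions
# under the same one-instruction-per-line assumption.
count_instructions() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Usage (file name from the exercise):
#   count_jumps one_cpu_trace.log
#   count_instructions one_cpu_trace.log
```

If the trace puts the Op-Code in a fixed column, testing only that column would avoid accidental matches in other fields.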

The performance log for the target system is recorded in the file:

    node.1.one_cpu_simple.log

• timer.cycles gives the total cycles of the simulation
• timer.instructions gives the total number of executed instructions

The execution trace is recorded in the file (see your home directory):

    one_cpu_trace.log
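The two counters can be pulled out of the log with a one-line filter. This is a sketch under a loud assumption: it supposes that each log line holds a counter name followed by its value, separated by whitespace, which may not match the real COTSon log layout; `log_counter` is a hypothetical helper name.

```sh
#!/bin/sh
# Print the last value recorded for a counter in a log file.
# $1 = log file, $2 = counter name (e.g. timer.cycles).
# Assumption: lines look like "<name> <value>".
log_counter() {
    awk -v key="$2" '
        $1 == key { v = $2 }
        END { if (v == "") exit 1; print v }
    ' "$1"
}

# Usage (file name from the exercise):
#   log_counter node.1.one_cpu_simple.log timer.cycles
#   log_counter node.1.one_cpu_simple.log timer.instructions
```

Dividing the instruction counter by the cycle counter then gives the IPC of the run.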

Experimental session

Exercise: Complex application analysis.

Try to analyze the behavior of the target system architecture (CPU) when a complex application is executed.

As a complex reference application we can use the cjpeg image compressor, along with an input image (.ppm).

Modify the SimNow commands section in order to load and run the cjpeg application with the input image.


PRACTICAL TEST


COTSon installation

• Follow the COTSON USER GUIDE for the general installation procedure:
  http://sourceforge.net/p/cotson/code/HEAD/tree/trunk/doc/COTSON_USER_GUIDE- .pdf
  Here we assume that you have installed COTSon in the cotson directory

• Download: http://www.dii.unisi.it/~giorgi/teaching/hpca2/ etatools/benchmark_cjpeg.tar.gz

• Examples folder: ~/cotson/src/examples/

• Uncompress CJPEG_exercise in the examples folder:
  $ tar xzf benchmark_cjpeg.tar.gz -C ~/cotson/src/examples/


CJPEG program: benchmark

• The CJPEG program belongs to libjpeg-turbo-utils: utilities for manipulating JPEG images

• This benchmark compresses the named image file, or the standard input if no file is named, and produces a JPEG file on the standard output. The currently supported input file formats are: PPM, PGM, and so on

• In our case, this benchmark needs an INPUT ppm image and produces an OUTPUT jpeg image

• The directory cjpeg_benchmark contains input and expected output files


CJPEG

[Figure: CJPEG converts a PPM image (192KB) into a JPEG image (9.6KB)]


CJPEG compile and execution

• Launch the complete benchmark with:
  make

- Compare the files produced with those in the expected_output directory


How to launch CJPEG manually

• If you want to launch cjpeg manually, you can use:
  $ ./cjpeg < input-large.ppm > output_large.jpeg

– Are the results different?
– What does it mean?


Cache Configuration

• Example cache configurations are:

        L1 dcache                          Memory latency
        Size    Line size   Num sets
   A)   1KB     16          1              24
   B)   32KB    16          1              100

• You can modify the cache configuration inside the lua examples file

Linux commands:
• vi <file name>: open the editor
• i: insert mode
• esc: exit insert mode
• :wq: write file and quit


Set cache parameters

• Example of cache configuration for memory A:

Main memory:
- mem=Memory{ name="main", latency= }

L2 cache:
- l2=Cache{ name="l2cache", size=" kB", line_size= , latency= , num_sets= , next=mem, write_policy="WB", write_allocate="true" }

L1 instruction cache:
- ic=Cache{ name="icache", size=" kB", line_size= , latency= , num_sets= , next=l2, write_policy="WT", write_allocate="false" }

L1 data cache:
- dc=Cache{ name="dcache", size=" kB", line_size= , latency= , num_sets= , next=l2, write_policy="WT", write_allocate="false" }


What happens?

• Try to launch cjpeg with cache configuration A and cache configuration B on the large input:
  $ make run_cjpeg_benchmark_large_memoryA
  $ make run_cjpeg_benchmark_large_memoryB

– What happens to the miss rate?
– What happens to the IPC?
– Why?
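
The two metrics asked about here can be derived from raw counters. The following is a minimal Python sketch, assuming the counters have already been read from the simulation log by hand; the function names and the sample numbers are hypothetical and are not COTSon output.

```python
def ipc(instructions, cycles):
    """Instructions Per Cycle: executed instructions divided by elapsed cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def miss_rate(misses, accesses):
    """Fraction of cache accesses that miss (0.0 when there are no accesses)."""
    return misses / accesses if accesses else 0.0


# Hypothetical counters, for illustration only (not measured values).
example_ipc = ipc(instructions=2_000_000, cycles=5_000_000)
example_miss_rate = miss_rate(misses=50_000, accesses=1_000_000)
print(f"IPC = {example_ipc:.2f}, miss rate = {example_miss_rate:.1%}")
```

A lower miss rate means fewer long-latency memory accesses, so for the same instruction count the cycle count drops and the IPC rises.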


Cache Statistics

                          Memory A            Memory B
Input                     input_large.ppm     input_large.ppm
L1 dcache size              kB                  kB
CPU cycles
L1 dcache write_miss
CPU instructions

Simulation
L  read_miss_rate          .                   .
L  write_miss_rate         .                   .
Instruction Per Cycle      .                   .


Cache Statistics

                          Small Input         Large Input
L1 dcache size              kB                  kB
Main Memory Access
L  read
L1 dcache read
L1 dcache write

Simulation
Instruction per Cycle      .                   .
L  write miss rate         .                   .
L  read miss rate          .                   .


Cache Statistics

As we can see, with a small cache memory configuration we need more CPU cycles and we have more read misses. As we expect, the cache miss rate is better with the large cache than with the smaller one.

Instead, when we use a large input image, we have more main memory accesses and larger simulation numbers. In fact, a large input requires more CPU usage and a higher number of CPU operations.


DEFINING A REGION OF INTEREST


Region Of Interest (ROI)

• There are several methods to define a ROI in COTSon, depending on the accuracy that we need. The ROI selection is also called cotson trace in COTSon terminology:
  – Exit.trigger cotson trace
  – External cotson trace
  – Internal cotson trace
  – Accurate cotson trace


exit.trigger trace

options = { exit_trigger="terminate", ….. }

simnow.commands=function()
…
send_keyboard('xget cjpeg.sh cjpeg.sh; chmod +x cjpeg.sh; ./cjpeg.sh small A; xput cjpeg.sh; touch terminate; xput terminate')
end

The simulation stops once the terminate file is written from guest to host. This is a first approximation of the ROI, in which we neglect a small timing error. Simple to apply.

External cotson trace

cotson_tracer is part of the guest tools, preinstalled in the BSD. The arguments passed to cotson_tracer stand for an internal code that is reserved to activate a method to switch between functional and timing mode (selective sampler).

$ cotson_tracer          ## switch to timing (start ROI)

$ ./cjpeg < input_small.ppm > output-small-A.jpeg

$ cotson_tracer          ## back to functional (end ROI)

This method tightens the ROI: there is an error of some µs, which is negligible for long programs.

Internal cotson trace

• int main() {
      int a, , ;
      …..
      COTSON_INTERNAL( , , );   // start ROI
      <detailed region of interest>
      COTSON_INTERNAL( , , );   // end ROI
      ……
  }

This method tightens the ROI: there is a small error of a few instructions, which is negligible for large regions of interest.

Accurate cotson trace

• int main() {
      int a, , ;
      …..
      ATRACER_START( );   // start ROI for region
      <detailed region of interest>
      ATRACER_STOP( );    // end ROI for region
      ……
  }

This method is more accurate since it interferes less with the code: good to test smaller samples with higher accuracy.

Thanks!

COTSon simulation infrastructure

A more detailed picture of the simulated system and the related set of COTSon elements:

[Figure: the simulated system and the related COTSon elements]

COTSon: Timer elements

Timer elements are essentially models for both CPUs and other system devices:

• Accept instructions, process them and update the metrics;
• All timers share the memory hierarchy;
• Pluggable architecture;
• Several models are created for: CPUs; Profiling; Trace generation; SimPoint oriented analysis;
• Currently available timer types:
  - TraceStats: simple linear model;
  - Timer : simple linear model with cache hierarchy;
  - Timer : in-order pipeline model with cache hierarchy;
  - Bandwidth: only limited by the memory bandwidth;
  - PTLSim: out-of-order pipeline model with cache hierarchy;


COTSon: Sampler elements

Sampler elements are in charge of deciding when to call a timing model and for how long to run the timing simulation:

• Pluggable architecture;
• Many possible implementations;
• Decide when to change the simulation state: Functional (FN); Warming (WM); Timing simulation (TS);

[Figure: simulation time axis divided into FN, WM and TS phases]

FN: Run the fast functional simulation of the target
WM: Put data in stateful structures such as caches
TS: Simulate the target accounting timing info
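
The alternation between the three simulation states can be pictured as a repeating schedule. This is a minimal Python sketch of a fixed-interval policy, assuming hypothetical phase lengths; it only illustrates the idea and does not reproduce any actual COTSon sampler or its API.

```python
from itertools import cycle

# Hypothetical phase lengths in instructions; a real sampler chooses them by policy.
PHASES = [("FN", 1000), ("WM", 200), ("TS", 100)]


def sampler_schedule(total_instructions, phases=PHASES):
    """Split total_instructions into consecutive (state, length) slices."""
    schedule = []
    remaining = total_instructions
    for state, length in cycle(phases):
        if remaining <= 0:
            break
        run = min(length, remaining)
        schedule.append((state, run))
        remaining -= run
    return schedule


def timed_fraction(schedule):
    """Fraction of all instructions that were simulated with timing (TS)."""
    total = sum(length for _, length in schedule)
    timed = sum(length for state, length in schedule if state == "TS")
    return timed / total if total else 0.0


plan = sampler_schedule(2600)
print(plan, f"{timed_fraction(plan):.1%} of instructions timed")
```

Only the TS slices pay the cost of the detailed timing model, which is why a sampler can speed up the simulation considerably.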


http://www.dii.unisi.it/~giorgi/teaching/hpca2

High Performance Computer Architecture

Software Methods for ILP

1Roberto Giorgi, Universita' degli Studi di Siena, C218LEZ08-SL di 29


Some Compiler Technologies

• SOFTWARE BRANCH PREDICTION (in the compiler)
  - Methods illustrated in the branch prediction lesson

• STATIC SCHEDULING + LOOP UNROLLING
  - Goal: improve the performance of pipelined and multiple-issue processors
  - These are the fundamental techniques for static-issue processors, i.e., for those processors which statically issue a group of instructions that are “packed” in a single long instruction: Very Long Instruction Word (VLIW) processors
  - Such techniques often improve the performance of dynamic-issue (i.e., superscalar) processors as well
  - The impact of branches and dependences is reduced

• SOFTWARE PIPELINING


Static Scheduling – the objective

• Find unrelated instructions to be interleaved among dependent instructions in order to hide latencies/stalls
• Consider two dependent instructions (the producer instruction P and the consumer instruction C) that incur a latency of L cycles when processed in the pipeline in strict sequence
• To avoid stalls, the distance between P and C, in terms of clock cycles, can be filled with other instructions (not dependent on P and not causing dependences for C) with a total latency of at least L cycles

[Figure: timeline t (cycles); the producer P and the consumer C are linked by a dependency with P-to-C latency L=3, and the gap between them is filled by the instructions I1, I2, I3]

Hypothesis for static reordering

• The latency between the producer instruction P and the consumer instruction C varies a lot with the type of Functional Unit (FU) involved and with the specific dependency

• The compiler can (statically) reorder instructions if
  • There is enough ILP
  • The FU latencies are known to the compiler
    - This hypothesis creates a LINK between the software and the microarchitecture that “weakens” the separation layer created by the Instruction Set!


Static Scheduling: pipeline hypothesis

• 5-stage standard pipeline
• Pipelined Functional Units (or replicated as many times as the operation latency)
  goal: to launch one operation of the given type per cycle
• No structural hazards

Producing Instr.    Consumer Instr.            CNSL
FP ALU op           Another FP ALU op          3
FP ALU op           Store double               2
Load double         FP ALU op / INT ALU op     1
Branch              ---                        1
INT ALU op          Branch                     1
Load double         Store double               0
INT ALU op          INT ALU op                 0

CNSL = No-Stall Latency Cycles


Example

for (k = 1000; k > 0; --k)
    x[k] = a + x[k];

• “Parallel Loop”: iterations are independent
• Assembly translation:

Loop: L.D    F0, 0(R1)     ;F0=array elem.
      ADD.D  F4,F0,F2      ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
      SUBI   R1,R1,#8      ;decrement pointer
      BNEZ   R1, Loop      ;branch

R1 is initially 8000


Version-0: NON-SCHEDULED loop

                           C_issue (clock cycles)
Loop: L.D    F0, 0(R1)     1
      Stall                2
      ADD.D  F4,F0,F2      3
      Stall                4
      Stall                5
      S.D    F4, 0(R1)     6
      SUBI   R1,R1,#8      7
      Stall                8
      BNEZ   R1, Loop      9
      Stall                10

10 cycles per iteration (5 stall cycles). Rewrite the code to reduce the stalls.
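
The stall counts in this listing follow mechanically from the no-stall latency (CNSL) table. Below is a minimal Python sketch, assuming a single-issue in-order pipeline with one branch delay slot; the encoding of instructions as (kind, destination, sources) tuples and the function name are illustrative choices, not part of the course material.

```python
# No-stall latencies (CNSL) from the table: (producer kind, consumer kind) -> cycles.
# Pairs that the table does not list are assumed to need 0 extra cycles.
CNSL = {
    ("fp", "fp"): 3,
    ("fp", "store"): 2,
    ("load", "fp"): 1,
    ("load", "int"): 1,
    ("int", "branch"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}
BRANCH_DELAY = 1  # the "Branch --- 1" row: one slot after every branch


def cycles_per_iteration(body):
    """Issue cycle of the last slot for one pass over body (in-order, single issue)."""
    last_writer = {}  # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, sources in body:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                producer_kind, issued = last_writer[reg]
                latency = CNSL.get((producer_kind, kind), 0)
                earliest = max(earliest, issued + 1 + latency)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (kind, cycle)
    if body[-1][0] == "branch":
        cycle += BRANCH_DELAY  # an empty delay slot counts as a stall
    return cycle


version0 = [
    ("load", "F0", ["R1"]),         # L.D   F0,0(R1)
    ("fp", "F4", ["F0", "F2"]),     # ADD.D F4,F0,F2
    ("store", None, ["F4", "R1"]),  # S.D   F4,0(R1)
    ("int", "R1", ["R1"]),          # SUBI  R1,R1,#8
    ("branch", None, ["R1"]),       # BNEZ  R1,Loop
]
version1 = [
    ("load", "F0", ["R1"]),         # L.D   F0,0(R1)
    ("int", "R1", ["R1"]),          # SUBI  R1,R1,#8
    ("fp", "F4", ["F0", "F2"]),     # ADD.D F4,F0,F2
    ("branch", None, ["R1"]),       # BNEZ  R1,Loop (delayed branch)
    ("store", None, ["F4", "R1"]),  # S.D   8(R1),F4 in the delay slot
]
print(cycles_per_iteration(version0), cycles_per_iteration(version1))
```

The sketch only tracks dependences inside one iteration, so the position of an individual stall may differ from the slide, while the total per iteration is the same.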


Version-1: Scheduled loop

                           C_issue (clock cycles)
Loop: L.D    F0, 0(R1)     1
      SUBI   R1,R1,#8      2
      ADD.D  F4,F0,F2      3
      Stall                4
      BNEZ   R1, Loop      5    ;delayed branch
      S.D    8(R1),F4      6    ;altered effective address

i) Anticipate the SUBI, but need to change the offset of the store from 0(.) to 8(.)
ii) Move S.D after BNEZ by exploiting the branch delay-slot

6 cycles per iteration (1 stall cycle). Observation: only 3 instructions really operate on the vector x[.] (L.D, ADD.D, S.D).

It's possible to unroll the loop 4 times, to expose more ILP for the static scheduling.


Version-2: Unrolling 4 times the Loop

Loop: L.D    F0,0(R1)       ; 1 stall cycle after each L.D
      ADD.D  F4,F0,F2       ; 2 stall cycles after each ADD.D
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4      ; drop SUBI&BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4     ; drop SUBI&BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

This loop requires 28 cycles per iteration, 14 of which are stall cycles:
- each L.D has 1 stall cycle
- each ADD.D has 2
- the SUBI has 1
- the BNEZ has 1
Adding the 14 instruction-issue cycles gives 28 cycles, or 28/4 = 7 cycles to process each element of the array (slower than the scheduled Version-1). Rewrite the loop to reduce the stalls.

Hypothesis: the total number of iterations is a multiple of 4. OK in our case (1000 iterations)
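
At source level, the unrolling replicates the loop body four times per loop test. A minimal Python sketch of the transformation follows, assuming 0-based indexing (the slide uses x[1..1000]) and hypothetical function names; it checks only that the results are identical, not the timing.

```python
def add_scalar(x, a):
    """Reference loop: x[k] = a + x[k] for every element, last to first."""
    for k in range(len(x) - 1, -1, -1):
        x[k] = a + x[k]
    return x


def add_scalar_unrolled4(x, a):
    """Same computation with the body replicated 4 times per loop test."""
    if len(x) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    k = len(x) - 1
    while k >= 0:
        x[k] = a + x[k]
        x[k - 1] = a + x[k - 1]
        x[k - 2] = a + x[k - 2]
        x[k - 3] = a + x[k - 3]
        k -= 4  # one index update and one branch per 4 elements
    return x


data = [float(v) for v in range(1000)]
print(add_scalar_unrolled4(list(data), 2.5) == add_scalar(list(data), 2.5))
```

The explicit length check mirrors the hypothesis that the iteration count is a multiple of 4; a general compiler would add a clean-up loop for the remaining iterations.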


Removing the “Name Dependencies”

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F0,-8(R1)
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4      ; drop SUBI&BNEZ
      L.D    F0,-16(R1)
      ADD.D  F4,F0,F2
      S.D    -16(R1),F4     ; drop SUBI&BNEZ
      L.D    F0,-24(R1)
      ADD.D  F4,F0,F2
      S.D    -24(R1),F4
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

Every copy of the body reuses F0 and F4: these are name dependencies. How to eliminate them?


Using STATIC register renaming

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    0(R1),F4       ; drop SUBI&BNEZ
      L.D    F6,-8(R1)
      ADD.D  F8,F6,F2
      S.D    -8(R1),F8      ; drop SUBI&BNEZ
      L.D    F10,-16(R1)
      ADD.D  F12,F10,F2
      S.D    -16(R1),F12    ; drop SUBI&BNEZ
      L.D    F14,-24(R1)
      ADD.D  F16,F14,F2
      S.D    -24(R1),F16
      SUBI   R1,R1,#32      ; alter to 4*8
      BNEZ   R1,Loop

Register Renaming: each unrolled copy of the body uses its own pair of registers
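
The effect of renaming can be checked by listing the WAR and WAW pairs between instructions. This is a minimal Python sketch, assuming each instruction is reduced to a (destination, sources) pair over the FP registers only; the helper name is hypothetical and the check is a naive all-pairs scan.

```python
def name_dependences(body):
    """List (kind, earlier index, later index, register) for WAR and WAW pairs."""
    found = []
    for later, (dest, _) in enumerate(body):
        if dest is None:
            continue
        for earlier in range(later):
            earlier_dest, earlier_sources = body[earlier]
            if earlier_dest == dest:
                found.append(("WAW", earlier, later, dest))
            if dest in earlier_sources:
                found.append(("WAR", earlier, later, dest))
    return found


# Two unrolled copies of the body, before renaming.
before = [
    ("F0", []),            # L.D   F0,0(R1)
    ("F4", ["F0", "F2"]),  # ADD.D F4,F0,F2
    (None, ["F4"]),        # S.D   0(R1),F4
    ("F0", []),            # L.D   F0,-8(R1)
    ("F4", ["F0", "F2"]),  # ADD.D F4,F0,F2
    (None, ["F4"]),        # S.D   -8(R1),F4
]
# The same two copies after static renaming (F0 -> F6, F4 -> F8 in the second copy).
after = [
    ("F0", []),            # L.D   F0,0(R1)
    ("F4", ["F0", "F2"]),  # ADD.D F4,F0,F2
    (None, ["F4"]),        # S.D   0(R1),F4
    ("F6", []),            # L.D   F6,-8(R1)
    ("F8", ["F6", "F2"]),  # ADD.D F8,F6,F2
    (None, ["F8"]),        # S.D   -8(R1),F8
]
print(len(name_dependences(before)), len(name_dependences(after)))
```

With no name dependences left, only the true (RAW) dependences inside each copy constrain the scheduler, so loads from different copies can be grouped together.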


Version-3: Unrolled Loop with fewer stalls

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    0(R1),F4
      S.D    -8(R1),F8
      SUBI   R1,R1,#32
      S.D    16(R1),F12
      BNEZ   R1,Loop
      S.D    8(R1),F16

This loop will run 14 cycles (no stalls) per iteration, or 14/4 = 3.5 cycles for each element!

Assumptions that make this possible:
- move the L.Ds before the S.Ds
- move S.D after SUBI and BNEZ
- use different registers

When is it safe for the compiler to do such changes?

Steps Compiler Performed to Unroll

• Determine that it is OK to move the S.D after SUBI and BNEZ, and find the amount to adjust the S.D offset
• Determine that unrolling the loop would be useful by finding that the loop iterations were independent
• Rename registers to avoid name dependencies
• Eliminate extra test and branch instructions and adjust the loop termination and iteration code
• Determine that loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent
  • requires analyzing memory addresses and finding that they do not refer to the same address
• Schedule the code, preserving any dependences needed to yield the same result as the original code


VLIW PROCESSORS


“Multiple Issue” PROCESSORS

• Motivation: overcome the limitation of standard pipelines, where

  CPI_REAL_pipeline = CPI_IDEAL_pipeline + C_structural_stalls + C_RAW_stalls + C_WAR_stalls + C_WAW_stalls + C_control_stalls

• Action point: (assuming all the C_stalls terms are 0) go below CPI_IDEAL_pipeline (=1) by issuing more instructions per cycle

• Multiple Issue processor implementations:
  • Superscalar
    - Exploits static scheduling (using the examined compiler technologies)
    - Exploits dynamic scheduling (using, e.g., Tomasulo’s algorithm)
  • VLIW (Very Long Instruction Word)
    - Uses only static scheduling! (saves hardware and power)
    - EVERY INSTRUCTION GROUPS OPERATIONS THAT GO IN PARALLEL (EXPLICIT PARALLELISM, or EPIC: Explicit Parallel Instruction Computer)

Static Scheduled Superscalar MIPS

• Superscalar MIPS: issue 2 instructions per cycle
  • 1 instruction goes in the Floating Point pipeline
  • 1 instruction goes in the Integer pipeline (or Load/Store and Branch)
• At every clock cycle, 2 instructions are fetched
• The second instruction can be issued only if it goes in a different pipe
  - If an instruction stalls, then the fetch must be blocked

[Figure: pipeline diagram over time in clocks; in each cycle one integer (I) and one floating-point (FP) instruction enter the pipeline together and go through the F, D, X, M, W stages]

Note: the FP operations may have a longer X stage!


Version-4: Superscalar Loop Unrolling

       Integer Instr.              FP Instr.
  1    Loop: L.D F0,0(R1)
  2    L.D F6,-8(R1)
  3    L.D F10,-16(R1)             ADD.D F4,F0,F2
  4    L.D F14,-24(R1)             ADD.D F8,F6,F2
  5    L.D F18,-32(R1)             ADD.D F12,F10,F2
  6    S.D 0(R1),F4                ADD.D F16,F14,F2
  7    S.D -8(R1),F8               ADD.D F20,F18,F2
  8    S.D -16(R1),F12
  9    SUBI R1,R1,#40
 10    S.D 16(R1),F16
 11    BNEZ R1,Loop
 12    S.D 8(R1),F20

Unrolled 5 times to avoid delays.

This loop will run 12 cycles (no stalls) per iteration, or 12/5 = 2.4 cycles for each element of the array.

Initial Multiple Issue Processors

• Superscalar
  • IBM PowerPC, Sun UltraSparc, DEC Alpha, HP 8000
    - The number of instructions scheduled per cycle (also called ‘ways’) ranges between 1 and 8
    - Static scheduling (compiler) or dynamic (e.g., Tomasulo)

• (Very) Long Instruction Words (V)LIW
    - Crusoe VLIW processor [www.transmeta.com]
    - Intel Architecture-64 (IA-64), 64-bit address
    - Majority of DSPs (Digital Signal Processors): Texas Instruments C6000, PowerPC ... , ST Microelectronics ...
    - The number of operations in a SINGLE instruction (also called “bundle”) varies from 4 to 16

Note: since we want CPI < 1, it is handier to use the IPC metric (Instructions Per Cycle)


VLIW Implementation

[Figure: an I-fetch & issue unit drives several Functional Units and Memory Ports; all of them are connected to a Multi-Ported Register File, and the memory ports move data to and from memory]

VLIW principles

• VLIWs directly use multiple independent functional units
• VLIWs package the multiple operations into one very long instruction
• The compiler is responsible for choosing the instructions to be issued simultaneously

[Figure: pipeline diagram over time in clocks; the instructions Ii and Ii+1 each go through IF, ID, three parallel E (execute) slots and W]

Version-5: VLIW Loop Unrolling

     Mem. Ref. 1        Mem. Ref. 2        FP1                 FP2                 Int/Branch
 1   L.D F2,0(R1)       L.D F6,-8(R1)
 2   L.D F10,-16(R1)    L.D F14,-24(R1)
 3   L.D F18,-32(R1)    L.D F22,-40(R1)    ADD.D F4,F0,F2      ADD.D F8,F0,F6
 4   L.D F26,-48(R1)                       ADD.D F12,F0,F10    ADD.D F16,F0,F14
 5                                         ADD.D F20,F0,F18    ADD.D F24,F0,F22
 6   S.D 0(R1),F4       S.D -8(R1),F8      ADD.D F28,F0,F26                        SUBI R1,R1,#56
 7   S.D 40(R1),F12     S.D 32(R1),F16
 8   S.D 24(R1),F20     S.D 16(R1),F24                                             BNEZ R1,Loop
 9   S.D 8(R1),F28

In this version the scalar is held in F0. The stores issued after the SUBI use offsets increased by 56.

Unrolled 7 times to avoid delays: 7 results in 9 clocks, or 1.3 clocks per element (1.8X)

Average: 2.5 ops per clock, 50% efficiency

Note: Need more registers in VLIW (15 vs. 11 in SS)
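
The averages quoted for the VLIW schedule can be recomputed from the operation counts. A minimal Python sketch, assuming the counts read off the schedule (7 loads, 7 FP adds, 7 stores, 1 SUBI, 1 BNEZ) and 5 issue slots per long instruction; the variable names are illustrative.

```python
# Operations in one unrolled iteration of the VLIW schedule (counts read off the table).
ops = {"loads": 7, "fp_adds": 7, "stores": 7, "subi": 1, "bnez": 1}
cycles = 9              # long instructions issued per unrolled iteration
slots_per_cycle = 5     # 2 memory + 2 FP + 1 integer/branch
elements = 7            # array elements processed per unrolled iteration

total_ops = sum(ops.values())
ops_per_clock = total_ops / cycles
efficiency = total_ops / (cycles * slots_per_cycle)
clocks_per_element = cycles / elements

print(f"{total_ops} ops, {ops_per_clock:.2f} ops/clock, "
      f"{efficiency:.0%} of slots used, {clocks_per_element:.2f} clocks/element")
```

The exact figures (about 2.56 ops per clock, 51% slot usage, 1.29 clocks per element) agree with the rounded values quoted on the slide.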


“Multiple Issue” processor challenges

• While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  • Exactly 50% FP operations
  • No hazards
• If more instructions issue at the same time, decode and issue become more difficult
  • Even 2-scalar (2 op. per cycle) => examine 2 opcodes, 6 registers, and decide if 1 or 2 instructions can issue
• VLIW: trade off instruction space for simple decoding
  • The long instruction word has room for many operations
  • By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
    - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  • Need a compiling technique that schedules across several branches


Loop-Carried Dependence

• Let’s consider:

for (i=0; i<8; i=i+1) {
    A = A + C[i];    /* S1 */
}

• We can exploit the “natural” parallelism that stems from the associative operator:

“Cycle 1”: temp0 = C[0] + C[1];
           temp1 = C[2] + C[3];
           temp2 = C[4] + C[5];
           temp3 = C[6] + C[7];

“Cycle 2”: temp4 = temp0 + temp1;
           temp5 = temp2 + temp3;

“Cycle 3”: A = temp4 + temp5;
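
The three "cycles" above form a balanced reduction tree. Here is a minimal Python sketch of the same idea for any number of values, assuming A starts at 0 as the slide implicitly does; the function name is hypothetical, and with floating-point data the reassociation may change rounding.

```python
def tree_sum(values):
    """Sum values pairwise, level by level; return (total, number of levels)."""
    level = list(values)
    if not level:
        return 0, 0
    levels = 0
    while len(level) > 1:
        # All additions of one level are independent and could run in parallel.
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element is carried to the next level
        level = paired
        levels += 1
    return level[0], levels


C = [3, 1, 4, 1, 5, 9, 2, 6]
total, levels = tree_sum(C)
print(total, levels)
```

For 8 values the tree needs 3 levels instead of the 8 dependent additions of the sequential loop.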


The iterations may NOT be easily unrollable!

• Let’s consider the following code (A, B, C are distinct and non-overlapped in memory):

for (i=0; i<100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

• We have the following dependencies:
  1) S2 uses the value A[i+1], computed by S1 in the same iteration
  2) S1 uses the value A[i], computed by S1 itself in an earlier iteration (when it was named A[i+1])
  3) Same dependence in S2 for B[i], computed in the previous iteration by S2 itself when it was B[i+1]
• Even if we unroll, the dependencies create stalls


Method to reduce the loop-carried dependencies

• Let’s consider this other code:

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

• Let’s transform it by overlapping part of the iterations:

A[1] = A[1] + B[1];
for (i=1; i<100; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

• Now it’s possible to unroll without loop-carried dependencies
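
That the transformed loop computes the same values as the original one can be checked directly. A minimal Python sketch, assuming arrays of 102 integers so that indices 1..101 are valid; the function names and test data are illustrative.

```python
def original(A, B, C, D):
    """S1 reads B[i], which S2 wrote in the previous iteration (loop-carried)."""
    A, B = list(A), list(B)
    for i in range(1, 101):
        A[i] = A[i] + B[i]        # S1
        B[i + 1] = C[i] + D[i]    # S2
    return A, B


def transformed(A, B, C, D):
    """Shifted version: each iteration now uses the B value it has just produced."""
    A, B = list(A), list(B)
    A[1] = A[1] + B[1]
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]
        A[i + 1] = A[i + 1] + B[i + 1]
    B[101] = C[100] + D[100]
    return A, B


size = 102
A0 = [3 * i for i in range(size)]
B0 = [i * i for i in range(size)]
C0 = [i + 7 for i in range(size)]
D0 = [2 * i - 5 for i in range(size)]
print(original(A0, B0, C0, D0) == transformed(A0, B0, C0, D0))
```

In the transformed loop no statement depends on a value produced by a previous iteration, which is what makes the iterations independent.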


Another possibility: Software Pipelining

• Observation: if the iterations of a loop are independent, then we can get more ILP by taking instructions from different iterations

• Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop
  • Some call this a sort of Software-Tomasulo

[Figure: one instruction is taken from each of Iteration 0, Iteration 1, Iteration 2 and Iteration 3 to form a single software-pipelined iteration]

Software Pipelining benefits

• Increases the distance between the production and the consumption of a result
• The code is more compact than with unrolling
• Only one "transitory" filling and draining of the pipeline (compared to unrolling, where it happens at each iteration)

[Figure: number of overlapped ops over time, for the SW Pipeline and for the Loop Unrolled case]

Version-6: Software Pipelining

V-2: 4-times unrolling (the instruction numbers are referenced below)
  1  L.D    F0,0(R1)
  2  ADD.D  F4,F0,F2
  3  S.D    0(R1),F4
  4  L.D    F0,-8(R1)
  5  ADD.D  F4,F0,F2
  6  S.D    -8(R1),F4
  7  L.D    F0,-16(R1)
  8  ADD.D  F4,F0,F2
  9  S.D    -16(R1),F4
 10  L.D    F0,-24(R1)
 11  ADD.D  F4,F0,F2
 12  S.D    -24(R1),F4
 13  SUBI   R1,R1,#32
 14  BNEZ   R1,LOOP

Code before the loop:
     L.D    F0,0(R1)
     ADD.D  F4,F0,F2
     L.D    F0,-8(R1)
     SUBI   R1,R1,#16

Software-pipelined loop:
  3  S.D    16(R1),F4    ; Store X[i]
 13  SUBI   R1,R1,#8
  5  ADD.D  F4,F0,F2     ; Add X[i-1]
 14  BNEZ   R1,LOOP
  7  L.D    F0,8(R1)     ; Load X[i-2]

Code after the loop:
     S.D    16(R1),F4
     ADD.D  F4,F0,F2
     S.D    8(R1),F4

5 cycles per iteration and per element
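
The structure of start-up code, steady-state loop and finish-up code can be mirrored at source level. A minimal Python sketch, assuming 0-based indexing and at least 3 elements; the variables f0 and f4 play the role of registers F0 and F4, and the function name is hypothetical.

```python
def add_scalar_sw_pipelined(x, a):
    """x[k] = a + x[k] for all k, with store/add/load taken from 3 different iterations."""
    n = len(x)
    if n < 3:
        raise ValueError("this sketch needs at least 3 elements")
    # Start-up code: fill the software pipeline.
    f0 = x[n - 1]          # load  X[n-1]
    f4 = a + f0            # add   X[n-1]
    f0 = x[n - 2]          # load  X[n-2]
    # Steady state: each pass stores X[k], adds X[k-1], loads X[k-2].
    for k in range(n - 1, 1, -1):
        x[k] = f4
        f4 = a + f0
        f0 = x[k - 2]
    # Finish-up code: drain the software pipeline.
    x[1] = f4
    f4 = a + f0
    x[0] = f4
    return x


print(add_scalar_sw_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

Inside the steady-state loop, the value stored was added one pass earlier and loaded two passes earlier, which is what separates each producer from its consumer.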


Summary Table

VERSION   CYCLES x ITERATION   CYCLES x ELEMENT   SPEEDUP   COMMENTS
V-0       10                   10                 1         Basic Version
V-1       6                    6                  1.67      Instr. Scheduling
V-2       28                   7                  1.43      Loop Unrolling (LU) x4
V-2a      44                   6.29               1.59      Loop Unrolling (LU) x7
V-3       14                   3.5                2.86      Instr. Sch. + LUx4
V-3a      23                   3.29               3.04      Instr. Sch. + LUx7
V-4       12                   2.4                4.17      Superscalarx2 (LUx5)
V-4a      16                   2.29               4.37      Superscalarx2 (LUx7)
V-5       9                    1.3                7.69      VLIWx5 (LUx7)
V-6       5                    5                  2         Software Pipelining
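
The speedup column is the 10 cycles per element of the basic version divided by the cycles per element of each version. A minimal Python sketch of that arithmetic, assuming the cycle counts and unroll factors stated on the individual slides (for V-5: 9 clocks for 7 elements); the dictionary layout is illustrative.

```python
BASELINE_CYCLES_PER_ELEMENT = 10.0  # Version-0


def speedup(cycles_per_iteration, elements_per_iteration):
    """Speedup over Version-0, computed from exact (unrounded) cycles per element."""
    cycles_per_element = cycles_per_iteration / elements_per_iteration
    return BASELINE_CYCLES_PER_ELEMENT / cycles_per_element


# version -> (cycles per iteration, array elements processed per iteration)
versions = {
    "V-0": (10, 1),
    "V-1": (6, 1),
    "V-2": (28, 4),
    "V-3": (14, 4),
    "V-4": (12, 5),
    "V-5": (9, 7),
    "V-6": (5, 1),
}
for name, (cycles, elements) in versions.items():
    print(f"{name}: {cycles / elements:.2f} cycles/element, speedup {speedup(cycles, elements):.2f}")
```

Working from exact fractions can differ in the last digit from values derived from rounded table entries: for V-5, 9/7 cycles per element gives about 7.78, while the rounded 1.3 gives 7.69.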
