Introduction to Computer Architecture
cs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf

Dr. Martin Land
Computer Architecture, Hadassah College, Spring 2019


Post on 11-Apr-2020


Page 1: Introduction to Computer Architecture · cs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture, Hadassah College, Spring 2019 · Introduction



What is Computer Architecture?

Computer Architecture (from Wikipedia, the free encyclopedia)

In computer engineering, computer architecture is a set of rules and methods that describe the functionality, organization, and implementation of computer systems. Some definitions of architecture define it as describing the capabilities and programming model of a computer but not a particular implementation. In other definitions computer architecture involves instruction set architecture design, microarchitecture design, logic design, and implementation.



Translation:

computer architecture = { rules and methods that describe
    Functionality: system capabilities and programming model
    Organization: instruction set architecture, microarchitecture
    Implementation: logic design
}


Computer Architecture: What Rules and Methods?

Performance
    Low run time: fast programs
    Low latency: no waiting between programs and operations
Low energy consumption
    Low electric bills
    Long battery life
    No overheating
Market factors
    Low cost (in relation to realistic demand for devices)
    Reliable manufacture and delivery
    Profitability


Computing Platform by Application

Workstation applications
    Office, basic number crunching, graphics, gaming
    A few sequential loop-oriented threads
    Typical CPU: Intel x86 (2 to 16 cores)
Mobile applications
    Low-power version of workstation
    Typical CPU: ARM (1 to 4 cores)
Online Transaction Processing (OLTP)
    Banking, order processing, inventory, student information systems
    Thousands of independent SQL transactions with memory latency
    Typical CPU: SPARC (64 to 256 cores)
Supercomputer applications
    Heavy number crunching, data mining
    Thousands of separable sequential loop-oriented threads
    Typical CPU: IBM Power (up to 512 Kcores)


Mainframe + Virtualization + Cloud

Mainframe
    120 CPU cores + 3840 GB RAM + 8 GB/s I/O + reliability
    Replaces 10 to 1000 servers
    Complex partitioning
        Allocate hardware subsystems as needed
        Multiple independent operating systems
Server virtualization
    Software over OS partitions hardware resources
    Multiple guest operating systems over OS
Cloud computing
    Provider sells standard system interface as a service
        Infrastructure as a Service, Platform as a Service, Software as a Service
    Customer sees system specified in contract
    Provider handles operations + administration + maintenance (OAM)


Introduction to Performance


Basic Definitions

Performance
    Processing speed
Performance measures
    Response time
        Elapsed time from start to finish of a defined task
    Run time
        Response time for a start-to-finish program task
    Latency (waiting time)
        Excess response time; depends on context
    Throughput
        Number of defined tasks performed per unit time
    Speedup (improvement)

$$ S = \frac{\text{old run time}}{\text{new run time}}, \qquad S > 1 \;\Rightarrow\; \text{new run time} < \text{old run time} $$


Run Time and Clock Cycles

CPU is timed by a periodic signal called a clock
    Clock cycle (CC) measured in seconds per cycle
    Clock rate = cycles per second = Hz (Hertz)
    Instruction requires 1 or more clock cycles to process
Higher clock rate ⇒ shorter run time
Fewer clock cycles (at constant clock rate) ⇒ shorter run time

$$ \text{Run time} = \text{clock cycles to run program} \times \frac{\text{seconds}}{\text{clock cycle}} = \frac{\text{clock cycles to run program}}{\text{clock cycles per second}} $$
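The run-time relation can be written as a small C helper. This is only a sketch: the function name `run_time_seconds` and its parameter names are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Run time = clock cycles / clock rate.
 * cycles        : clock cycles needed to run the program
 * clock_rate_hz : clock cycles per second (Hz), must be positive
 */
double run_time_seconds(double cycles, double clock_rate_hz)
{
    assert(clock_rate_hz > 0.0);
    return cycles / clock_rate_hz;
}
```

For example, a program that needs 8 × 10^9 cycles on a 4 GHz clock runs for `run_time_seconds(8e9, 4e9)` = 2 seconds. Halving the cycle count or doubling the clock rate halves the result.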

 


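Speedup and Clock Rate

Speedup follows from
    Higher clock rate
    Fewer program clock cycles
        Improvements to code
        Structural improvements in hardware

$$
\begin{aligned}
S = \frac{T_{old}}{T_{new}}
  &= \frac{\text{program clock cycles}_{old} \times \text{seconds per clock cycle}_{old}}{\text{program clock cycles}_{new} \times \text{seconds per clock cycle}_{new}} \\
  &= \frac{\text{program clock cycles}_{old} \,/\, \text{clock rate}_{old}}{\text{program clock cycles}_{new} \,/\, \text{clock rate}_{new}} \\
  &= \frac{\text{program clock cycles}_{old}}{\text{program clock cycles}_{new}} \times \frac{\text{clock rate}_{new}}{\text{clock rate}_{old}}
\end{aligned}
$$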
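The speedup relation separates into a cycle-count factor and a clock-rate factor. A minimal sketch in C, with the function name `speedup` and its parameters chosen here for illustration only:

```c
#include <assert.h>

/* S = (cycles_old / cycles_new) * (rate_new / rate_old)
 * Cycle counts are program clock cycles; rates are in Hz.
 */
double speedup(double cycles_old, double rate_old_hz,
               double cycles_new, double rate_new_hz)
{
    assert(cycles_new > 0.0 && rate_old_hz > 0.0);
    double cycle_factor = cycles_old / cycles_new;   /* fewer cycles   */
    double clock_factor = rate_new_hz / rate_old_hz; /* faster clock   */
    return cycle_factor * clock_factor;
}
```

With an unchanged cycle count, moving from 1 GHz to 4 GHz gives `speedup(1000, 1e9, 1000, 4e9)` = 4. Halving the cycle count at the same time would give 8, since the two factors multiply.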


Factors Affecting Run Time

CPU hardware
    Hardware → average clock cycles (CC) required per instruction
Memory (RAM + cache)
    Quantity and organization affect data availability
Internal communication and I/O
    Speed and organization affect data availability
Operating system efficiency
    CPU devotes less time to dense OS code
    OS manages tasks/threads to keep hardware busy
Compiler
    Converts high-level language to machine code
    Optimized code runs faster
Special hardware
    Dedicated processors (graphics, memory management)
Application code
    Efficient algorithms, data structures, parallelization


Examples of Factors Affecting Performance


CPU Hardware Example: Multiple Core Processors

N-core symmetric multiprocessor (SMP)
    N complete CPUs on one chip
    Divide work among N processors
Each CPU has multiple execution units (EU)
    ALU operates on integers
    FPU operates on float / double
    Vector processor operates on long registers
OS assigns threads to each core
    If program threads are separable
    If data structures are not too entangled

[Figure: Dual Core Processor. CPU 0 and CPU 1, each with Registers, Execution Core (ALUs), and Cache; Main Memory; PCI Bridge; I/O Bus]


CPU Hardware Example: Vector Processor

Vector processor
    SIMD: Single Instruction Multiple Data
    Performs same operation on 4, 8, or 16 bytes in parallel
        No carry/borrow between bytes
Example
    64-bit source and destination registers
    PARALLEL_ADD on 8 pairs of byte operands
        DEST[0…7] ← SRC[0…7] + DEST[0…7]
        DEST[8…15] ← SRC[8…15] + DEST[8…15]
        …
        DEST[56…63] ← SRC[56…63] + DEST[56…63]

[Figure: 64-bit SRC and DEST registers, each divided into eight byte fields (bits 63-56, 55-48, 47-40, 39-32, 31-24, 23-16, 15-8, 7-0); SRC + DEST gives DEST]
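The effect of PARALLEL_ADD can be imitated in portable C on an ordinary 64-bit integer. The sketch below is a software emulation, not how a vector unit is built. The function name `parallel_add_bytes` is hypothetical, and the masking trick is a standard technique assumed here rather than taken from the slides.

```c
#include <assert.h>   /* lets callers check results with assert() */
#include <stdint.h>

/* Add eight packed bytes lane by lane, with no carry between lanes.
 * Step 1: add the low 7 bits of every byte; a carry can reach bit 7
 *         of its own lane but never crosses into the next lane.
 * Step 2: fix up bit 7 of each lane with XOR, which adds the two top
 *         bits and discards the carry out of the lane.
 */
uint64_t parallel_add_bytes(uint64_t src, uint64_t dest)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;
    const uint64_t top1 = 0x8080808080808080ULL;

    uint64_t partial = (src & low7) + (dest & low7);
    return partial ^ ((src ^ dest) & top1);
}
```

Adding 0x01 to a lane that holds 0xFF wraps that lane to 0x00 and leaves its neighbour untouched, which is the "no carry/borrow between bytes" behaviour described above.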


Memory Example: Hybrid Data Structure

Graphic array
    200 vertex points = 25 groups of 8 words
Hybrid data structure for efficient vector processing
    Coordinates and colors stored in separate data structures
    Structures handled in concurrent threads on separate CPUs
Coordinates
    struct { float x[8], y[8], z[8]; } H_xyz[25];
    8-word group loaded and processed as vector on CPU 0
    Each loop updates 8 x-coordinates, then 8 y's, then 8 z's
Colors
    struct { float r[8], g[8], b[8]; } H_rgb[25];
    8-word group loaded and processed as vector on CPU 1
    Each loop updates 8 reds, then 8 greens, then 8 blues
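A loop over the coordinate structure might look like the following sketch. The struct tag `xyz_group`, the function `translate_vertices`, and the choice of a translation as the update are all assumptions made for illustration. The slides only give the struct layout and the update order (8 x's, then 8 y's, then 8 z's).

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8   /* words per vector group, as in the struct above */

struct xyz_group { float x[LANES], y[LANES], z[LANES]; };

/* Shift every vertex by (dx, dy, dz).
 * Each inner loop touches 8 contiguous floats, the access pattern a
 * vectorizing compiler or a SIMD unit can handle as one vector. */
void translate_vertices(struct xyz_group *groups, size_t n_groups,
                        float dx, float dy, float dz)
{
    assert(groups != NULL || n_groups == 0);
    for (size_t g = 0; g < n_groups; g++) {
        for (size_t i = 0; i < LANES; i++) groups[g].x[i] += dx;
        for (size_t i = 0; i < LANES; i++) groups[g].y[i] += dy;
        for (size_t i = 0; i < LANES; i++) groups[g].z[i] += dz;
    }
}
```

A matching function over the color structure could run in a second thread, since the two structures share no data.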


Memory Example: Color Data Structure

Addressing in 32-bit processors
    Processor sends 32-bit aligned address A (multiple of 4)
    Reads 4-byte word: bytes from addresses A, A+1, A+2, A+3
        Access to individual byte requires reading entire dword
24-bit True Color
    3 color bytes: Red, Green, Blue
    2^8 = 256 levels per color (0x00 – 0xFF)
    Packed 24-bit colors are often split between dwords (2 of every 4 pixels)
    Access to a split pixel color ⇒ 2 memory cycles
32-bit True Color
    Pad 24-bit color with blank byte
    Align color data on 32-bit addresses
    One memory cycle per pixel

[Figure: 24-bit layout. Four R G B pixels packed into three dwords; access takes 1 cycle, 2 cycles, 2 cycles, 1 cycle]
[Figure: 32-bit layout. Each dword holds one pixel as R G B plus a pad byte]
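The cycle counts in the figure can be reproduced by counting how many aligned dwords a pixel's three color bytes touch. The helper names `dword_reads` and `pixel_reads` are hypothetical, and the sketch assumes pixels are stored back to back starting at address 0.

```c
#include <assert.h>

/* Number of aligned 4-byte words covered by size_bytes bytes
 * starting at byte_offset. */
unsigned dword_reads(unsigned byte_offset, unsigned size_bytes)
{
    assert(size_bytes > 0);
    unsigned first = byte_offset / 4;
    unsigned last  = (byte_offset + size_bytes - 1) / 4;
    return last - first + 1;
}

/* Memory cycles to fetch the 3 color bytes of one pixel.
 * bytes_per_pixel is 3 for packed 24-bit color, 4 for padded 32-bit. */
unsigned pixel_reads(unsigned pixel_index, unsigned bytes_per_pixel)
{
    assert(bytes_per_pixel >= 3);
    return dword_reads(pixel_index * bytes_per_pixel, 3);
}
```

For packed 24-bit pixels 0 to 3 this gives 1, 2, 2, 1 reads, matching the figure. With 4 bytes per pixel every pixel needs exactly one read.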


Compiler Efficiency Example

C code compiled inefficiently for Intel 8086 processor

    main()
    {
        int i, j;
        for (i = 0; i < 10; i++) {
            j = 2 * i;
        }
    }

    0000  MOV WORD PTR [BP-02],0000   ; i = 0
    0005  CMP WORD PTR [BP-02],+0A
    0009  JGE 0018                    ; break on i ≥ 10
    000B  MOV AX,[BP-02]              ; AX ← i
    000E  SHL AX,1                    ; AX ← 2 * AX
    0010  MOV [BP-04],AX              ; j ← AX
    0013  INC WORD PTR [BP-02]        ; i++
    0016  JMP 0005                    ; loop
    0018  RET


Clock Cycles per Instruction

[Figure: page from Intel 8086 manual. Source: 80186/80188 High-Integration 16-Bit Microprocessors, copyright © Intel Corporation, 1995]


Program Timing for 8086

Instruction timings are given in 8086 manual (in clock cycles)

    Label   Instruction                  Type                     8086 Clock Cycles (CC)
            MOV WORD PTR [BP-02],0000    MOV imm to r/m           4/13
    start:  CMP WORD PTR [BP-02],+0A     CMP r/m,imm              3/10
            JGE stop                     Jcc (not taken/taken)    4/13
            MOV AX,[BP-02]               MOV r/m to reg           2/9
            SHL AX,1                     Shift reg                2
            MOV [BP-04],AX               MOV reg to r/m           2/12
            INC WORD PTR [BP-02]         INC r/m                  3/15
            JMP start                    JMP                      14
    stop:   RET                          RET                      16

Program contains
    Loop control instructions
    ALU instructions
    Setup/takedown instructions (run once)


Program Run Time

    Label   Instruction                  8086 Clock Cycles (CC)
            MOV WORD PTR [BP-02],0000    13 CC (runs once)
    start:  CMP WORD PTR [BP-02],+0A     10 CC on each loop
            JGE stop                     4 CC on all loops but last, 13 CC on last
            MOV AX,[BP-02]               9 CC on all loops but last
            SHL AX,1                     2 CC on all loops but last
            MOV [BP-04],AX               12 CC on all loops but last
            INC WORD PTR [BP-02]         15 CC on all loops but last
            JMP start                    14 CC on all loops but last
    stop:   RET                          16 CC (runs once)

N = number of loop iterations

Total clock cycles = 13 + N × 10 + (N - 1) × (4 + 9 + 2 + 12 + 15 + 14) + 13 + 16
                   = 66 × N - 14

For N = 11 (stop on i = 10), Total CC = 712
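The cycle count for this listing can be tallied instruction by instruction, as in the sketch below. The function name `total_cycles_memory_version` is made up for illustration. The per-instruction costs are the ones quoted in the timing table.

```c
#include <assert.h>

/* Total 8086 clock cycles for the memory-variable loop.
 * n = number of times the CMP at "start" executes (11 for i = 0..10). */
long total_cycles_memory_version(long n)
{
    assert(n >= 1);
    const long setup         = 13;  /* MOV WORD PTR [BP-02],0000, once   */
    const long cmp           = 10;  /* CMP WORD PTR [BP-02],+0A, each    */
    const long jge_not_taken = 4;   /* JGE on all passes but the last    */
    const long jge_taken     = 13;  /* JGE on the last pass              */
    const long body          = 9 + 2 + 12 + 15 + 14; /* MOV, SHL, MOV,
                                                        INC, JMP          */
    const long ret           = 16;  /* RET, once                          */

    return setup + n * cmp + (n - 1) * (jge_not_taken + body)
         + jge_taken + ret;
}
```

Calling `total_cycles_memory_version(11)` returns 712, the same value as 66 × 11 - 14.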


Example: More Efficient Compilation

Store Variables in Registers, Not Memory

    Label   Instruction    8086 Clock Cycles (CC)
            MOV SI,0000    4 CC (runs once)
    start:  CMP SI,+0A     3 CC on each loop
            JGE stop       4 CC on all loops but last, 13 CC on last
            MOV AX,SI      2 CC on all loops but last
            SHL AX,1       2 CC on all loops but last
            MOV DI,AX      2 CC on all loops but last
            INC SI         3 CC on all loops but last
            JMP start      14 CC on all loops but last
    stop:   RET            16 CC (runs once)

Total clock cycles = 4 + N × 3 + (N - 1) × (4 + 2 + 2 + 2 + 3 + 14) + 13 + 16
                   = 30 × N + 6

For N = 11 (stop on i = 10), Total CC = 336

$$ S = \frac{712}{336} \approx 2.12 $$

Using register variables requires a large number of registers


Example: Even More Efficient Compilation

Rebuild Loop

    Label   Instruction    Type                     8086 Clock Cycles
            MOV SI,0000    MOV imm to reg           4
    start:  MOV AX,SI      MOV reg to reg           2
            SHL AX,1       SHIFT reg                2
            MOV DI,AX      MOV reg to reg           2
            INC SI         INC reg                  3
            CMP SI,+0A     CMP reg,imm              3/10
            JL start       Jcc (not taken/taken)    4/13
    stop:   RET            RET                      16

Total clock cycles = 4 + N × (2 + 2 + 2 + 3 + 3) + (N - 1) × 13 + 4 + 16
                   = 25 × N + 11

For N = 10 (stop on i = 10), Total CC = 261

$$ S = \frac{712}{261} \approx 2.73 $$
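The two register-based cycle formulas and their speedups over the 712-cycle memory version can be evaluated with a short C sketch. All function names are illustrative. Evaluating 30 × N + 6 at N = 11 gives 336, and 25 × N + 11 at N = 10 gives 261.

```c
#include <assert.h>

/* Register-variable version: 4 + 3N + (N - 1) * 27 + 13 + 16 = 30N + 6 */
long cycles_register_version(long n)
{
    assert(n >= 1);
    return 4 + n * 3 + (n - 1) * (4 + 2 + 2 + 2 + 3 + 14) + 13 + 16;
}

/* Rebuilt loop (test at the bottom): 4 + 12N + (N - 1) * 13 + 4 + 16
 *                                  = 25N + 11 */
long cycles_rebuilt_loop(long n)
{
    assert(n >= 1);
    return 4 + n * (2 + 2 + 2 + 3 + 3) + (n - 1) * 13 + 4 + 16;
}

/* Speedup of a new cycle count relative to an old one at the same
 * clock rate. */
double cycle_speedup(long old_cycles, long new_cycles)
{
    assert(new_cycles > 0);
    return (double)old_cycles / (double)new_cycles;
}
```

Moving the loop test to the bottom removes the unconditional JMP from every pass, which is where most of the extra saving comes from.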


Measuring Performance


Benchmarks

Definition
    Collection of programs for measurement and comparison of system performance
Requirements
    Standard and scientific
        Consistent result on repeated tests
        Consistent result by anyone repeating tests
    Test system in realistic way
        Reflect statistically representative use of
            Instruction types
            Data types
            Loop length
            OS and compiler conditions
    Summarize data so comparisons make sense


SPEC Benchmark

Programs for system performance measurement + comparison
    Standard + repeatable
    Test system for realistic conditions
    Summary score for easy comparison
    Results posted at http://www.spec.org/
Specific test suites
    Cint: CPU integer instructions
    Cfp: CPU FP instructions
    Performance as file server, web server, mail server, graphics
Updated every few years to reflect realistic conditions
    Based on current statistical distributions of computing tasks
    Current CPU test version: 2017
    Previous version: 2006
Reports speedup
    Run time compared with a standard machine


How SPEC Works

User runs n programs on test machine
    Records run-time conditions
    Records program run time in seconds

$$ T_i^{test}, \quad i = 1, 2, \ldots, n $$

SPEC provides run times on reference machine
    Sun Fire V490
    2100 MHz UltraSPARC-IV+ processor
    Powerful symmetric multiprocessing (SMP) server (2006 – 2014)

$$ T_i^{ref} $$

User calculates speedup for each program

$$ S_i = \frac{T_i^{ref}}{T_i^{test}}, \quad i = 1, 2, \ldots, n $$

User calculates geometric mean of speedups

$$ S_{\text{test machine on ref}} = \left[ \prod_{i=1}^{n} \left( \frac{T_i^{ref}}{T_i^{test}} \right) \right]^{1/n} $$

$$ S_{\text{machine A compared to machine B}} = \frac{S_{\text{machine A on ref}}}{S_{\text{machine B on ref}}} $$
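The geometric mean of the per-program speedups can be computed as in the sketch below. This is not SPEC's own tooling, and the names `scaled_product` and `geometric_mean` are hypothetical. To stay within the core language it avoids `math.h` and finds the mean by bisection, using the fact that the product of S_i / g equals 1 exactly when g is the geometric mean. The usual one-liner with `exp` and `log` would be equally valid.

```c
#include <assert.h>
#include <stddef.h>

/* Product of s[i] / g over all i; equals 1 when g is the geometric
 * mean, is above 1 when g is too small, below 1 when g is too large. */
static double scaled_product(const double *s, size_t n, double g)
{
    double p = 1.0;
    for (size_t i = 0; i < n; i++)
        p *= s[i] / g;
    return p;
}

/* Geometric mean of n positive speedups, found by bisection between
 * the smallest and largest value (the mean always lies in between). */
double geometric_mean(const double *s, size_t n)
{
    assert(s != NULL && n > 0);
    double lo = s[0], hi = s[0];
    for (size_t i = 0; i < n; i++) {
        assert(s[i] > 0.0);
        if (s[i] < lo) lo = s[i];
        if (s[i] > hi) hi = s[i];
    }
    for (int iter = 0; iter < 100; iter++) {
        double mid = 0.5 * (lo + hi);
        if (scaled_product(s, n, mid) > 1.0)
            lo = mid;   /* mid is below the geometric mean */
        else
            hi = mid;   /* mid is at or above it           */
    }
    return 0.5 * (lo + hi);
}
```

The geometric mean is used because a ratio of two such means is itself the geometric mean of the per-program ratios, which is what makes the comparison of machine A with machine B through the reference machine consistent.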


Typical Reference Run Times

Cint2017 Programs

    Program            Language   KLOC    Application                                                      Ref Run Time (s)
    600.perlbench_s    C          362     Perl interpreter                                                 1773
    602.gcc_s          C          1,304   GNU C compiler                                                   3982
    605.mcf_s          C          3       Route planning                                                   4709
    620.omnetpp_s      C++        134     Discrete event simulation - computer network                     1630
    623.xalancbmk_s    C++        520     XML to HTML conversion via XSLT                                  1413
    625.x264_s         C          96      Video compression                                                1770
    631.deepsjeng_s    C++        10      Artificial Intelligence: alpha-beta tree search (Chess)          1434
    641.leela_s        C++        21      Artificial Intelligence: Monte Carlo tree search (Go)            1706
    648.exchange2_s    Fortran    1       Artificial Intelligence: recursive solution generator (Sudoku)   2948
    657.xz_s           C          33      General data compression                                         6188

KLOC = 1000 lines of code


Typical SPEC Report (1)

Base = standard configuration
Peak = specialist configuration

SPEC(R) CPU2017 Integer Speed Result
ASUSTeK Computer Inc.
ASUS RS700-E9(Z11PP-D24) Server System (2.70 GHz, Intel Xeon Gold 6150)

    CPU2017 License: 9016                     Test date:             Dec-2017
    Test sponsor:    ASUSTeK Computer Inc.    Hardware availability: Jul-2017
    Tested by:       ASUSTeK Computer Inc.    Software availability: Sep-2017

                       Base     Base        Base      Peak     Peak        Peak
    Benchmarks         Thrds    Run Time    Ratio     Thrds    Run Time    Ratio
    ---------------    -----    --------    -----     -----    --------    -----
    600.perlbench_s    72       286         6.22      72       239         7.42
    602.gcc_s          72       423         9.42      72       413         9.65
    605.mcf_s          72       426         11.1      72       421         11.2
    620.omnetpp_s      72       257         6.35      72       248         6.58
    623.xalancbmk_s    72       150         9.46      72       140         10.1
    625.x264_s         72       150         11.8      72       150         11.8
    631.deepsjeng_s    72       280         5.11      72       282         5.08
    641.leela_s        72       393         4.34      72       392         4.36
    648.exchange2_s    72       220         13.4      72       220         13.4
    657.xz_s           72       280         22.1      72       277         22.3

    SPECspeed2017_int_base    8.87
    SPECspeed2017_int_peak    9.16


Typical SPEC Report (2)

HARDWARE
--------
    CPU Name:       Intel Xeon Gold 6150
    Max MHz.:       3700
    Nominal:        2700
    Enabled:        36 cores, 2 chips
    Orderable:      1, 2 chip(s)
    Cache L1:       32 KB I + 32 KB D on chip per core
    L2:             1 MB I+D on chip per core
    L3:             24.75 MB I+D on chip per chip
    Other:          None
    Memory:         768 GB (24 x 32 GB 2Rx4 PC4-2666V-R)
    Storage:        1 x 240 GB SATA SSD
    Other:          None

SOFTWARE
--------
    OS:             Red Hat Enterprise Linux Server release 7.3 (x86_64) Kernel 3.10.0-514.el7.x86_64
    Compiler:       C/C++: Version 18.0.0.128 of Intel C/C++ Compiler; Fortran: Version 18.0.0.128 of Intel Fortran Compiler
    Parallel:       Yes
    Firmware:       Version 0601 released Oct-2017
    File System:    xfs
    System State:   Run level 3 (multi-user)
    Base Pointers:  64-bit
    Peak Pointers:  32/64-bit
    Other:          jemalloc: jemalloc memory allocator library V5.0.1


Some Cint2017 Results

    Processor                                       Clock (GHz)   Total Chips   Total Cores   Total Threads   Cint 2017 Base   Cint 2006 Base   Ratio (2006 / 2017)
    Intel Xeon Gold 6146                            3.2           2             24            24              10.1             83.0             8.21
    Intel Xeon Gold 6146                            3.2           4             48            48              9.95             85.7             8.61
    Intel Xeon Platinum 8153                        2.0           4             64            64              7.00             62.8             8.97
    Intel Xeon Bronze 3104                          1.7           2             12            12              4.20             68.5             16.31
    Intel Xeon Platinum 8180                        2.5           8             224           224             9.37             81.6             8.71
    Intel Core 2 Duo E6850 with auto parallel       3.0           1             2             2               n/a              19.9             n/a
    Intel Core 2 Duo E6850 with no auto parallel    3.0           1             2             1               n/a              18.7             n/a


Some Comments on Cint2017 Results

Auto parallel
    High-level Cint code not threaded for parallel processing
    Auto parallel compiler creates parallel threads using heuristics
    Provides limited speedup (or even degradation)
    All CPU results in table use auto parallel except last
Intel Xeon Gold 6146 with 3.2 GHz clock
    Fastest CPU in Cint2017 tests
    2 chips (24 threads) slightly faster than 4 chips (48 threads)
        Communication between more threads can slow processing
    4 chips faster on Cint2006 (using different benchmark programs)
Intel Xeon Platinum 8153 with 2.0 GHz clock
    Cint with 64 threads = 7.00
    With 3.2 GHz clock, expect Cint = 7 × 3.2 GHz / 2.0 GHz = 11.2
    Not much better than Gold 6146 with 24 threads
Core 2 Duo E6850: old processor not tested on Cint2017
    Cint2006 with 1 thread (no auto parallel) = 18.7
    Cint2006 with 2 threads (auto parallel) = 19.9 = 6% speedup


Representative Cint2006 Results

    Sponsor             Processor                 Clock (GHz)   Auto Parallel   Total Chips   Total Cores   Total Threads   Base
    Hypertechnologies   Intel Core i7-5960X       4.5           Yes             1             8             8               79.7
    Supermicro          Intel Core i7-6700K       4.4           Yes             1             4             4               77.4
    NEC                 Intel Xeon E3-1270        3.6           Yes             1             4             4               74.2
    Huawei              Intel Xeon E5-2699        2.2           Yes             2             44            44              74.0
    Supermicro          Intel Core i5-6600        3.3           Yes             1             4             4               71.0
    Dell                Intel Xeon E5-2699        2.2           Yes             2             44            88              70.5
    Intel               Intel Core 2 Duo E6850    3.0           Yes             1             2             2               21.3
    Intel               Intel Core 2 Duo E6850    3.0           No              1             2             1               20.2
    Dell                Pentium 4 670             3.8           No              1             1             1               11.5
    Intel               Intel Pentium M 780       2.3           No              1             1             1               10.7


Actual Sources of Performance Improvement

1978: clock speed of 8086 is 4 MHz
2008: Xeon (clock speed of 4 GHz) is 100,000 times faster
    Clock speedup = 4 GHz / 4 MHz = 1000
    Structural speedup = 100,000 / 1000 = 100
        Reducing waiting time between operations
        Performing operations in parallel
No more clock speedup
    Pentium 4 clock rate (4 GHz) = 4 × Pentium III clock (1 GHz)
    Clock speedup 1 GHz → 4 GHz required structural slowdown
        Pentium 4 at 1 GHz slower than Pentium III at 1 GHz
        Run Pentium III at 4 GHz ⇒ melt CPU
    Clock speed → physical limit of about 10 GHz
        Signal takes clock cycle to cross Pentium 4 at speed of light
Future speedup comes from structural improvements
    More cores
    Better architectures
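The two numerical claims on this slide can be checked with a short calculation. This is a rough sketch: it takes the vacuum speed of light, c ≈ 3 × 10^8 m/s, as an upper bound on signal speed (real on-chip signals travel more slowly), and it uses the 10 GHz limit quoted on the slide.

```latex
\[
S_{\text{structural}} = \frac{S_{\text{total}}}{S_{\text{clock}}}
  = \frac{100{,}000}{4\ \text{GHz} / 4\ \text{MHz}}
  = \frac{100{,}000}{1000} = 100
\]
\[
d = c \times T_{\text{cycle}} = \frac{c}{f}
  \approx \frac{3 \times 10^{8}\ \text{m/s}}{10 \times 10^{9}\ \text{Hz}}
  = 0.03\ \text{m} = 3\ \text{cm}
\]
```

Under these assumptions even an ideal signal covers only a few centimetres per clock cycle at 10 GHz, a distance on the same order as chip dimensions.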


Instruction Set Architecture

Choosing Ingredients for a Computer Design


Chapter Overview

What is a processor
    Von Neumann structure
Stages of processor design
    The instruction set
    Instruction structure
        Operands (data)
            Data storage and memory types
        Operations
Considerations for instruction set design
    Complex language (CISC)
Implementing instructions in hardware
    Microcode


Von Neumann Architecture

Stored-Program Digital Computer
• Digital computation in ALU
• Programmable via set of standard instructions
• Internal storage of data
• Internal storage of program
• Automatic input/output
• Automatic sequencing of instruction execution by decoder/controller

[Figure: block diagram with input, memory, output, Arithmetic Logic Unit (ALU), and controller, linked by a data/instruction path and a control path]

Von Neumann Architecture
    Data and instructions stored in a single memory unit
Harvard Architecture
    Data and instructions stored in separate memory units


Stages in Computer Design

Instruction Set Architecture (ISA)
1. Look at universe of problems to be solved
2. Define atomic operations at level of system programmer
    • Small and orthogonal operations (each performs different task)
    • Can be combined to perform any operation
3. Specify instruction set for machine language
    • Choose a minimum set of basic operations
    • Not too many ways to solve same problem

Implementation
1. Design machine as implementation of ISA
2. Evaluate theoretical performance
3. Identify performance problem areas
4. Improve processor efficiency


Instruction Features

Instruction
    Description of an operation performed on operands
Operations
    Actions performed on data
Operands
    Sources: data inputs to operation
    Destinations: data outputs from operation
    Specified by
        Addressing mode: location of data in machine
        Data type: integer, long, floating point, decimal, string, constant, etc.


Instruction Set Architecture

General instruction is instance of data structure

    | Operation | Operand | Operand | ... | Operand |

Machine language is range of data structure
    Instruction for Operation ∈ {legal actions}
    Operand ∈ {legal addressing modes}
        Describe sources and destinations
Typical machine instruction
    ADD destination, source_1, source_2
    destination ← source_1 + source_2
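The idea that an instruction is an instance of a data structure can be made concrete with a toy C model. Everything here is hypothetical and chosen for illustration: the two-operation set, the register-only operands, the 8-entry register file, and the names `instruction` and `execute`. It does not describe any real machine encoding.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 8   /* size of the toy register file */

/* Operation field: must be one of the legal actions. */
enum operation { OP_ADD, OP_SUB };

/* One instruction = operation + operands (register numbers here). */
struct instruction {
    enum operation op;
    uint8_t dest;    /* destination operand */
    uint8_t src1;    /* first source operand */
    uint8_t src2;    /* second source operand */
};

/* Perform the operation: destination <- source_1 (op) source_2. */
void execute(const struct instruction *ins, int32_t regs[NUM_REGS])
{
    assert(ins->dest < NUM_REGS && ins->src1 < NUM_REGS &&
           ins->src2 < NUM_REGS);
    switch (ins->op) {
    case OP_ADD:
        regs[ins->dest] = regs[ins->src1] + regs[ins->src2];
        break;
    case OP_SUB:
        regs[ins->dest] = regs[ins->src1] - regs[ins->src2];
        break;
    }
}
```

With registers R1 = 5 and R2 = 7, executing `{ OP_ADD, 0, 1, 2 }` leaves 12 in R0, which mirrors `ADD destination, source_1, source_2`.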


Instruction Definitions

Operations and operands
    Unary: one source operand
    Binary: two source operands
    n-ary: n source operands
Address specifier
    Describes address format
        Addressing mode
        Operation model
Data width

    Size       Intel       Non-Intel
    2 bytes    word        half-word
    4 bytes    dword       word
    8 bytes    quadword    doubleword


Memory Hierarchy

Long-Term Storage | Main Memory (RAM) | Cache | Register

Register: current data
    Memory location inside CPU
    Fast access to small amount of information
    Organized by CPU
Cache: next few instructions and data
    Memory location in or near CPU
    Fast access to important data and instructions from RAM
    Copy of RAM section
Main memory (RAM): running programs and data
    Memory location outside CPU
    Stores "all" data and instructions of running programs
    Organized by addresses
Long-term storage: all files and data
    Memory locations outside CPU and RAM
    Stores data and instructions of "all" programs
    Organized by OS


Register NamingRegisters are part of CPU design

Information stored in registers called architectural stateDescribes machine status and program status

General Purpose (GP) registersHold data for instructions

Width of data is width of standard integer in CPU

Referenced by names or numbersIntel x86: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIPGeneral: R0, R1, … , R127

Special Purpose registersMachine status registersOperating system registers

Flat Memory Organization
    N-bit address space
        Physical Address = A_{N-1} A_{N-2} … A_1 A_0
        Can form 2^N addresses, from 0 to (2^N - 1)
    Every byte in RAM has N-bit address
    Processor refers to memory locations by physical RAM addresses
    Processor stores memory addresses in N-bit address registers

        Address        Memory Location
        11111…111      Data Byte
        11111…110      Data Byte
        11111…101      Data Byte
        11111…100      Data Byte
        …              …
        00000…111      Data Byte
        00000…110      Data Byte
        00000…101      Data Byte
        00000…100      Data Byte
        00000…011      Data Byte
        00000…010      Data Byte
        00000…001      Data Byte
        00000…000      Data Byte

[Figure: the CPU holds memory addresses in an N-bit register]

Word Organization in Memory
    Word order
        Little endian
            Least significant byte stored at lower address
            Word is stored "little-end-first"
            Example: 4-byte word 69 b3 36 7d stored as
                address (descending)                       07 06 05 04 03 02 01 00
                stored bytes (highest address on the left) 69 b3 36 7d
        Big endian
            Most significant byte stored at lower address
            Word is stored "big-end-first"
            Example: 4-byte word 69 b3 36 7d stored as
                address (descending)                       07 06 05 04 03 02 01 00
                stored bytes (highest address on the left) 7d 36 b3 69
    Alignment
        Requirement that address of s-byte data unit be a multiple of s
        Formally: address A % s = 0
        8086 requires segments to be aligned on 16-byte boundaries
        IA-32 requires pages to be aligned on 4 KB boundaries
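The two byte orders and the alignment rule can be checked with a short Python sketch. It is only an illustration: the helper name `is_aligned` and the sample addresses are assumptions, not part of the slides. In each `bytes` object, index 0 corresponds to the lowest memory address.

```python
WORD = 0x69B3367D  # the 4-byte example word 69 b3 36 7d

# Index 0 of each bytes object is the byte stored at the lowest address.
little = WORD.to_bytes(4, byteorder="little")  # least significant byte first
big = WORD.to_bytes(4, byteorder="big")        # most significant byte first


def is_aligned(address: int, size: int) -> bool:
    """Return True when an s-byte data unit at `address` satisfies A % s == 0."""
    return address % size == 0


if __name__ == "__main__":
    print("little endian, lowest address first:", little.hex(" "))
    print("big endian, lowest address first:   ", big.hex(" "))
    print("0x1000 aligned for 4 bytes:", is_aligned(0x1000, 4))
    print("0x1002 aligned for 4 bytes:", is_aligned(0x1002, 4))
```

Reading the little-endian bytes back with `int.from_bytes(little, "little")` recovers the original word, which is a quick way to confirm which order a buffer uses.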

Specifying Operands
    Immediate
        Constant = IMM = numerical value coded into instruction
    Register operands
        register name = a CPU storage location
        REGS[register name] = data stored in register
        REGS[R3] = data stored in register R3 = 11223340
    Memory operands
        address = a memory storage location
        MEM[address] = data stored in memory
        MEM[11223344] = data stored at address 11223344 = 45
    Effective Address (EA): pointer arithmetic
        REGS[R3] ← &(variable)
        MEM[REGS[R3]+4] = *(&(variable)+4) = *(REGS[R3]+4) = *(11223340+4) = 45

[Figure: register R3 holds 11223340; memory address 11223344 holds 45]
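The REGS and MEM notation maps directly onto two lookup tables. The sketch below is a minimal model under stated assumptions: the slide's values are read as hexadecimal, and the function name `effective_address` is hypothetical.

```python
# Register file and memory modeled as dictionaries.
# Assumption: the slide's values 11223340, 11223344 and 45 are hexadecimal.
REGS = {"R3": 0x11223340}   # register name -> data stored in register
MEM = {0x11223344: 0x45}    # address -> data stored in memory


def effective_address(base_reg: str, displacement: int) -> int:
    """EA = REGS[base_reg] + displacement (pointer arithmetic)."""
    return REGS[base_reg] + displacement


ea = effective_address("R3", 4)   # REGS[R3] + 4
value = MEM[ea]                   # MEM[REGS[R3] + 4]

if __name__ == "__main__":
    print(f"EA = {ea:08x}, MEM[EA] = {value:02x}")
```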

Structured Operation Models
    Defines basic arithmetic procedure and ALU organization
    Stack
        Z = X + Y →  push X
                     push Y
                     ADD
                     pop Z
        Push:       Pointer ← Pointer - d
                    Stack[Pointer] ← memory/register
        Pop:        memory/register ← Stack[Pointer]
                    Pointer ← Pointer + d
        Binary Op:  Stack[Pointer + d] ← Stack[Pointer + d] Op Stack[Pointer]
                    Pointer ← Pointer + d
        Stack ALU used in Java bytecode
    Accumulator
        All operations use accumulator A
        Z = X + Y →  load X
                     add Y
                     store Z
        Accumulator ALU used in hand calculator
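A rough way to compare the stack and accumulator models is to interpret the two instruction sequences for Z = X + Y. The interpreters below are illustrative sketches only: the function names, the tuple encoding of instructions, and the use of a Python list in place of a pointer-addressed stack are all assumptions.

```python
def run_stack(program, memory):
    """Interpret push/add/pop instructions on a stack machine."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])      # Stack[Pointer] <- memory
        elif op == "pop":
            memory[args[0]] = stack.pop()      # memory <- Stack[Pointer]
        elif op == "add":
            top = stack.pop()                  # binary op consumes the top entry
            stack[-1] = stack[-1] + top        # and overwrites the entry below it
        else:
            raise ValueError(f"unknown stack operation: {op}")
    return memory


def run_accumulator(program, memory):
    """Interpret load/add/store instructions that all use accumulator A."""
    acc = 0
    for op, name in program:
        if op == "load":
            acc = memory[name]
        elif op == "add":
            acc = acc + memory[name]
        elif op == "store":
            memory[name] = acc
        else:
            raise ValueError(f"unknown accumulator operation: {op}")
    return memory


if __name__ == "__main__":
    stack_prog = [("push", "X"), ("push", "Y"), ("add",), ("pop", "Z")]
    acc_prog = [("load", "X"), ("add", "Y"), ("store", "Z")]
    print(run_stack(stack_prog, {"X": 2, "Y": 3}))
    print(run_accumulator(acc_prog, {"X": 2, "Y": 3}))
```

Both programs leave the same value in Z; the stack version needs four instructions with no operand on ADD, while the accumulator version needs three with one memory operand each.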

General Register Operation Models
    Register-Memory Model
        Operands can be stored in any REGISTER or MEMORY location
        Z = X + Y →  load R1, X
                     add R1, R1, Y
                     store Z, R1
    Register-Register Model
        MEMORY operands must be loaded to a REGISTER
        Also called LOAD-STORE MODEL
        Z = X + Y →  load R1, X
                     load R2, Y
                     add R1, R1, R2
                     store Z, R1
        Easier to implement
        Statistically, most loaded operands are used more than once

Typical Addressing Modes

    Mode                   Syntax        Memory Access                            Use
    Register               R3            Regs[R3]                                 Register data
    Immediate              #3            3                                        Constant
    Direct (absolute)      (1001)        Mem[1001]                                Static data
    Register deferred      (R1)          Mem[Regs[R1]]                            Pointer
    Displacement           100(R1)       Mem[100+Regs[R1]]                        Local variable
    Indexed                (R1 + R2)     Mem[Regs[R1]+Regs[R2]]                   Array addressing
    Memory indirect        @(R3)         Mem[Mem[Regs[R3]]]                       Pointer to pointer
    Auto Increment         (R2)+         Mem[Regs[R2]]; Regs[R2] ← Regs[R2]+d     Stack access
    Auto Decrement         -(R2)         Regs[R2] ← Regs[R2]-d; Mem[Regs[R2]]     Stack access
    Scaled                 100(R2)[R3]   Mem[100+Regs[R2]+Regs[R3]*d]             Indexing arrays
    PC-relative            (PC)          Mem[PC+value]                            Store data relative to program counter (instruction address)
    PC-relative deferred   1001(PC)      Mem[PC+Mem[1001]]                        Store data relative to program counter (instruction address)
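The "Memory Access" column of the addressing-mode table can be read as executable rules. The following sketch covers the register-based modes only (the PC-relative modes are left out); the function name `operand`, the keyword names, and the data size `D = 4` are assumptions made for illustration.

```python
D = 4  # assumed data size d in bytes


def operand(mode, regs, mem, **f):
    """Return the operand value selected by an addressing mode.

    regs maps register names to values and mem maps addresses to values.
    Auto increment/decrement also update the register, as in the table.
    """
    if mode == "register":
        return regs[f["reg"]]
    if mode == "immediate":
        return f["value"]
    if mode == "direct":
        return mem[f["addr"]]
    if mode == "register_deferred":
        return mem[regs[f["reg"]]]
    if mode == "displacement":
        return mem[f["disp"] + regs[f["reg"]]]
    if mode == "indexed":
        return mem[regs[f["reg"]] + regs[f["index"]]]
    if mode == "memory_indirect":
        return mem[mem[regs[f["reg"]]]]
    if mode == "scaled":
        return mem[f["disp"] + regs[f["reg"]] + regs[f["index"]] * D]
    if mode == "auto_increment":
        value = mem[regs[f["reg"]]]      # access first
        regs[f["reg"]] += D              # then increment
        return value
    if mode == "auto_decrement":
        regs[f["reg"]] -= D              # decrement first
        return mem[regs[f["reg"]]]       # then access
    raise ValueError(f"unsupported addressing mode: {mode}")


if __name__ == "__main__":
    regs = {"R1": 100, "R2": 8, "R3": 2}
    mem = {8: 21, 100: 200, 108: 11, 116: 13, 200: 9, 1001: 7}
    print(operand("displacement", regs, mem, disp=100, reg="R1"))
    print(operand("scaled", regs, mem, disp=100, reg="R2", index="R3"))
```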

Typical Operations
    Data transfer
        Load (r ← m), store (m ← r), move (r/m ← r/m), convert data types
    Arithmetic/Logical (ALU)
        Integer arithmetic (+ - × ÷ compare shift) and logical (AND, OR, NOR, XOR)
    Decimal
        Integer arithmetic on decimal numbers
    Floating point (FPU)
        Floating point arithmetic (+ - × ÷ sqrt trig exp …)
    String
        String move, string compare, string search
    Control
        Conditional and unconditional branch, call/return, trap
    Operating System
        System calls, virtual memory management instructions
    Graphics
        Pixel operations, compression/decompression operations


Classic Computer Organization

Considerations in Classic Computer Design
    Expensive memory
        RAM ~ $5000/MB wholesale in 1977
    Poor compilers
        Non-optimizing
        Bad error messaging
        Fast code written or optimized in assembly language
    Semantic Gap Argument
        Belief among theoreticians in 1960s and 1970s
        Computer language should imitate natural language
            Large vocabulary
            High redundancy

Implications for Machine Language
    Machine Language should be high level
        Language defines many instructions
        Each instruction performs a lot of work
        Language defines many addressing modes
    Advantages
        Assembly language programming is easier
        Each stored instruction in memory more powerful
        More power per instruction requires less memory


Classic Machine Design

CISC (Complex Instruction Set Computer)

300+ instruction types

15+ addressing modes

10+ data types

Automated procedure handling

Complex machine implementations

CISC
    CISC was conventional wisdom in 1960s and 1970s
    Mainframes
        Large and expensive computers
        Owned by big businesses and governments
        Manufacturers: IBM, Control Data, Burroughs, Honeywell
        From 1960s to 1980s, mainframes were CISC machines
    Minicomputers
        Smaller computers for smaller organizations
        Manufacturers: Digital (PDP/VAX), Data General (Eclipse)
        Promoted academic computer science, smaller operating systems (Unix), computer networking
    Microcomputers
        Intel designed the 8086 (1978) to work like a tiny VAX
        The PC is the only CISC computer still manufactured

Physical Implementation

[Figure: CPU block diagram. Registers, PC, IR, Decoder, Status Word, MAR, MDR, and the ALU Subsystem (IN, OUT, ALU Operation, ALU Result Flag) are connected by the System Bus; MAR and MDR connect to the Address and Data ports of Main Memory; the Decoder drives the control lines.]

    PC: program counter
    IR: instruction register
    MAR: memory address register
    MDR: memory data register

Registers
    General Registers
        R0 … Rn-1
        Register width is standard integer in ISA
    PC
        Program Counter
        Holds address of next instruction to execute
    IR
        Instruction Register
        Holds binary code of instruction being executed
    MAR
        Memory Address Register
        Holds physical address for RAM access
    MDR
        Memory Data Register
        Holds data during Read/Write memory operations

Device Communication
    A device WRITES with OE = 1 and READS with IE = 1
    Von Neumann controller distributes OE and IE signals to devices
    Bus: A vehicle for carrying many passengers

[Figure: Device 1, Device 2, and Device 3 each connect to the System Bus through a Write port controlled by OE and a Read port controlled by IE. In the example transfer, Device A writes (OE) and Device B reads (IE).]

Atomic Operations: Instruction Fetch
    (1) MAR ← PC
    (2) READ
    (3) IR ← MDR
    (4) PC ← PC + length(instruction)
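The four fetch steps can be traced with a small model of the registers involved. This is a sketch under assumptions: the class name `CPU`, the dictionary used for memory, and the fixed 4-byte instruction length are illustrative choices, not details given in the slides.

```python
INSTRUCTION_LENGTH = 4  # assumption: every instruction occupies 4 bytes


class CPU:
    """Minimal register model for the instruction fetch sequence."""

    def __init__(self, memory, pc=0):
        self.memory = memory   # address -> stored instruction
        self.PC = pc           # program counter
        self.IR = None         # instruction register
        self.MAR = None        # memory address register
        self.MDR = None        # memory data register

    def read(self):
        """READ: memory places the contents of address MAR into MDR."""
        self.MDR = self.memory[self.MAR]

    def fetch(self):
        self.MAR = self.PC                        # (1) MAR <- PC
        self.read()                               # (2) READ
        self.IR = self.MDR                        # (3) IR <- MDR
        self.PC = self.PC + INSTRUCTION_LENGTH    # (4) PC <- PC + length(instruction)
        return self.IR


if __name__ == "__main__":
    cpu = CPU({0: "SUB R1, R2, 100(R3)", 4: "ADD R1, R2, R3"})
    print(cpu.fetch(), "| next PC =", cpu.PC)
    print(cpu.fetch(), "| next PC =", cpu.PC)
```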

(1) MAR ← PC
[Figure: CPU block diagram for this fetch step]

(2) READ
[Figure: CPU block diagram for this fetch step]

(3) IR ← MDR
[Figure: CPU block diagram for this fetch step]

(4) PC ← PC + length(instruction)
[Figure: CPU block diagram for this fetch step]

Atomic Operations
    Instruction: SUB R1, R2, 100(R3)
        ALU_IN ← R3
        ALU ← 100
        ADD
        MAR ← OUT
        READ
        ALU_IN ← MDR
        ALU ← R2
        SUB
        R1 ← OUT
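Each atomic operation in this sequence corresponds to one assignment, so the whole instruction can be traced line by line. The sketch below is an illustration with assumed names and sample values; the operand order of the subtraction (R2 minus the memory operand) follows the result label R2-100(R3) used in the figures for this instruction.

```python
def sub_r1_r2_100_r3(regs, mem):
    """Trace the atomic operations for SUB R1, R2, 100(R3)."""
    alu_in = regs["R3"]      # ALU_IN <- R3
    alu = 100                # ALU <- 100
    out = alu + alu_in       # ADD: effective address 100 + R3
    mar = out                # MAR <- OUT
    mdr = mem[mar]           # READ: MDR <- MEM[MAR]
    alu_in = mdr             # ALU_IN <- MDR
    alu = regs["R2"]         # ALU <- R2
    out = alu - alu_in       # SUB: R2 - MEM[100 + R3]
    regs["R1"] = out         # R1 <- OUT
    return regs


if __name__ == "__main__":
    registers = {"R1": 0, "R2": 50, "R3": 1000}   # sample values (hypothetical)
    memory = {1100: 8}                            # MEM[100 + R3] = 8
    print(sub_r1_r2_100_r3(registers, memory))
```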

SUB R1, R2, 100(R3): ALU_IN ← R3
[Figure: CPU block diagram; value shown: R3]

SUB R1, R2, 100(R3): ALU ← 100
[Figure: CPU block diagram; values shown: R3, 100]

SUB R1, R2, 100(R3): ADD
[Figure: CPU block diagram; values shown: R3, 100, 100+R3]

SUB R1, R2, 100(R3): MAR ← OUT
[Figure: CPU block diagram; value shown: 100+R3]

SUB R1, R2, 100(R3): READ
[Figure: CPU block diagram; value shown: 100+R3]

SUB R1, R2, 100(R3): ALU_IN ← MDR
[Figure: CPU block diagram; value shown: (100+R3)]

SUB R1, R2, 100(R3): ALU ← R2
[Figure: CPU block diagram; values shown: (100+R3), R2]

SUB R1, R2, 100(R3): SUB
[Figure: CPU block diagram; values shown: (100+R3), R2, R2-100(R3)]

SUB R1, R2, 100(R3): R1 ← OUT
[Figure: CPU block diagram; value shown: R2-100(R3)]

Decoding Machine Instructions
    Machine Language Instruction
        SUB R1, R2, 100(R3)
    Microcode Instruction Sequence (Microprogram)
        ALU_IN ← R3
        ALU ← 100
        ADD
        MAR ← OUT
        READ
        ALU_IN ← MDR
        ALU ← R2
        SUB
        R1 ← OUT

Microcode
    Microcode
        One line of microprogram
        Implementation-level atomic operation
        Atomic ⇒ operation must complete before servicing interrupt
    Decoder
        "Interprets" machine language instruction into microprogram
        Decoder ROM stores microprogram for every legal instruction
        New instruction ⇒ add microprogram to decoder
    Microprogram is sequenced by decoder
        State machine for each instruction
        Each state provides control signals to every subsystem
        Each line of microcode is executed in the correct order
    Based on work of Maurice V. Wilkes (1951)

Clock Cycles Per Instruction
    Clock Cycle (CC)
        Determined by length of longest microcode operations
        One line of microcode finishes before next line begins
            Most microcode lines finish in one clock cycle
            Memory access may take several clock cycles
    Clock Cycles Per Instruction
        Machine language instruction implemented as lines of microcode
        Clock Cycles Per Instruction = number of microcode lines
            Memory accesses may take extra clock cycles
    Clock cycles for program = number of microcode lines in program

$$ CC_{\text{program}} = \sum_{i \,\in\, \text{instruction types}} (\text{Instructions of type } i) \times CC_{\text{Instruction of type } i} $$

    Instruction type: same basic microcode structure

CISC Creates Anti-CISC Revolution
    1974-1977
        Data General introduces Eclipse 32-bit CISC minicomputer
        Digital (DEC) introduces VAX 32-bit CISC minicomputer
        First serious inexpensive competition to mainframe computers
    1977-1990
        Serious computers became available to small organizations
        UNIX developed as minicomputer operating system
        TCP/IP developed to support networks of minicomputers
        Computer Science emerged as separate academic discipline
        Students needed topics for projects, theses, dissertations
    1980-1990
        Research results on minicomputer performance
            CISC uses machine resources inefficiently
            Most machine instructions are rarely used in programs
            CISC machines run slowly to support unnecessary features

Quantitative Performance Theory

Amdahl Equation for Multiprocessors
    Symmetric Multiprocessor (SMP)
        N equivalent microprocessors
        Communication network between processors
        Operating system runs on one or more processors
        OS assigns tasks to processors by some scheduling system
    Amdahl equation for SMP

$$ S = \frac{1}{(1 - F_P) + \dfrac{F_P}{N}} $$

        F_P = fraction of work that can be enhanced (parallelized)
        N = speedup for part to be enhanced (number of processors)

[Figure: Quad Core CPU. CPU 0, CPU 1, CPU 2, and CPU 3 each contain Architectural State, Execution Core, and Cache; they share Main Memory and the I/O Bus / PCI Bridge.]

Example of Amdahl Equation
    For multiprocessor system
        Typical small Dell file server
        N = 8 Xeon processors
        F_P = 80% of work can be parallelized

$$ S = \frac{1}{(1 - 0.80) + \dfrac{0.80}{8}} = \frac{1}{0.20 + 0.10} \cong 3.33 $$

    If number of processors were unlimited

$$ S = \frac{1}{(1 - F_P) + \dfrac{F_P}{N}} \;\xrightarrow{\,N \to \infty\,}\; \frac{1}{1 - F_P} = \frac{1}{1 - 0.80} = 5 $$

    Maximum speedup is 5
    Future enhancements require more parallelization F_P
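The file-server numbers and the unlimited-processor limit can be reproduced with a few lines of Python. This is a sketch: the function names `amdahl_smp` and `amdahl_limit` are assumptions introduced here for illustration.

```python
def amdahl_smp(f_p: float, n: float) -> float:
    """Speedup S = 1 / ((1 - F_P) + F_P / N) for N processors."""
    return 1.0 / ((1.0 - f_p) + f_p / n)


def amdahl_limit(f_p: float) -> float:
    """Limit of the speedup as N grows without bound: 1 / (1 - F_P)."""
    return 1.0 / (1.0 - f_p)


if __name__ == "__main__":
    print(f"N = 8:        S = {amdahl_smp(0.80, 8):.2f}")
    print(f"N = 64:       S = {amdahl_smp(0.80, 64):.2f}")
    print(f"N unlimited:  S = {amdahl_limit(0.80):.2f}")
```

Trying larger values of N shows the speedup creeping toward the limit without reaching it, which is why only a larger parallel fraction raises the ceiling.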

Basic Performance Measures
    Run Time (זמן ריצה)
        Elapsed time T from start to finish of a defined program task
    Latency (זמן המתנה)
        Excess response time; depends on context
    Throughput (תפוקה)
        Number of defined tasks performed per unit time

$$ \text{Throughput} = \frac{1}{T + \text{latency between tasks}} $$

    Enhancement (שינוי מבנה)
        Change to system ⇒ new run time T'
    Speedup (שיפור)

$$ S = \frac{T}{T'}, \qquad S > 1 \Rightarrow T' < T $$

Processor Performance Enhancements
    Hardware Enhancements
        Clock rate
        Instruction implementation
        Memory organization
        Number of processing elements (CPUs, ALUs, registers)
    Software Enhancements
        Run time optimizations
        Compiler
        Operating system
    Enhanced Run Time
        Run time = sum of partial run times
        Enhancement ⇒ partial run times are longer, shorter, or unchanged
        S > 1 ⇒ Sum of new partial run times < sum of old partial run times

Run Time Enhancements

[Figure: the total run time is split into a partial run time that can be enhanced and a partial run time that cannot be enhanced. After the enhancement, the enhanced total run time consists of the enhanced partial run time plus the unchanged partial run time.]

Amdahl Equation
    Definitions
        T     total run time of a task
        T'    total run time of a task after enhancement
        t_e   partial run time that can be enhanced
        t_e'  partial run time that can be enhanced, after enhancement
        t_0   partial run time that cannot be enhanced
        F_e   fraction of run time that can be enhanced = t_e / T
        S_e   speedup of portion of run time that can be enhanced = t_e / t_e'

$$ S = \frac{T}{T'} = \frac{T}{t_0 + t_e'} = \frac{1}{\dfrac{t_0}{T} + \dfrac{t_e'}{T}} = \frac{1}{\dfrac{T - t_e}{T} + \dfrac{t_e}{T}\cdot\dfrac{t_e'}{t_e}} = \frac{1}{(1 - F_e) + \dfrac{F_e}{S_e}} $$

    Amdahl equation expresses speedup in terms of relative quantities
    Actual run times not needed if RELATIVE ENHANCEMENTS are known

[Figure: before enhancement, T = t_0 + t_e; after enhancement, T' = t_0 + t_e']

Example of Amdahl Equation
    Program partial run times

        T        400 ms    Total Run Time
        t_FP     300 ms    Float Instructions
        t_INT    100 ms    Integer Instructions

    Enhance partial run time of Float Instructions

        T'       300 ms    Total Run Time
        t_FP'    200 ms    Float Instructions
        t_INT    100 ms    Integer Instructions

    Speedup from actual run times

$$ S = \frac{T}{T'} = \frac{400\ \text{ms}}{300\ \text{ms}} = \frac{4}{3} $$

    Speedup from relative enhancements

$$ F_e = \frac{t_{FP}}{T} = \frac{300\ \text{ms}}{400\ \text{ms}} = \frac{3}{4} = 75\% $$

$$ S_e = \frac{t_{FP}}{t_{FP}'} = \frac{300\ \text{ms}}{200\ \text{ms}} = \frac{3}{2} = 1.50 $$

$$ S = \frac{1}{(1 - F_e) + \dfrac{F_e}{S_e}} = \frac{1}{\left(1 - \dfrac{3}{4}\right) + \dfrac{3/4}{3/2}} = \frac{1}{\dfrac{1}{4} + \dfrac{2}{4}} = \frac{4}{3} $$
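Both routes to the speedup in this example can be checked with exact fractions. The variable names below are chosen for illustration only; the run times are the ones in the example.

```python
from fractions import Fraction

# Run times in milliseconds, taken from the example.
T, t_fp, t_int = 400, 300, 100
t_fp_new = 200                      # float instructions after enhancement
T_new = t_fp_new + t_int            # integer instructions are unchanged

# Route 1: speedup from actual run times.
s_direct = Fraction(T, T_new)

# Route 2: speedup from relative quantities only.
F_e = Fraction(t_fp, T)             # fraction of run time that can be enhanced
S_e = Fraction(t_fp, t_fp_new)      # speedup of the enhanced portion
s_relative = 1 / ((1 - F_e) + F_e / S_e)

if __name__ == "__main__":
    print("F_e =", F_e, " S_e =", S_e)
    print("S from run times:          ", s_direct)
    print("S from relative quantities:", s_relative)
```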

Application of Amdahl Equation
    On some CPU
        Float (FP) instructions account for 50% of total run time
        Square root (FP) accounts for 20% of total run time
    Choose between two alternative enhancements
        Speedup of S_e = 2 for all FP instructions
        Speedup of S_e = 10 for square root instruction
    Enhancement 1

$$ S_1 = \frac{1}{(1 - F_e) + \dfrac{F_e}{S_e}} = \frac{1}{(1 - 0.50) + \dfrac{0.50}{2}} = \frac{1}{0.50 + 0.25} \cong 1.33 \;\Rightarrow\; 33\%\ \text{speedup} $$

    Enhancement 2

$$ S_2 = \frac{1}{(1 - F_e) + \dfrac{F_e}{S_e}} = \frac{1}{(1 - 0.20) + \dfrac{0.20}{10}} = \frac{1}{0.80 + 0.02} \cong 1.22 \;\Rightarrow\; 22\%\ \text{speedup} $$
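Choosing between the alternatives amounts to evaluating the same formula for each candidate and taking the larger result. A minimal sketch, with the function name `amdahl` and the dictionary labels as illustrative assumptions:

```python
def amdahl(f_e: float, s_e: float) -> float:
    """Overall speedup when a fraction f_e of run time is sped up by s_e."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)


# Candidate enhancements as (fraction of run time affected, local speedup).
candidates = {
    "all FP instructions, S_e = 2": (0.50, 2.0),
    "square root only, S_e = 10": (0.20, 10.0),
}

speedups = {name: amdahl(f, s) for name, (f, s) in candidates.items()}
best = max(speedups, key=speedups.get)

if __name__ == "__main__":
    for name, s in speedups.items():
        print(f"{name}: S = {s:.2f}")
    print("better choice:", best)
```

The modest speedup applied to half of the run time beats the large speedup applied to a fifth of it, because the unaffected 80% dominates the second case.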

Generalized Amdahl Equation
    New definitions
        t_d   portion of run time that is degraded
        t_d'  portion of run time that is degraded, after degradation
        F_d   fraction of run time that is degraded = t_d / T
        S_d   speedup of portion of run time that is degraded = t_d / t_d'

$$ S = \frac{T}{T'} = \frac{T}{t_0 + t_e' + t_d'} = \frac{1}{\dfrac{T - t_e - t_d}{T} + \dfrac{t_e}{T}\cdot\dfrac{t_e'}{t_e} + \dfrac{t_d}{T}\cdot\dfrac{t_d'}{t_d}} = \frac{1}{1 - (F_e + F_d) + \dfrac{F_e}{S_e} + \dfrac{F_d}{S_d}} $$

    Result of reasonable architectural change
        Enhancements to most features
        Degradations to some features
        Overall enhancement

[Figure: before the change, T = t_0 + t_e + t_d; after the change, T' = t_0 + t_e' + t_d']

Amdahl's "Law"
    To make good architectural improvements
        Focus on enhancements that positively affect most features
        Ignore degradations that negatively affect few features
    Example: simple "RISC" processor
        94% of run time is 5 times faster than CISC processor
        1% of run time is 10 times slower than CISC processor
        5% of run time is same as for CISC processor

$$ S = \frac{1}{1 - (F_e + F_d) + \dfrac{F_e}{S_e} + \dfrac{F_d}{S_d}} = \frac{1}{1 - (0.94 + 0.01) + \dfrac{0.94}{5} + \dfrac{0.01}{1/10}} = \frac{1}{0.05 + 0.188 + 0.10} \cong 2.96 $$

    This RISC processor is (overall) about 3 times faster than CISC
        Even though some operations are slower
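The generalized equation treats a degradation as a speedup below 1, so one function can handle any mix of enhanced and degraded portions. The sketch below uses an assumed function name and input format; evaluating the RISC example without intermediate rounding gives about 2.96.

```python
def generalized_amdahl(parts):
    """Overall speedup for a list of (fraction_of_run_time, speedup) pairs.

    A speedup below 1 represents a degradation. Whatever fraction of the
    run time is not listed is assumed to be unchanged.
    """
    changed = sum(f for f, _ in parts)
    if not 0.0 <= changed <= 1.0:
        raise ValueError("fractions must sum to a value between 0 and 1")
    return 1.0 / ((1.0 - changed) + sum(f / s for f, s in parts))


# RISC example: 94% runs 5 times faster, 1% runs 10 times slower.
risc_speedup = generalized_amdahl([(0.94, 5.0), (0.01, 0.1)])

if __name__ == "__main__":
    print(f"overall speedup = {risc_speedup:.2f}")
```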

Detailed Analysis of CPU Run Times
    Amdahl equation requires relative run time data
        Run time data requires measurements on running programs
        Measurements on running programs require CPU implementation
    CPU analysis predicts run time without building CPU
    Assumptions:
        Instructions can be grouped together according to resource usage
            Example: ADD R1, R2, R3 and SUB R1, R2, R3
        All instructions in a group run in same number of clock cycles
        Every clock cycle measures same unit of time
        Instruction run time = clock cycle time × number of clock cycles
        Group run time = instruction run time × instructions in group
        Total run time = sum of instruction group run times

Page 89: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Definitions

  T         = total run time of program
  t_i       = total run time of instructions in group i
  IC_i      = number of instructions in group i (Instruction Count)
  CPI_i     = number of clock cycles to run 1 instruction in group i (Cycles Per Instruction)
  N_i       = number of clock cycles to run all instructions in group i
  τ         = seconds per clock cycle
  R = 1/τ   = clock rate = clock frequency = clock cycles per second = Hertz (Hz)
  IC        = total number of instructions in program
  N         = total number of clock cycles to run program
  CPI       = average number of clock cycles per instruction for the program
  quantity' = new value of quantity after architectural change

Page 90: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

CPU Equation

Clock cycles to run all instructions of type i

$$
N_i = IC_i \times CPI_i
    = (\text{instructions of type } i) \times \frac{\text{clock cycles}}{\text{instruction of type } i}
$$

Total clock cycles to run all instructions in program

$$
N = \sum_{i} N_i = \sum_{i} IC_i \times CPI_i \qquad (\text{sum over all groups})
$$

Average number of clock cycles per instruction for program

$$
CPI = \frac{N}{IC} = \frac{\text{total number of clock cycles to run program}}{\text{total number of instructions in program}}
$$

$$
CPI = \frac{1}{IC} \sum_{i} N_i = \frac{1}{IC} \sum_{i} IC_i \times CPI_i = \sum_{i} CPI_i \times \frac{IC_i}{IC}
$$

Ratio IC_i / IC is proportion (percent) of instructions in group i

$$
\sum_{i} \frac{IC_i}{IC} = 1
$$

so CPI is a weighted average of the CPI_i

Page 91: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example of CPU Equation

Program distribution

  Instruction Type i    IC_i     CPI_i
  Integer               5,000    4
  Load / Store          4,000    8
  Branch                1,000    12

  Instruction Type i    IC_i / IC            N_i = IC_i × CPI_i
  Integer               5000/10000 = 50%     4 × 5000 = 20,000 cycles
  Load / Store          4000/10000 = 40%     8 × 4000 = 32,000 cycles
  Branch                1000/10000 = 10%     12 × 1000 = 12,000 cycles

$$
IC = IC_{int} + IC_{branch} + IC_{load/store} = 5{,}000 + 1{,}000 + 4{,}000 = 10{,}000 \text{ instructions}
$$

$$
N = 20{,}000 + 12{,}000 + 32{,}000 = 64{,}000 \text{ cycles}
$$

$$
CPI = N / IC = 64{,}000 \text{ cycles} / 10{,}000 \text{ instructions} = 6.4 \text{ cycles per instruction}
$$

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 4 \times 0.50 + 12 \times 0.10 + 8 \times 0.40 = 6.4 \text{ cycles per instruction}
$$
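Both routes to the average CPI (total cycles over total instructions, and the weighted average) can be compared in a few lines. This is only a sketch; the dictionary layout and the name `average_cpi` are assumptions made for illustration.

```python
def average_cpi(groups):
    """groups maps a label to (instruction_count, cycles_per_instruction)."""
    total_instructions = sum(ic for ic, _ in groups.values())
    total_cycles = sum(ic * cpi for ic, cpi in groups.values())
    return total_cycles / total_instructions, total_instructions, total_cycles


program = {"integer": (5000, 4), "load/store": (4000, 8), "branch": (1000, 12)}

cpi, ic, n = average_cpi(program)
# Same quantity written as a weighted average of the per-group CPI values.
weighted = sum(cpi_i * ic_i / ic for ic_i, cpi_i in program.values())
print(ic, n, cpi)
```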

Page 92: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

CPU Run Time

Run time of one instruction of type i

$$
CPI_i \times \tau = \frac{\text{clock cycles}}{\text{instruction of type } i} \times \frac{\text{seconds}}{\text{clock cycle}}
$$

Run time for all instructions of type i

$$
t_i = IC_i \times CPI_i \times \tau
    = (\text{instructions of type } i) \times \frac{\text{clock cycles}}{\text{instruction of type } i} \times \frac{\text{seconds}}{\text{clock cycle}}
$$

Total run time for program

$$
T = \sum_i t_i = \sum_i CPI_i \times IC_i \times \tau
  = \left( \sum_i CPI_i \times \frac{IC_i}{IC} \right) \times IC \times \tau
  \qquad (\text{sum over all groups})
$$

So

$$
T = CPI \times IC \times \tau
  = (\text{clock cycles per instruction}) \times (\text{number of instructions}) \times (\text{seconds per clock cycle})
$$

Page 93: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

CPU Run Time: Example

For a certain CPU
  Instructions in a typical program can be grouped as
    50% integer ALU instructions that run in 8 clock cycles
    10% float ALU instructions that run in 20 clock cycles
    20% load instructions that run in 10 clock cycles
    10% store instructions that run in 15 clock cycles
    10% branch instructions that run in 10 clock cycles
  The clock speed is 100 MHz
  A typical program runs 1,000,000 instructions
    Running 500,000 ALU instructions, 100,000 FP instructions, 200,000 loads, …

The average number of cycles per instruction is

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 8 \times 0.5 + 20 \times 0.1 + 10 \times 0.2 + 15 \times 0.1 + 10 \times 0.1 = 10.5
$$

The typical program runs in

$$
T = CPI \times IC \times \tau = \frac{CPI \times IC}{R}
  = \frac{10.5 \text{ cycles/instruction} \times 10^{6} \text{ instructions}}{10^{8} \text{ Hz}} = 0.105 \text{ seconds}
$$
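The same example can be worked in code. A minimal sketch, assuming the instruction mix is given as (fraction, cycles) pairs; the function name `run_time_seconds` is hypothetical.

```python
def run_time_seconds(mix, instruction_count, clock_hz):
    """mix is a list of (fraction_of_instructions, cycles_per_instruction)."""
    cpi = sum(fraction * cycles for fraction, cycles in mix)
    # T = CPI x IC x tau, with tau = 1 / R
    return cpi, cpi * instruction_count / clock_hz


# integer ALU, float ALU, load, store, branch
mix = [(0.50, 8), (0.10, 20), (0.20, 10), (0.10, 15), (0.10, 10)]
cpi, seconds = run_time_seconds(mix, 1_000_000, 100e6)
print(cpi, seconds)
```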

Page 94: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

C Code to Runtime Example - 1

High level code

    int x = 0 , n = 0 , a[5] ;
    while ( n < 5 )
    {
        x = x + a[n] ;
        n++ ;
    }

Assembly program (after compile + optimize)

  Address  Instruction          Type    Runs   IC_i / IC
  1000     MOV R1, 0            load    1
  1002     MOV R2, 2000         load    1      13% (both loads)
  1004     ADD R1, R1, (R2)+    ALUAI   5      29%
  1008     CMP R2, 2020         ALU     5      29%
  1012     JL 1004              JMP     5      29%

  IC = 17                                      100%

Page 95: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

C Code to Runtime Example - 2

Assembly code

  ADD R1, R1, (R2)+

Microprogram (assembly instruction interpreted to microcode)

  ALU_IN, MAR ← R2
  ALU ← 4
  ADD
  R2 ← ALU_OUT
  READ
  ALU_IN ← MDR
  ALU ← R1
  ADD
  R1 ← ALU_OUT

CPI_ALU-autoinc = 9

Page 96: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

C Code to Runtime Example - 3

Assembly program

  Address  Instruction          Type    IC_i / IC   CPI_i
  1000     MOV R1, 0            load
  1002     MOV R2, 2000         load    13%         2
  1004     ADD R1, R1, (R2)+    ALUAI   29%         9
  1008     CMP R2, 2020         ALU     29%         3
  1012     JL 1004              JMP     29%         12

  IC = 17                               100%

Average CPI

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 2 \times 0.13 + 9 \times 0.29 + 3 \times 0.29 + 12 \times 0.29 = 7.2
$$

Total clock cycles

$$
N = CPI \times IC = 7.2 \times 17 = 122
$$

Run time with 1 GHz clock rate

$$
T = N \times \tau = 122 \times 10^{-9} \text{ seconds} = 1.22 \times 10^{-7} \text{ seconds} = 0.122 \text{ microseconds}
$$
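As a cross-check, the cycle count can also be summed directly from the execution counts instead of the rounded percentages. This sketch assumes the run counts from the assembly listing (2 loads, 5 iterations of each loop instruction); with exact counts the total comes to 124 cycles and a CPI of about 7.29, slightly above the 122 cycles and 7.2 obtained from the rounded shares.

```python
# (label, executions, cycles per instruction)
program = [
    ("load", 2, 2),
    ("ALU autoincrement", 5, 9),
    ("ALU", 5, 3),
    ("jump", 5, 12),
]

ic = sum(runs for _, runs, _ in program)
cycles = sum(runs * cpi for _, runs, cpi in program)
cpi = cycles / ic
seconds = cycles * 1e-9  # 1 GHz clock: 1 ns per cycle
print(ic, cycles, round(cpi, 2), seconds)
```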

Page 97: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Applying CPU Equation

1. Run time before enhancement
     Calculate T = CPI × IC × τ

2. Characterize enhancement
     Calculate CPI', IC', τ'

3. Run time after enhancement
     Calculate T' = CPI' × IC' × τ'

4. Speedup

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
$$
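The four steps map naturally onto a small data structure. The sketch below is illustrative only: the `Design` class, its field names, and the numbers in the usage lines are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Design:
    cpi: float  # average cycles per instruction
    ic: float   # instruction count
    tau: float  # seconds per clock cycle

    def run_time(self) -> float:
        # T = CPI x IC x tau
        return self.cpi * self.ic * self.tau


def speedup(before: Design, after: Design) -> float:
    # S = T / T'
    return before.run_time() / after.run_time()


# Hypothetical numbers, chosen only to exercise the four steps.
before = Design(cpi=2.0, ic=1_000_000, tau=1e-8)   # step 1
after = Design(cpi=1.6, ic=1_000_000, tau=1e-8)    # steps 2 and 3
s = speedup(before, after)                         # step 4
print(round(s, 2))
```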

Page 98: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

CPU Equation: Example 2

For a certain CPU
  25% of all instructions in programs are float (IC_FP / IC = 0.25)
    FP group includes ADD, SUB, MULT, DIV, SQRT
    Average FP instruction runs in 4 clock cycles (CPI_FP = 4)
  2% of all instructions in program are square root (IC_SQRT / IC = 0.02)
    SQRT (FP) instruction runs in 20 cycles (CPI_SQRT = 20)
  Average CPI for all other instructions in program is 4/3 clock cycles
    (IC_other / IC = 1 - 0.25 = 0.75, CPI_other = 4/3)

Average cycles per instruction

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 4 \times 0.25 + \frac{4}{3} \times (1 - 0.25) = 2.00
$$

Page 99: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 2

Two possible enhancements

1. Improve performance of all FP instructions
     Enhance average CPI_FP = 4 cycles to CPI_FP' = 2 cycles
     No change in program ⇒ IC_i' = IC_i for all instruction types
     No change to clock rate ⇒ τ' = τ

2. Improve performance of SQRT (FP) instruction
     Enhance CPI_SQRT = 20 cycles to CPI_SQRT' = 2 cycles
     No change in program ⇒ IC_i' = IC_i for all instruction types
     No change to clock rate ⇒ τ' = τ

To evaluate enhancements, must find CPI'

Page 100: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 3

Enhancement 1
  Improve average FP from CPI_FP = 4 cycles to CPI_FP' = 2 cycles

$$
\begin{aligned}
CPI' &= \sum_i CPI_i' \times \frac{IC_i'}{IC'} \\
     &= CPI_{FP}' \times \frac{IC_{FP}'}{IC'} + CPI_{other}' \times \frac{IC_{other}'}{IC'} \\
     &= 2 \times 0.25 + \frac{4}{3} \times (1 - 0.25) \\
     &= 1.50
\end{aligned}
$$

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{CPI \times IC \times \tau}{CPI' \times IC \times \tau}
  = \frac{CPI}{CPI'} = \frac{2.00}{1.50} \cong 1.33
$$

Page 101: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 4

Enhancement 2
  Improve square root (FP) from CPI_SQRT = 20 cycles to CPI_SQRT' = 2 cycles
  Must separate into 3 instruction groups
    FP/SQRT = FP group without SQRT = ADD, SUB, MULT, DIV
    SQRT
    All other instructions

$$
\begin{aligned}
CPI' &= \sum_i CPI_i' \times \frac{IC_i'}{IC'} \\
     &= CPI_{FP/SQRT}' \times \frac{IC_{FP/SQRT}'}{IC'} + CPI_{SQRT}' \times \frac{IC_{SQRT}'}{IC'} + CPI_{other}' \times \frac{IC_{other}'}{IC'} \\
     &= CPI_{FP/SQRT} \times \frac{IC_{FP/SQRT}}{IC} + CPI_{SQRT}' \times \frac{IC_{SQRT}}{IC} + CPI_{other} \times \frac{IC_{other}}{IC}
\end{aligned}
$$

$$
IC_{FP} / IC = 25\%, \quad IC_{SQRT} / IC = 2\% \;\Rightarrow\; IC_{FP/SQRT} / IC = 25\% - 2\% = 23\%
$$

First calculate CPI_FP/SQRT from CPI_FP, CPI_SQRT, IC_FP / IC, IC_SQRT / IC

Page 102: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 5

Average CPI for FP group with SQRT (CPI_FP = 4)

$$
CPI_{FP} = \frac{N_{FP}}{IC_{FP}}
         = \frac{\text{total cycles in FP}}{\text{instructions in FP}}
         = \frac{\sum_{k \in FP} CPI_k \times IC_k}{IC_{FP}}
         = \sum_{k \in FP} CPI_k \times \frac{IC_k}{IC_{FP}}
$$

Average CPI for FP group without SQRT

$$
CPI_{FP/SQRT} = \frac{N_{FP/SQRT}}{IC_{FP/SQRT}}
              = \frac{\text{total cycles in FP/SQRT}}{\text{instructions in FP/SQRT}}
              = \frac{\sum_{k \in FP/SQRT} CPI_k \times IC_k}{IC_{FP/SQRT}}
              = \sum_{k \in FP/SQRT} CPI_k \times \frac{IC_k}{IC_{FP/SQRT}}
$$

Page 103: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 6

$$
\begin{aligned}
CPI_{FP} &= \sum_{k \in FP} CPI_k \times \frac{IC_k}{IC_{FP}} \\
         &= \sum_{k \in FP/SQRT} CPI_k \times \frac{IC_k}{IC_{FP}} + CPI_{SQRT} \times \frac{IC_{SQRT}}{IC_{FP}} \\
         &= \left[ \sum_{k \in FP/SQRT} CPI_k \times \frac{IC_k}{IC_{FP/SQRT}} \right] \times \frac{IC_{FP/SQRT}}{IC_{FP}} + CPI_{SQRT} \times \frac{IC_{SQRT}}{IC_{FP}} \\
         &= CPI_{FP/SQRT} \times \frac{IC_{FP/SQRT}}{IC_{FP}} + CPI_{SQRT} \times \frac{IC_{SQRT}}{IC_{FP}} \\
         &= CPI_{FP/SQRT} \times \frac{IC_{FP}/IC - IC_{SQRT}/IC}{IC_{FP}/IC} + CPI_{SQRT} \times \frac{IC_{SQRT}/IC}{IC_{FP}/IC}
\end{aligned}
$$

$$
4 = CPI_{FP/SQRT} \times \frac{0.25 - 0.02}{0.25} + 20 \times \frac{0.02}{0.25}
\;\Rightarrow\; CPI_{FP/SQRT} = 2.61
$$
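Solving the last line for CPI_FP/SQRT is a one-line rearrangement, sketched below. The helper name `cpi_without_subgroup` and its argument names are assumptions for illustration; all shares are fractions of the whole program's instruction count.

```python
def cpi_without_subgroup(cpi_group, share_group, cpi_sub, share_sub):
    """Average CPI of a group after removing one of its subgroups.

    Rearranged from: cpi_group * share_group
                     = cpi_rest * (share_group - share_sub) + cpi_sub * share_sub
    """
    remaining_share = share_group - share_sub
    return (cpi_group * share_group - cpi_sub * share_sub) / remaining_share


# FP group: CPI 4 at 25% of instructions; SQRT: CPI 20 at 2% of instructions.
cpi_fp_no_sqrt = cpi_without_subgroup(4, 0.25, 20, 0.02)
print(round(cpi_fp_no_sqrt, 2))
```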

Page 104: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 7

Speedup for Enhancement 2

$$
\begin{aligned}
CPI' &= \sum_i CPI_i' \times \frac{IC_i'}{IC'} \\
     &= CPI_{FP/SQRT} \times \frac{IC_{FP/SQRT}}{IC} + CPI_{SQRT}' \times \frac{IC_{SQRT}}{IC} + CPI_{other} \times \frac{IC_{other}}{IC} \\
     &= 2.61 \times 0.23 + 2 \times 0.02 + \frac{4}{3} \times (1 - 0.25) \\
     &= 1.64
\end{aligned}
$$

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{CPI \times IC \times \tau}{CPI' \times IC \times \tau}
  = \frac{CPI}{CPI'} = \frac{2.00}{1.64} \cong 1.22
$$

Page 105: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example 2 - 8

Trick: technique to avoid calculating CPI_FP/SQRT

$$
\begin{aligned}
CPI' &= CPI - (CPI - CPI') \\
     &= CPI - \left( \sum_i CPI_i \times \frac{IC_i}{IC} - \sum_i CPI_i' \times \frac{IC_i'}{IC'} \right) \\
     &= CPI - \sum_i \left( CPI_i \times \frac{IC_i}{IC} - CPI_i' \times \frac{IC_i'}{IC'} \right)
\end{aligned}
$$

If IC_i' = IC_i then combine terms as

$$
\begin{aligned}
CPI' &= CPI - \sum_i \left[ (CPI_i - CPI_i') \times \frac{IC_i}{IC} \right] \\
     &= CPI - \left[ (CPI_{SQRT} - CPI_{SQRT}') \times \frac{IC_{SQRT}}{IC} \right] \\
     &= 2.00 - \left[ (20 - 2) \times 0.02 \right] \\
     &= 1.64
\end{aligned}
$$
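The shortcut only needs the groups whose CPI actually changes. A minimal sketch, assuming unchanged instruction counts as stated; the function name `cpi_after_change` is hypothetical.

```python
def cpi_after_change(cpi, changes):
    """New average CPI when only some per-group CPI values change.

    changes: list of (old_cpi_i, new_cpi_i, share_i), valid when IC_i' = IC_i.
    """
    return cpi - sum((old - new) * share for old, new, share in changes)


cpi = 2.00
cpi_sqrt_enhanced = cpi_after_change(cpi, [(20, 2, 0.02)])  # enhancement 2
cpi_fp_enhanced = cpi_after_change(cpi, [(4, 2, 0.25)])     # enhancement 1

print(round(cpi_sqrt_enhanced, 2), round(cpi / cpi_sqrt_enhanced, 2))
print(round(cpi_fp_enhanced, 2), round(cpi / cpi_fp_enhanced, 2))
```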

Page 106: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

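Example 2 - 9

5. Speedups
     Enhancement to square root: S = 1.22
     Enhancement to all FP: S = 1.33

Results identical to analysis by Amdahl equation
Can derive inputs to Amdahl equation from CPU analysis

1. Enhancement to all FP

$$
F_e^{FP} = \frac{t_{FP}}{T} = \frac{CPI_{FP} \times IC_{FP} \times \tau}{CPI \times IC \times \tau}
         = \frac{CPI_{FP}}{CPI} \times \frac{IC_{FP}}{IC} = \frac{4 \times 0.25}{2} = 50\%
$$

$$
S_e^{FP} = \frac{t_{FP}}{t_{FP}'} = \frac{CPI_{FP} \times IC_{FP} \times \tau}{CPI_{FP}' \times IC_{FP}' \times \tau'}
         = \frac{CPI_{FP}}{CPI_{FP}'} = \frac{4}{2} = 2
$$

2. Enhancement to square root

$$
F_e^{SQRT} = \frac{t_{SQRT}}{T} = \frac{CPI_{SQRT} \times IC_{SQRT} \times \tau}{CPI \times IC \times \tau}
           = \frac{CPI_{SQRT}}{CPI} \times \frac{IC_{SQRT}}{IC} = \frac{20 \times 0.02}{2} = 20\%
$$

$$
S_e^{SQRT} = \frac{t_{SQRT}}{t_{SQRT}'} = \frac{CPI_{SQRT} \times IC_{SQRT} \times \tau}{CPI_{SQRT}' \times IC_{SQRT}' \times \tau'}
           = \frac{CPI_{SQRT}}{CPI_{SQRT}'} = \frac{20}{2} = 10
$$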
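Feeding these derived inputs into the Amdahl equation reproduces the speedups found with the CPU equation. A short sketch of that cross-check; the function names are illustrative assumptions.

```python
def amdahl(fraction_enhanced, local_speedup):
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / local_speedup)


def amdahl_inputs(cpi, cpi_i, cpi_i_new, share_i):
    """Derive (F_e, S_e) from CPU-equation quantities, assuming IC and tau unchanged."""
    fraction = cpi_i * share_i / cpi   # share of run time spent in group i
    local_speedup = cpi_i / cpi_i_new
    return fraction, local_speedup


f_fp, s_fp = amdahl_inputs(2.00, 4, 2, 0.25)
f_sqrt, s_sqrt = amdahl_inputs(2.00, 20, 2, 0.02)

print(f_fp, s_fp, round(amdahl(f_fp, s_fp), 2))
print(f_sqrt, s_sqrt, round(amdahl(f_sqrt, s_sqrt), 2))
```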

Page 107: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Changing Instruction Mix - 1

Program distribution

  Instruction Type i    IC_i      CPI_i   IC_i / IC
  Integer               5,000     4       5000/10000 = 50%
  Load / Store          4,000     8       4000/10000 = 40%
  Branch                1,000     12      1000/10000 = 10%
  Total                 10,000            10000/10000 = 100%

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 4 \times 0.50 + 12 \times 0.10 + 8 \times 0.40 = 6.4 \text{ cycles per instruction}
$$

New program distribution

  Instruction Type i    IC_i'     CPI_i   IC_i' / IC'
  Integer               3,000     4       3000/8000 = 37.5%
  Load / Store          4,000     8       4000/8000 = 50%
  Branch                1,000     12      1000/8000 = 12.5%
  Total                 8,000             8000/8000 = 100%

$$
CPI' = \sum_i CPI_i \times \frac{IC_i'}{IC'} = 4 \times 0.375 + 12 \times 0.125 + 8 \times 0.50 = 7.0 \text{ cycles per instruction}
$$

Page 108: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Changing Instruction Mix - 2

Speedup

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{6.4 \times 10000 \times \tau}{7.0 \times 8000 \times \tau} = 1.14
$$
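This example shows that a higher average CPI can still mean a shorter run time when the instruction count drops. A small sketch using the two distributions; the dictionary layout is an assumption for illustration.

```python
def totals(groups):
    """groups maps a label to (instruction_count, cycles_per_instruction)."""
    ic = sum(count for count, _ in groups.values())
    cycles = sum(count * cpi for count, cpi in groups.values())
    return ic, cycles, cycles / ic


old = {"integer": (5000, 4), "load/store": (4000, 8), "branch": (1000, 12)}
new = {"integer": (3000, 4), "load/store": (4000, 8), "branch": (1000, 12)}

ic_old, cycles_old, cpi_old = totals(old)
ic_new, cycles_new, cpi_new = totals(new)

# Clock period is unchanged, so the speedup is the ratio of total cycles.
speedup = cycles_old / cycles_new
print(cpi_old, cpi_new, round(speedup, 2))
```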

Page 109: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

The Instructions Per Second Myth

Measures often used to describe computer power
  MIPS = million instructions per second
  FLOPS = floating point operations per second

Neither gives fair comparison

$$
MIPS = \frac{\text{instructions} / 10^{6}}{\text{run time}} = \frac{IC}{10^{6} \times T}
     = \frac{IC}{10^{6} \times CPI \times IC \times \tau} = \frac{R}{10^{6} \times CPI}
$$

Example
  CPU-1 and CPU-2
    Run ALU instructions in 1 cycle and others in 2 cycles
    Have clock speed of 1 GHz
  CPU-1 compiler produces 50% ALU instructions and 50% other
  CPU-2 compiler produces 25% fewer ALU instructions than CPU-1

$$
CPI_1 = 1 \times 0.50 + 2 \times 0.50 = 1.50
\;\Rightarrow\; MIPS_1 = \frac{10^{9} \text{ Hz}}{10^{6} \times 1.50} \cong 667 \text{ million instructions / sec}
$$

Page 110: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

MIPS - 2

For CPU-2

$$
\begin{aligned}
IC_1^{ALU} &= 0.50 \times IC_1 \\
IC_2^{ALU} &= 0.75 \times IC_1^{ALU} = 0.75 \times 0.50 \times IC_1 = 0.375 \times IC_1 \\
IC_2 &= IC_2^{ALU} + IC_2^{other} = 0.375 \times IC_1 + 0.50 \times IC_1 = 0.875 \times IC_1
\end{aligned}
$$

$$
\frac{IC_2^{ALU}}{IC_2} = \frac{0.375 \times IC_1}{0.875 \times IC_1} = 0.43
\qquad
\frac{IC_2^{other}}{IC_2} = \frac{0.50 \times IC_1}{0.875 \times IC_1} = 0.57
$$

$$
CPI_2 = 1 \times 0.43 + 2 \times 0.57 = 1.57
$$

$$
MIPS_2 = \frac{10^{9} \text{ Hz}}{10^{6} \times 1.57} \cong 637 \text{ million instructions / sec} < MIPS_1
$$

$$
\frac{MIPS_2}{MIPS_1} = \frac{637}{667} = 0.955
$$

Page 111: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

MIPS - 3

Run time comparison

$$
S = \frac{T_1}{T_2} = \frac{CPI_1 \times IC_1 \times \tau}{CPI_2 \times IC_2 \times \tau}
  = \frac{CPI_1 \times IC_1}{CPI_2 \times IC_2}
  = \frac{1.50 \times IC_1}{1.57 \times (0.875 \times IC_1)} = 1.09
$$

MIPS is about 5% lower for CPU-2 than CPU-1
CPU-2 is about 9% faster than CPU-1
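The contrast between the MIPS rating and the actual run time can be reproduced numerically. In this sketch instruction counts are expressed relative to IC_1 = 1; the function name `cpi_and_mips` is a hypothetical helper.

```python
def cpi_and_mips(alu_count, other_count, clock_hz):
    """ALU instructions take 1 cycle, all others take 2 cycles."""
    ic = alu_count + other_count
    cpi = (1 * alu_count + 2 * other_count) / ic
    return ic, cpi, clock_hz / (1e6 * cpi)


# Counts are relative to the CPU-1 instruction count (IC_1 = 1).
ic1, cpi1, mips1 = cpi_and_mips(0.50, 0.50, 1e9)
ic2, cpi2, mips2 = cpi_and_mips(0.75 * 0.50, 0.50, 1e9)

speedup = (cpi1 * ic1) / (cpi2 * ic2)  # T_1 / T_2 with the same clock
print(round(mips1), round(mips2), round(speedup, 2))
```

CPU-2 has the lower MIPS rating yet finishes the program sooner, because it executes fewer instructions.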

Page 112: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Replacing Instruction Types

Instruction count
  IC = IC_1 + IC_2 + ... + IC_n
  Examples
    Type 1 = ALU
    Type 2 = Conditional Branch

New instruction count
  Replace 2 ALU instructions + 1 Branch
    DEC CX
    CMP CX, 0
    JNZ target
  New instruction
    LOOP target
  IC' = IC_1' + IC_2' + ... + IC_n

Page 113: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Example: Replacing Instructions

A certain CPU has no floating point unit (FPU)
  Performs FP calculations by EMULATION
    Converts FP operations to integer operations
    Example
      (2.165 × 10^4) × (3.247 × 10^-3) → 2165 × 3247, exp = (4 - 3) + (-3 - 3)

Instruction distribution

  Type i    IC_i / IC   CPI_i
  ALU       75%         1
  Load      10%         2
  Store     5%          2
  Branch    10%         2

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 1 \times 0.75 + 2 \times (0.10 + 0.05 + 0.10) = 1.25
$$

Page 114: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Replacing Instructions - 2

Enhance CPU with FPU
  Replace ALU instructions that emulate FP with new FP instructions
    2/3 of old ALU instructions emulate FP instructions
    1 new FPU instruction replaces 10 old ALU emulation instructions
    New FPU instructions run in 4 clock cycles

  Type i    IC_i / IC   CPI_i
  ALU       75%         1       ALU_emulation = 2/3 × 75% = 50%, ALU_int = 1/3 × 75% = 25%
  Load      10%         2
  Store     5%          2
  Branch    10%         2

  IC'_ALU    = 0.25 IC
  IC'_FPU    = 1/10 × 0.50 IC = 0.05 IC
  IC'_load   = IC_load = 0.10 IC
  IC'_store  = IC_store = 0.05 IC
  IC'_branch = IC_branch = 0.10 IC

  IC' = 0.25 IC + 0.05 IC + 0.10 IC + 0.05 IC + 0.10 IC = 0.55 IC

Page 115: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Replacing Instructions - 3

New instruction distribution

  Type i    IC_i' / IC'    CPI_i'
  ALU       0.25 / 0.55    1
  FPU       0.05 / 0.55    4
  Load      0.10 / 0.55    2
  Store     0.05 / 0.55    2
  Branch    0.10 / 0.55    2

$$
CPI' = \sum_i CPI_i' \times \frac{IC_i'}{IC'}
     = 1 \times \frac{0.25}{0.55} + 4 \times \frac{0.05}{0.55} + 2 \times \left( \frac{0.10}{0.55} + \frac{0.05}{0.55} + \frac{0.10}{0.55} \right)
     = \frac{0.95}{0.55}
$$

$$
CPI' = 1.73 > CPI = 1.25
$$

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{1.25 \times IC}{\dfrac{0.95}{0.55} \times (0.55 \times IC)} = \frac{1.25}{0.95} \cong 1.32
$$
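The FPU example can be replayed in code to see CPI rise while run time falls. This is a sketch under the slide's assumptions (2/3 of ALU instructions emulate FP, 10 emulation instructions per FPU instruction); shares are fractions of the original instruction count, and the variable names are illustrative.

```python
# label -> (share of ORIGINAL instruction count, cycles per instruction)
old = {"ALU": (0.75, 1), "load": (0.10, 2), "store": (0.05, 2), "branch": (0.10, 2)}

emulation = old["ALU"][0] * 2 / 3             # ALU instructions that emulate FP
new = dict(old)
new["ALU"] = (old["ALU"][0] - emulation, 1)   # remaining integer ALU work
new["FPU"] = (emulation / 10, 4)              # 1 FPU instruction per 10 emulation instructions


def totals(mix):
    ic = sum(share for share, _ in mix.values())
    cycles = sum(share * cpi for share, cpi in mix.values())
    return ic, cycles, cycles / ic


ic_old, cycles_old, cpi_old = totals(old)
ic_new, cycles_new, cpi_new = totals(new)
speedup = cycles_old / cycles_new             # clock period unchanged
print(round(ic_new, 2), round(cpi_new, 2), round(speedup, 2))
```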

Page 116: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Load-Store versus Register-Memory

CPU-1 is a load-store machine
  ALU operands must come from register
  Memory operand
    Loaded to register before ALU operation
    Stored to memory after ALU operation

Instruction distribution

  Type i    IC_i / IC   CPI_i
  ALU       40%         4
  Load      25%         5
  Store     15%         4
  Branch    20%         4

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 5 \times 0.25 + 4 \times 0.75 = 4.25
$$

Possible enhancement
  25% of ALU memory operands used in only 1 ALU operation
  Can register-memory ALU operations improve performance?

Page 117: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Load-Store versus Register-Memory

CPU-2 is an "ideal" register-memory machine
  ALU operands may come from register or memory
  75% of memory operands
    Used in multiple ALU operations
    Perfect compiler loads "multiple" memory operands to registers
  25% of ALU memory operands
    Used in only a single ALU operation
    Perfect compiler never loads "single" memory operands to registers

Convert CPU-1 to CPU-2
  Split ALU operations into ALU_multi and ALU_single
  Replace ALU_single with ALU_register-memory
  Cancel 1 register load for every ALU_single

Page 118: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Detailed Instruction Distribution

  Type i    IC_i / IC   CPI_i   Split
  ALU       40%         4       ALU_multi = 30%, ALU_single = 10%
  Load      25%         5       Load_multi = 25% - 10% = 15%, Load_single = 10%
  Store     15%         4
  Branch    20%         4

  Type i         IC_i / IC   CPI_i
  ALU_single     10%         4
  ALU_multi      30%         4
  Load_single    10%         5
  Load_multi     15%         5
  Store          15%         4
  Branch         20%         4

ALU_single and Load_single are replaced by

  Type i                 IC_i / IC   CPI_i
  ALU_register-memory    10%         9

$$
CPI_{ALU\,reg\text{-}mem} = CPI_{ALU} + CPI_{Load}
$$

Page 119: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

New Instruction Distribution and Speedup

$$
\begin{aligned}
IC' &= IC_{ALU\,reg\text{-}mem}' + IC_{ALU\,multi} + IC_{Load\,multi} + IC_{Store} + IC_{Branch} \\
    &= 0.10\,IC + 0.30\,IC + 0.15\,IC + 0.15\,IC + 0.20\,IC \\
    &= 0.90\,IC
\end{aligned}
$$

  Type i                 IC_i' / IC'   CPI_i'
  ALU_register-memory    10/90         9
  ALU_multi              30/90         4
  Branch                 20/90         4
  Load_multi             15/90         5
  Store                  15/90         4

$$
CPI' = \sum_i CPI_i' \times \frac{IC_i'}{IC'}
     = 4 \times \frac{65}{90} + 5 \times \frac{15}{90} + 9 \times \frac{10}{90}
     = \frac{425}{90} = 4.72
$$

$$
S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'}
  = \frac{4.25 \times IC}{\dfrac{425}{90} \times (0.90 \times IC)} = 1
  \;\Rightarrow\; \text{No Change in Performance}
$$
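The "no change" result follows because the register-memory instruction costs exactly what the load plus ALU pair it replaces cost. A sketch of that bookkeeping, with shares expressed as fractions of the CPU-1 instruction count; the variable names are illustrative assumptions.

```python
# label -> (share of CPU-1 instruction count, cycles per instruction)
cpu1 = {"ALU": (0.40, 4), "load": (0.25, 5), "store": (0.15, 4), "branch": (0.20, 4)}

single = 0.10  # ALU operations whose memory operand is used only once
cpu2 = {
    "ALU multi": (0.40 - single, 4),
    "ALU register-memory": (single, 4 + 5),  # CPI_ALU + CPI_Load
    "load multi": (0.25 - single, 5),        # one load cancelled per ALU single
    "store": (0.15, 4),
    "branch": (0.20, 4),
}


def cycles(mix):
    return sum(share * cpi for share, cpi in mix.values())


def instruction_count(mix):
    return sum(share for share, _ in mix.values())


cpi2 = cycles(cpu2) / instruction_count(cpu2)
speedup = cycles(cpu1) / cycles(cpu2)
print(round(instruction_count(cpu2), 2), round(cpi2, 2), round(speedup, 2))
```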

Page 120: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Analysis of 8086 Example

8086 program compiled from C source

  Label    Instruction                  Clock Cycles   Runs        Type
           MOV WORD PTR [BP-02],0000    13             1           Store
  start:   CMP WORD PTR [BP-02],N       10             N           ALUimm-mem
           JGE stop                     4/13           N-1 / 1     Conditional Branch
           MOV AX,[BP-02]               9              N-1         Load
           SHL AX,1                     2              N-1         ALUreg
           MOV [BP-04],AX               12             N-1         Store
           INC WORD PTR [BP-02]         15             N-1         ALUreg-mem
           JMP start                    14             N-1         Unconditional Branch
  stop:    RET                          16             1           Return

Page 121: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

CPI for Store and ALU

JGE
  Runs N-1 times in 4 clock cycles and 1 time in 13 clock cycles

$$
CPI_{JGE} = \frac{\text{cycles}_{JGE}}{\text{instructions}_{JGE}}
          = \frac{4 \times (N - 1) + 13 \times 1}{(N - 1) + 1}
          = \frac{4 \times (N - 1) + 13}{N}
$$

STORE
  Runs N-1 times in 12 clock cycles and 1 time in 13 clock cycles

$$
CPI_{STORE} = \frac{\text{cycles}_{STORE}}{\text{instructions}_{STORE}}
            = \frac{12 \times (N - 1) + 13 \times 1}{(N - 1) + 1}
            = \frac{12 \times (N - 1) + 13}{N}
$$

Page 122: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Instruction Distribution

  Type i                  IC_i / IC          CPI_i
  Load                    (N-1) / (7N-3)     9
  Store                   N / (7N-3)         [12(N-1) + 13] / N
  ALUreg-mem              (N-1) / (7N-3)     15
  Conditional Branch      N / (7N-3)         [4(N-1) + 13] / N
  Unconditional Branch    (N-1) / (7N-3)     14
  Return                  1 / (7N-3)         16
  ALUreg                  (N-1) / (7N-3)     2
  ALUimm-mem              N / (7N-3)         10

Page 123: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Instruction Distribution for Loop Length = 100

  N = 100
  IC = 7N - 3 = 697

  Type i                  IC_i / IC   CPI_i
  Load                    14.20%      9
  Store                   14.34%      12.01
  ALUreg-mem              14.20%      15
  Conditional Branch      14.34%      4.09
  Unconditional Branch    14.20%      14
  Return                  0.14%       16
  ALUreg                  14.20%      2
  ALUimm-mem              14.34%      10

$$
CPI = \sum_i CPI_i \times \frac{IC_i}{IC}
    = (9 + 15 + 2 + 14) \times 0.1420 + (12.01 + 10 + 4.09) \times 0.1434 + 16 \times 0.0014 = 9.45
$$

$$
T = \frac{CPI \times IC}{R}
  = \frac{(9.45 \text{ cycles / instruction}) \times (697 \text{ instructions})}{4 \text{ MHz}}
  = 1.646 \times 10^{-3} \text{ sec}
$$

Estimated run time for 296 MHz UltraSPARC II = 4.71 × 10^-7 sec

$$
S = \frac{1.646 \times 10^{-3}}{4.71 \times 10^{-7}} = 3494
$$
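The run time can also be obtained by summing cycles straight from the program listing, without going through percentages. A sketch assuming the cycle counts and run counts of the memory-variable listing and the 4 MHz clock used above; the function name is hypothetical.

```python
def memory_version(n):
    """(clock cycles, number of executions) for each instruction, loop bound n."""
    return [
        (13, 1),       # MOV WORD PTR [BP-02],0000
        (10, n),       # CMP WORD PTR [BP-02],N
        (4, n - 1),    # JGE stop, branch not taken
        (13, 1),       # JGE stop, branch taken
        (9, n - 1),    # MOV AX,[BP-02]
        (2, n - 1),    # SHL AX,1
        (12, n - 1),   # MOV [BP-04],AX
        (15, n - 1),   # INC WORD PTR [BP-02]
        (14, n - 1),   # JMP start
        (16, 1),       # RET
    ]


rows = memory_version(100)
ic = sum(runs for _, runs in rows)
cycles = sum(c * runs for c, runs in rows)
cpi = cycles / ic
seconds = cycles / 4e6  # 4 MHz clock
print(ic, cycles, round(cpi, 2), seconds)
```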

Page 124: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

Register Variables

Improved program
  Memory variables replaced with register variables

  Label    Instruction     Clock Cycles   Runs        Type
           MOV SI,0000     4              1           MOVimm-reg
  start:   CMP SI,+0A      3              N           ALUimm-reg
           JGE stop        4/13           N-1 / 1     Conditional Branch
           MOV AX,SI       2              N-1         MOVreg-reg
           SHL AX,1        2              N-1         ALUreg
           MOV DI,AX       2              N-1         MOVreg-reg
           INC SI          3              N-1         ALUreg-reg
           JMP start       14             N-1         Unconditional Branch
  stop:    RET             16             1           Return

Page 125: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

New Instruction Distribution

  Type i                  IC_i' / IC'          CPI_i'
  MOVimm-reg              1 / (7N-3)           4
  MOVreg-reg              2(N-1) / (7N-3)      2
  ALUreg-reg              (N-1) / (7N-3)       3
  Conditional Branch      N / (7N-3)           [4(N-1) + 13] / N
  Unconditional Branch    (N-1) / (7N-3)       14
  Return                  1 / (7N-3)           16
  ALUreg                  (N-1) / (7N-3)       2
  ALUimm-reg              N / (7N-3)           3

Page 126: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

New Distribution for Loop Length = 100

  N = 100
  IC' = 7N - 3 = 697

  Type i                  IC_i' / IC'   CPI_i'
  MOVimm-reg              0.14%         4
  MOVreg-reg              28.41%        2
  ALUreg-reg              14.20%        3
  Conditional Branch      14.35%        4.09
  Unconditional Branch    14.20%        14
  Return                  0.14%         16
  ALUreg                  14.20%        2
  ALUimm-reg              14.35%        3

$$
CPI' = \sum_i CPI_i' \times \frac{IC_i'}{IC'}
     = (4 + 16) \times 0.0014 + 2 \times 0.2840 + (3 + 2 + 14) \times 0.1420 + (3 + 4.09) \times 0.1435 = 4.31
$$

$$
T' = \frac{CPI' \times IC'}{R}
   = \frac{(4.31 \text{ cycles / instruction}) \times (697 \text{ instructions})}{4 \text{ MHz}}
   = 7.515 \times 10^{-4} \text{ sec}
$$

Run time with memory variables = 1.646 × 10^-3 sec

$$
S = \frac{1.646 \times 10^{-3}}{7.515 \times 10^{-4}} = 2.19
$$
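The register-variable version can be totalled the same way, directly from its listing. This sketch takes the memory-variable run time of 1.646 × 10^-3 sec as given from the earlier slide; names are illustrative assumptions.

```python
def register_version(n):
    """(clock cycles, number of executions) for each instruction, loop bound n."""
    return [
        (4, 1),        # MOV SI,0000
        (3, n),        # CMP SI,+0A
        (4, n - 1),    # JGE stop, branch not taken
        (13, 1),       # JGE stop, branch taken
        (2, n - 1),    # MOV AX,SI
        (2, n - 1),    # SHL AX,1
        (2, n - 1),    # MOV DI,AX
        (3, n - 1),    # INC SI
        (14, n - 1),   # JMP start
        (16, 1),       # RET
    ]


rows = register_version(100)
ic_new = sum(runs for _, runs in rows)
cycles_new = sum(c * runs for c, runs in rows)
seconds_new = cycles_new / 4e6          # 4 MHz clock

memory_seconds = 1.646e-3               # run time with memory variables (from the slide)
speedup = memory_seconds / seconds_new
print(ic_new, cycles_new, seconds_new, round(speedup, 2))
```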

Page 127: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-51Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

4

4

8

2

CPIi

150Store

250Load

200Branch

400ALU

ICiType i

Instruction CountIC

Cycles Per InstructionCPI

Page 128: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-52Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

∑= ii

IC IC

Page 129: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-53Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

15%

25%

20%

40%

ICi/IC

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

Page 130: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-54Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

0.6

1.0

1.6

0.8

CPIi * ICi/IC

15%

25%

20%

40%

ICi/IC

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

Page 131: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-55Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

4.0CPI

0.6

1.0

1.6

0.8

CPIi * ICi/IC

15%

25%

20%

40%

ICi/IC

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

×∑= ii

i

ICCPI CPI

IC

Page 132: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-56Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

4.0CPI

0.6

1.0

1.6

0.8

CPIi * ICi/IC

15%

25%

20%

40%

ICi/IC

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

= × = × == × ×τ = τ

N CPI IC 4.0 1000 4000T CPI IC 4000

Page 133: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-57Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

Software Content — Instruction Distribution

4.0CPI

600

1000

1600

800

Ni

0.6

1.0

1.6

0.8

CPIi * ICi/IC

15%

25%

20%

40%

ICi/IC

4

4

8

2

CPIi

1000IC

150Store

250Load

200Branch

400ALU

ICiType i

= ×i i iN CPI IC

Page 134: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

3-58Dr. Martin LandQuantitative Performance TheoryComputer Architecture — Hadassah College — Spring 2019

The total clock cycle count can be obtained in two equivalent ways:

$$ N = CPI \times IC = 4.0 \times 1000 = 4000 $$

$$ N = \sum_i N_i = 800 + 1600 + 1000 + 600 = 4000 $$


Software Content: Instruction Distribution (fraction of run time per type)

Type i    IC_i    CPI_i    IC_i/IC    CPI_i × IC_i/IC    N_i     F_i
ALU        400    2        40%        0.8                 800    20%
Branch     200    8        20%        1.6                1600    40%
Load       250    4        25%        1.0                1000    25%
Store      150    4        15%        0.6                 600    15%
Total     1000 (IC)                   4.0 (CPI)

$$ F_i = \frac{t_i}{T} = \frac{CPI_i \times IC_i \times \tau}{CPI \times IC \times \tau} = \frac{CPI_i \times IC_i}{CPI \times IC} = \frac{N_i}{N} $$


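Enhancement

Branch CPI is reduced from 8 to 4:

Type i    IC_i    CPI_i'    IC_i/IC    CPI_i' × IC_i/IC    N_i'    F_i'
ALU        400    2         40%        0.8                  800    25%
Branch     200    8 → 4     20%        1.6 → 0.8            800    25%
Load       250    4         25%        1.0                 1000    31%
Store      150    4         15%        0.6                  600    19%
Total     1000 (IC)                    3.2 (CPI')

CPU Equation

$$ S = \frac{T}{T'} = \frac{CPI \times IC \times \tau}{CPI' \times IC' \times \tau'} = \frac{CPI}{CPI'} = \frac{4.0}{3.2} = 1.25 $$

Amdahl Equation

$$ S = \frac{T}{T'} = \frac{1}{1 - F_e + \frac{F_e}{S_e}} = \frac{1}{1 - 0.4 + \frac{0.4}{2}} = \frac{1}{0.6 + 0.2} = 1.25 $$

where the local speedup of the enhanced (branch) instructions is

$$ S_e = \frac{t_e}{t_e'} = \frac{8 \times IC_e \times \tau}{4 \times IC_e \times \tau} = 2 $$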
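The agreement between the CPU equation and the Amdahl equation can be reproduced numerically. The sketch below assumes, as the slide does, that IC and the clock period are unchanged by the enhancement; all variable names are illustrative.

```python
# type -> (IC_i, CPI_i) before the enhancement
mix = {"ALU": (400, 2), "Branch": (200, 8), "Load": (250, 4), "Store": (150, 4)}

def total_cycles(mix):
    """N = sum of CPI_i * IC_i over all instruction types."""
    return sum(ic * cpi for ic, cpi in mix.values())

# Enhancement: branch CPI drops from 8 to 4.
enhanced = dict(mix, Branch=(200, 4))

n, n_new = total_cycles(mix), total_cycles(enhanced)

# CPU equation: with IC and tau fixed, S = T / T' = N / N'.
s_cpu = n / n_new

# Amdahl equation: F_e is the run-time fraction of branches before the
# enhancement, S_e is the local speedup of branches.
ic_b, cpi_b = mix["Branch"]
f_e = ic_b * cpi_b / n
s_e = cpi_b / enhanced["Branch"][1]
s_amdahl = 1 / ((1 - f_e) + f_e / s_e)

print(s_cpu, s_amdahl)
```

Both values come out at about 1.25, matching the slide. Note that F_e must be the fraction of run time (40%), not the fraction of instruction count (20%).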


Instruction Distributions: CPU analysis

Permits performance analysis of machine design "on drawing board"
    Evaluate proposed design without building CPU implementation

Summary of procedure
    Specify Instruction Set Architecture (ISA)
        Describes machine language for proposed CPU
        Provides human-readable assembly language
        Determines CPI_i for each instruction group i
            Count clock cycles to implement a single instruction in ISA
    Write C, C++, Fortran compilers for proposed machine language
    Compile representative programs to machine language
        Can use programs from SPEC CINT and CFP
    Sort instructions into groups to find relative instruction count IC_i/IC
    Calculate average CPI and run time T
    Compare run time with reference machine


From CISC to RISC


CISC Creates the Anti-CISC Revolution

Digital Equipment Corporation (DEC) introduces VAX (1977)
    Commercially successful 32-bit CISC minicomputer

In 1970s and 1980s
    CISC minicomputers became cheaper
    Serious computers became available to small organizations
    UNIX developed as minicomputer operating system
    TCP/IP developed to support networks of minicomputers
    Computer Science emerged as separate academic discipline
    Students needed topics for final projects, theses, dissertations

Research results on CISC performance
    Most machine instructions are never used
    CISC implementations give up speed in favor of generality
    CISC machines run slowly to support unnecessary features


CISC Limitations

CISC instruction set requires microcode
    Many different instruction types
        Each instruction requires different implementation
    Complex operations
        Many instructions require complex decoding and sequencing

Central bus organization
    Atomic microcode operations
    System bus = bottleneck
        Microcode operations: sequential
        Machine instructions: sequential
    Machine instruction executes in multiple clock cycles

Memory access
    Operation complexity: non-uniform instruction length
    Instruction fetch: multiple clock cycles to load instruction

[Figure: central bus organization. Main Memory (Address, Data) with MAR and MDR, PC, IR, Decoder, Status Word, Registers, and ALU Subsystem (IN, OUT, ALU Operation, ALU Result Flag) all attached to one System Bus under common control]


RISC "Philosophy"

Technological developments from 1975 to 1990
    Price of RAM: from $5000 / MByte (1975) to $5 / MByte (1990)
    Compilers: powerful and efficient with extensive optimization
    Unix, C, and TCP/IP: practical portable code

Principal research result on CISC performance
    ~90% of run time is spent in ~10% of the VAX ISA
    ~90% of the VAX instruction set accounts for < 10% of run time

Reduced Instruction Set Computer (RISC), 1984
    Apply Amdahl's "Law" to Instruction Set Architecture (ISA)
        Speed up operations accounting for most of run time
        Ignore performance degradation to other instructions
    RISC ISA: keep most important instructions from CISC ISA
        Other CISC instructions implemented as multiple RISC instructions
    Simple hardware implementation: faster execution


RISC Microprocessors

Simpler ISA
    Fewer machine instructions
    All instructions are same length

Simpler hardware design
    Allows lower CPI_i and higher clock speed
    No microcode: all instructions implemented in similar way
    No dedicated system bus
    CPU can process several instructions at once
    An instruction completes execution on almost every clock cycle

High level program compiled to RISC
    Larger IC_i: more machine instructions than compilation for CISC
    Runs more quickly than the same high level program on CISC

All processors today use RISC technology
    Pure RISC (IBM Power, SPARC, MIPS, ARM, …)
    RISC technology for CISC language: Intel x86 (Pentium, Core, Xeon)
    Explicitly parallel RISC (Intel Itanium, IBM mainframes)


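CISC vs. Pure RISC

                       CISC         RISC
Instruction Types      300          50
Addressing Modes       15           5
Data Types             10           2
Procedure Handling     Automated    Coded
Implementations        Complex      Simple
Memory Organization    Complex      Simple

$$ S = \frac{T_{CISC}}{T_{RISC}} = \frac{CPI_{CISC} \times IC_{CISC} \times \tau_{CISC}}{CPI_{RISC} \times IC_{RISC} \times \tau_{RISC}} = \frac{CPI_{CISC}}{CPI_{RISC}} \times \frac{IC_{CISC}}{IC_{RISC}} \times \frac{\tau_{CISC}}{\tau_{RISC}} \approx 6 \times \frac{1}{2} \times \frac{\tau_{CISC}}{\tau_{RISC}} = 3 \times \frac{\tau_{CISC}}{\tau_{RISC}} $$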


Designing a RISC ISA


Considerations for a RISC ISA

Goals
    Simple: no instruction should require more steps than others
    Complete: able to perform any desired computation
    Orthogonal: only one way to encode any given computation

Choices
    Computation model
        Register-register
        Register-memory
    Range and type of operations
    Operands
        Data types
        Data sizes
    Addressing modes
        Displacement sizes
    Branch types
        Conditional
        Unconditional
        Procedural (call/return)
        Branch offset (length of jump)


Instruction Types

Representative instruction distribution
    Five programs from SPECint92 benchmark suite
    Compiled for x86 instruction set (ISA for Intel 386/486/Pentium)

Instruction           Relative Proportion of Total Run Time
Load                  22%
Conditional branch    20%
Compare               16%
Store                 12%
Add                    8%
And                    6%
Sub                    5%
Move reg-reg           4%
Call                   1%
Return                 1%
Other                  5%
Total                100%

Ref: Hennessy / Patterson, figure 2.11

First 10 instructions account for 95% of run time

Amdahl's "Law"
    Fast implementation of 95%
    Other 5% will not seriously degrade performance

Must include unconditional branch for completeness


Addressing Modes Graph

Ref: Hennessy / Patterson, figure 2.6


Addressing Modes

Representative instruction distribution
    Three programs from SPEC CINT92 and SPEC CFP95 benchmarks
    Compiled for VAX instruction set

Mode                 tex    spice    gcc    Example of Mode
register deferred     24      3       11    mem[R1]
immediate             43     17       39    #11223344
displacement          32     55       40    mem[R1 + disp]
memory indirect        1      6        1    mem[mem[R1]]
scaled                 0     16        6    mem[R1 + R2 * d + disp]
other                  0      3        3
total                100    100      100
total (top 3)         99     75       90

(values are percentages of operand accesses)

First three addressing modes
    Account for more than 75% of all operand accesses

Ref: Hennessy / Patterson, figure 2.6


Instruction Length

Instructions should be of uniform length
    Simplifies instruction DECODING
        No need to calculate instruction length
        Instruction fields are always in same place
    Enables INSTRUCTION FETCH in 1 clock cycle

Practical instruction lengths
    Most RISC machines for servers/workstations use 32-bit instructions
    Special purpose RISC machines use longer instructions
    Itanium and mainframes use 128-bit instructions

ISA defines 32-bit instructions
    No single field can be 32 bits long
    Includes address displacements, immediates, branch length

[Figure: 32-bit instruction word divided into op code and operands]


Length of Immediate Operand Graph

Ref: Hennessy / Patterson, figure 2.9


Length of Immediate Operand

Representative instruction distribution
    Three programs from SPEC CINT92 and SPEC CFP95 benchmarks
    Compiled for VAX instruction set

Immediate size (bits)    tex    spice    gcc
 0                         3      1       1
 4                        45     13      50
 8                         4     35      22
12                         3     15       4
16                        15     14       3
20                        25     10      18
24                         2     12       0
28                         1      0       0
32                         2      0       2
Total                    100    100     100
Total to 16 bits          70     78      80

(values are percentages of immediate operands)

Ref: Hennessy / Patterson, figure 2.9

Allocating 16 bits in a 32-bit instruction for immediate operands covers 70% or more of cases

Example 16-bit immediate: #1122
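The "Total to 16 bits" row is a cumulative sum over the size distribution. The following sketch recomputes it and lets other field widths be tried; the names `sizes`, `usage`, and `coverage` are illustrative only.

```python
# Percent of immediates by size in bits, copied from the table
sizes = [0, 4, 8, 12, 16, 20, 24, 28, 32]
usage = {
    "tex":   [3, 45, 4, 3, 15, 25, 2, 1, 2],
    "spice": [1, 13, 35, 15, 14, 10, 12, 0, 0],
    "gcc":   [1, 50, 22, 4, 3, 18, 0, 0, 2],
}

def coverage(field_bits):
    """Percent of immediates that fit in a field of field_bits bits."""
    return {
        prog: sum(p for s, p in zip(sizes, pcts) if s <= field_bits)
        for prog, pcts in usage.items()
    }

print(coverage(16))  # {'tex': 70, 'spice': 78, 'gcc': 80}
```

Widening the field beyond 16 bits would cover more cases, but it would not fit alongside an opcode and two register fields in a 32-bit instruction.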


Displacement Length Graph

Ref: Hennessy / Patterson, figure 2.7


Displacement Length

Representative instruction distribution
    Programs from SPEC CINT92 and SPEC CFP95 benchmarks
    Compiled for VAX instruction set

Bits in address displacement    int    FP
 0                               26      7
 1                                1      0
 2                                6      6
 3                               12      8
 4                               16      5
 5                                6     10
 6                               10      4
 7                                6      3
 8                                2      5
 9                                1      1
10                                1     10
11                                0      4
12                                0      7
13                                1      6
14                                0      4
15                               12     20
Total                           100    100

(values are percentages of displacements)

Ref: Hennessy / Patterson, figure 2.7

Allocating 16 bits for address displacements covers almost all cases

Example: mem[R1 + 1122]


Branch Instructions Graph

Ref: Hennessy / Patterson, figure 2.12


Branch Instructions

Representative instruction distribution
    Programs from SPEC CINT92 and SPEC CFP95 benchmarks
    Compiled for VAX instruction set

                                                 Integer    FP
Call / Return                                      13       10
Unconditional branch                                6        4
Conditional branch                                 81       86
Total                                             100      100
Total of Conditional and Unconditional Branch      87       90

(values are percentages of branch instructions)

Ref: Hennessy / Patterson, figure 2.12

Conditional branch accounts for more than 80% of all branch instructions

Unconditional branch must be included for completeness

Call and return
    Include many steps: saving registers and branching
    Are difficult to implement


Branch Offset Graph

Ref: Hennessy / Patterson, figure 2.13


Branch Offset

Representative instruction distribution
    Programs from SPEC CINT92 and SPEC CFP95 benchmarks
    Compiled for VAX instruction set

Offset bits for branch address    int    FP
 0                                  0      0
 1                                  1      0
 2                                 13     36
 3                                 26     21
 4                                 16     11
 5                                 24     12
 6                                  6      9
 7                                  5      6
 8                                  6      4
 9                                  2      1
10                                  1      0
11                                  0      0
12                                  0      0
13                                  0      0
14                                  0      0
15                                  0      0
Total                             100    100

(values are percentages of branches)

Ref: Hennessy / Patterson, figure 2.13

Allocating 16 bits for branch offsets covers almost all cases

Example: PC ← PC + 1122


Summary: RISC ISA By the Numbers

Instruction Types
    10 instructions cover 95% of run time
    Choose 30 – 50 most necessary / convenient instructions

Addressing Modes (75% – 90% of run time addressing modes)
    Register
    Immediate
    Displacement

Instruction Length
    32-bit instructions

Branch Instructions
    Conditional branch
    Unconditional branch

Length of immediate values
    16-bit length for
        Immediate operand (70% – 80% of run time immediates)
        Displacement (100% of run time address displacements)
        Branch offset (100% of run time branch offsets)


DLX Architecture

A Model RISC Processor


DLX Architecture: General Features

Flat memory model with 32-bit address

Data types
    Integers (32-bit)
    Floating Point
        Single precision (32-bit)
        Double precision (64-bit)

Register-register operation model
    32 integer registers (32 bits wide)
        Named R0, R1, ... , R31
        Addressed as 00000 to 11111 in register address space
        Reg[R0] = 0 (constant)
        Other registers identical (no special purpose registers)
    32 FP registers (32 bits wide)
        Named F0, F1, ... , F31
        Satisfy IEEE 754 standard FP format
        A double precision FP value is stored in a register pair (even, odd)

[Figure: DLX block diagram with instruction cache, integer registers R0 ... R31 and ALU, FP registers F0 ... F31 and FPU, and data cache]


Addressing Modes

Three memory addressing modes implemented using Displacement
    Displacement         100(R1)    Reg[R3] ← Mem[100+Reg[R1]]
    Register Deferred    0(R1)      Reg[R3] ← Mem[0+Reg[R1]]
    Absolute             100(R0)    Reg[R3] ← Mem[100+Reg[R0]]

Mode                 Example             Operation
Register             ADD R3, R4, R5      Reg[R3] ← Reg[R4] + Reg[R5]
Immediate            ADD R3, R4, #3      Reg[R3] ← Reg[R4] + 3
Displacement         LW R3, 100(R1)      Reg[R3] ← Mem[100+Reg[R1]]
Register Deferred    LW R3, 0(R1)        Reg[R3] ← Mem[Reg[R1]]
Absolute             LW R3, 100(R0)      Reg[R3] ← Mem[100]


Data Transfer Instructions

Notation: ←n transfers n bits; X_k is bit k of X, with bit 0 the most significant; (b)^n is bit b replicated n times; ## is concatenation.

Instruction         Name                  Operation
LW R1, 30(R2)       Load Word             Reg[R1] ←32 Mem[30 + Reg[R2]]
SW 30(R2), R1       Store Word            Mem[30 + Reg[R2]] ←32 Reg[R1]
LB R1, 30(R2)       Load Byte             Reg[R1] ←32 (Mem[30 + Reg[R2]]_0)^24 ## Mem[30 + Reg[R2]]
SB 30(R2), R1       Store Byte            Mem[30 + Reg[R2]] ←8 Reg[R1]_24..31
LBU R1, 30(R2)      Load Byte unsigned    Reg[R1] ←32 0^24 ## Mem[30 + Reg[R2]]
LH R1, 30(R2)       Load Half Word        Reg[R1] ←32 (Mem[30 + Reg[R2]]_0)^16 ## Mem[30 + Reg[R2]]
LF F1, 30(R2)       Load Float            Reg[F1] ←32 Mem[30 + Reg[R2]]
SF 30(R2), F1       Store Float           Mem[30 + Reg[R2]] ←32 Reg[F1]
MOVF F3, F1         Move Float            Reg[F3] ←32 Reg[F1]
MOVD F2, F0         Move Double           Reg[F2],Reg[F3] ←64 Reg[F0],Reg[F1]
MOVFP2I R2, F2      FP to INT             Reg[R2] ←32 Reg[F2]
MOVI2FP F2, R2      INT to FP             Reg[F2] ←32 Reg[R2]
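The difference between LB and LBU is only in how the upper 24 bits of the register are filled. A minimal sketch in Python follows; the function name `load_byte` is hypothetical, and Python integers stand in for 32-bit registers.

```python
def load_byte(byte, signed=True):
    """Extend an 8-bit memory value to a 32-bit register value.

    signed=True models LB:  (byte_0)^24 ## byte  (sign extension)
    signed=False models LBU: 0^24 ## byte        (zero extension)
    """
    byte &= 0xFF
    if signed and byte & 0x80:      # bit 0 in DLX numbering is the byte's MSB
        return 0xFFFFFF00 | byte    # replicate the sign bit 24 times
    return byte                     # 24 zero bits in front

print(hex(load_byte(0x9C)))                # 0xffffff9c
print(hex(load_byte(0x9C, signed=False)))  # 0x9c
print(hex(load_byte(0x7F)))                # 0x7f
```

The same pattern with 16 replicated bits describes LH.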


Arithmetic/Logic Instructions

Instruction         Name                        Operation
ADD R1, R2, R3      Add                         Reg[R1] ← Reg[R2] + Reg[R3]
ADDI R1, R2, #3     Add Immediate               Reg[R1] ← Reg[R2] + 3
SUB R1, R2, R3      Sub                         Reg[R1] ← Reg[R2] - Reg[R3]
SUBI R1, R2, #3     Sub Immediate               Reg[R1] ← Reg[R2] - 3
MULT R1, R2, R3     Multiply                    Reg[R1] ← Reg[R2] * Reg[R3]
DIV R1, R2, R3      Divide                      Reg[R1] ← Reg[R2] ÷ Reg[R3]
AND R1, R2, R3      And                         Reg[R1] ← Reg[R2] AND Reg[R3]
ANDI R1, R2, #3     And Immediate               Reg[R1] ← Reg[R2] AND 3
OR R1, R2, R3       Or                          Reg[R1] ← Reg[R2] OR Reg[R3]
ORI R1, R2, #3      Or Immediate                Reg[R1] ← Reg[R2] OR 3
XOR R1, R2, R3      Exclusive Or                Reg[R1] ← Reg[R2] XOR Reg[R3]
XORI R1, R2, #3     Exclusive Or Immediate      Reg[R1] ← Reg[R2] XOR 3
LHI R1, #42         Load High                   Reg[R1] ← 42 ## 0^16
SLT R1, R2, R3      Set Less Than               if Reg[R2] < Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SGT R1, R2, R3      Set Greater Than            if Reg[R2] > Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SLE R1, R2, R3      Set Less Than or Equal      if Reg[R2] ≤ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SGE R1, R2, R3      Set Greater Than or Equal   if Reg[R2] ≥ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SEQ R1, R2, R3      Set Equal                   if Reg[R2] = Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0
SNE R1, R2, R3      Set Not Equal               if Reg[R2] ≠ Reg[R3] then Reg[R1] ← 1 else Reg[R1] ← 0


Floating Point Instructions

Instruction          Name               Operation
ADDF F1, F2, F3      Add Float          Reg[F1] ← Reg[F2] + Reg[F3]
ADDD F0, F2, F4      Add Double         (Reg[F0], Reg[F1]) ←64 (Reg[F2], Reg[F3]) + (Reg[F4], Reg[F5])
SUBF F1, F2, F3      Sub Float
SUBD F0, F2, F4      Sub Double
MULTF F1, F2, F3     Multiply Float
MULTD F0, F2, F4     Multiply Double
DIVF F1, F2, F3      Divide Float
DIVD F0, F2, F4      Divide Double

NOTE: Floating point numbers are represented as single or double precision numbers according to IEEE 754. The ALU functions for FP are not simple binary operations on the bits in the register.

Instruction     Name                         Operation
LTF F2, F3      Set Less Than                if Reg[F2] < Reg[F3] then StatFP ←1 1 else StatFP ←1 0
GTF F2, F3      Set Greater Than             if Reg[F2] > Reg[F3] then StatFP ←1 1 else StatFP ←1 0
LEF F2, F3      Set Less Than or Equal       if Reg[F2] ≤ Reg[F3] then StatFP ←1 1 else StatFP ←1 0
GEF F2, F3      Set Greater Than or Equal    if Reg[F2] ≥ Reg[F3] then StatFP ←1 1 else StatFP ←1 0
EQF F2, F3      Set Equal                    if Reg[F2] = Reg[F3] then StatFP ←1 1 else StatFP ←1 0
NEF F2, F3      Set Not Equal                if Reg[F2] ≠ Reg[F3] then StatFP ←1 1 else StatFP ←1 0

LTD, GTD, LED, GED, EQD, NED    Double precision comparisons


Control Instructions

Instruction          Name                      Operation
J offset             Jump                      PC ← PC + offset
                                               (-2^25 ≤ offset ≤ 2^25 - 1)
JAL offset           Jump and Link             Reg[R31] ← PC
                                               PC ← PC + offset
                                               (-2^25 ≤ offset ≤ 2^25 - 1)
JR R3                Jump Register             PC ← Reg[R3]
JALR R2, offset      Jump and Link Register    Reg[R2] ← PC
                                               PC ← PC + offset
                                               (-2^15 ≤ offset ≤ 2^15 - 1)
BEQZ R4, offset      Branch equal zero         if Reg[R4] == 0 then PC ← PC + offset
                                               (-2^15 ≤ offset ≤ 2^15 - 1)
BNEZ R4, offset      Branch not equal zero     if Reg[R4] != 0 then PC ← PC + offset
                                               (-2^15 ≤ offset ≤ 2^15 - 1)
TRAP N               Software interrupt        Details not specified in Hennessy and Patterson

Note: Register NPC is updated (NPC ← PC + 4) when the branch instruction is loaded.
Register PC is updated (PC ← NPC or PC ← NPC + offset) at the end of instruction execution.


Programming in DLX Assembly Language

C source:

for (i = 0; i < 256; i++) {
    a[i] = a[i] + b[i] - c[i] + d[i];
}

Memory layout (hex byte addresses):
    a[] = 000 - 3FF
    b[] = 400 - 7FF
    c[] = 800 - BFF
    d[] = C00 - FFF

DLX assembly (the loop counts down from the last element to the first; address comments show the first iteration):

ADDI R1, R0, #0x400    ; 256 integers = 1024 bytes = 400h bytes
LW   R2, -4(R1)        ; load word from a[] (400 - 4 = 3FC)
LW   R3, 0x3FC(R1)     ; load word from b[] (400 + 3FC = 7FC)
ADD  R4, R2, R3        ; add
LW   R2, 0x7FC(R1)     ; load word from c[] (400 + 7FC = BFC)
SUB  R4, R4, R2        ; sub
LW   R2, 0xBFC(R1)     ; load word from d[] (400 + BFC = FFC)
ADD  R4, R4, R2        ; add
SW   -4(R1), R4        ; store sum in a[]
SUBI R1, R1, #4        ; i--
BNEZ R1, -0x28         ; if R1 <> 0 jump back 10 instructions
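To check that the displacements in the assembly really address a[], b[], c[], and d[], the loop can be mirrored in Python. This is a sketch rather than a DLX simulator: memory is a dictionary keyed by byte address, 32-bit overflow is ignored, and the initial memory contents are made-up test data.

```python
WORD = 4

# Hypothetical test data: the word at byte address A initially holds A // 4.
mem = {addr: addr // WORD for addr in range(0x000, 0x1000, WORD)}

# Reference result, computed the way the C loop does it.
expected = [
    mem[4 * i] + mem[0x400 + 4 * i] - mem[0x800 + 4 * i] + mem[0xC00 + 4 * i]
    for i in range(256)
]

r1 = 0x400                    # ADDI R1, R0, #0x400
iterations = 0
while True:
    r2 = mem[r1 - 4]          # LW   R2, -4(R1)      a[]
    r3 = mem[r1 + 0x3FC]      # LW   R3, 0x3FC(R1)   b[]
    r4 = r2 + r3              # ADD  R4, R2, R3
    r2 = mem[r1 + 0x7FC]      # LW   R2, 0x7FC(R1)   c[]
    r4 = r4 - r2              # SUB  R4, R4, R2
    r2 = mem[r1 + 0xBFC]      # LW   R2, 0xBFC(R1)   d[]
    r4 = r4 + r2              # ADD  R4, R4, R2
    mem[r1 - 4] = r4          # SW   -4(R1), R4
    r1 -= 4                   # SUBI R1, R1, #4
    iterations += 1
    if r1 == 0:               # BNEZ R1, -0x28 falls through when R1 == 0
        break

result = [mem[4 * i] for i in range(256)]
print(iterations, result == expected)  # 256 True
```

Using R1 as both the loop counter and the base address saves one register and one instruction per iteration, at the cost of traversing the arrays backwards.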


Implementation

General approach
    No central system bus
    Base hardware organization on assembly line with uniform operations
    Separate memory for instructions and data

High level design
    Instructions move through 5 stages (left to right)
    First two stages identical for all instructions: FETCH and DECODE
    Last three stages operate according to instruction
        EXECUTE (ALU instructions and address calculations)
        MEMORY ACCESS (Load/Store instructions)
        WRITE BACK (register update for Load and ALU instructions)

[Figure: five-stage datapath, left to right: Instruction Fetch, Instruction Decode, Execute, Data Access, Write Back. Instruction Fetch sends an Address to Instruction Memory and receives an Instruction; Data Access exchanges Address and Data with Data Memory]


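RISC Performance

Compare VAX with MIPS 2000 (RISC CPU) on SPEC 89 results
    Same clock rate

$$ S = \frac{CPI_{VAX} \times IC_{VAX} \times \tau}{CPI_{MIPS} \times IC_{MIPS} \times \tau} \approx 6 \times \frac{1}{2} = 3 $$

where

$$ \frac{IC_{MIPS}}{IC_{VAX}} \approx 2, \qquad \frac{CPI_{MIPS}}{CPI_{VAX}} \approx \frac{1}{6} $$

Ref: Hennessy-Patterson Figure 2-30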


Instruction Formats

32-bit instructions (bits 0 to 31)

Three instruction formats
    J-type
        Jump (unconditional branch) instructions
        Specifies branch offset
    R-type
        Register-register ALU instructions
        Specifies destination register (rd), and two source registers (rs1, rs2)
    I-type
        All other instructions
        Specifies destination register (rd), immediate, and source register (rs)

Bits      0-5       6-10    11-15    16-20    21-31
Width     6         5       5        5        11
R-type    opcode    rs1     rs2      rd       function
I-type    opcode    rs      rd       immediate (bits 16-31, 16 bits)
J-type    opcode    offset (bits 6-31, 26 bits)


J-Type Instruction Format

Field     Opcode    Offset added to PC
Width     6         26

Encodes:
• Jump              PC ← PC + offset
• Jump and link     R31 ← PC
                    PC ← PC + offset
• Trap and return from exception
    Implementation unspecified in Hennessy and Patterson
    Two possible implementations for Offset field
        1. Lower 26 bits of physical address of Interrupt Service Routine
        2. Trap number = index to Interrupt Vector Table


R-Type Instruction

Field     Opcode    rs1    rs2    rd    function
Width     6         5      5      5     11

Encodes:
• Register-register ALU operations    rd ← rs1 function rs2

Function encodes the ALU operation: Add, Sub, ...


I-Type Instruction

Field     Opcode    rs    rd    Immediate
Width     6         5     5     16

Encodes:
• Loads                                      rd ← imm(rs)
• Stores                                     imm(rs) ← rd
• ALU operations with immediate operand      rd ← rs op immediate
• Conditional branch instructions            if rs eq/ne 0 then PC ← PC + imm (rd unused)
• Jump register                              PC ← rs
• Jump and link register                     rd ← PC
                                             PC ← PC + immediate
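Packing and unpacking the I-type fields is a matter of shifts and masks. The sketch below follows the slide's bit numbering, where bit 0 is the most significant bit, so the opcode occupies the top six bits of the word. The numeric opcode value is a placeholder chosen for illustration, since the slides give only the mnemonic; the function names are also hypothetical.

```python
def encode_i_type(opcode, rs, rd, imm):
    """Pack opcode (6 bits), rs (5), rd (5), immediate (16) into 32 bits."""
    assert 0 <= opcode < 64 and 0 <= rs < 32 and 0 <= rd < 32
    assert -(1 << 15) <= imm < (1 << 15)
    return (opcode << 26) | (rs << 21) | (rd << 16) | (imm & 0xFFFF)

def decode_i_type(word):
    """Unpack a 32-bit I-type word; the immediate is sign-extended."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:
        imm -= 1 << 16
    return opcode, rs, rd, imm

OPCODE_ADDI = 0b001000  # placeholder value, assumed for this example only

# addi R1, R2, #5  ->  rs = R2, rd = R1, immediate = 5
word = encode_i_type(OPCODE_ADDI, rs=2, rd=1, imm=5)
print(f"{word:032b}")
print(decode_i_type(word))
```

Because every field sits at a fixed position, the register numbers can be read from the word before the opcode has been fully decoded.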


Implementation Details


Execution Stages by Instruction Type

Stage 1 (all types): Fetch instruction from memory

Stage 2 (all types): Decode operation and operands

Stage 3
    ALU       Calculate ALU operation
    Store     Calculate memory address
    Load      Calculate memory address
    Branch    Calculate branch condition; calculate branch address

Stage 4
    ALU       (none)
    Store     Store data to memory; update PC
    Load      Load data from memory
    Branch    Update PC

Stage 5
    ALU       Write result to register; update PC
    Store     (none)
    Load      Write loaded data to register; update PC
    Branch    (none)


Temporary Registers for Implementation

Register    Name                    Purpose
IR          Instruction Register    Holds fetched instruction during execution
PC          Program Counter         Memory address of next instruction
NPC         Next Program Counter    Temporary update of PC (points to fall-through instruction)
A, B, I     Operand buffers         Values read from data registers according to instruction
ALUout      ALU output              Result of ALU operation
LMD         Load Memory Data        Data loaded from memory
Cond        Condition flag          Result of test for conditional branch


Example Type-I ALU Instruction

Instruction    addi R1, R2, #5
Operation      Reg[R1] ← Reg[R2] + 5

Encoding
    Bits     0-5     6-10     11-15    16-31
    Field    op      rs       rd       immediate
    Value    addi    00010    00001    0000 0000 0000 0101

Hardware Stage 1
    IR ← Mem[PC]
    NPC ← PC + 4

Hardware Stage 2
    A ← Reg[IR_6-10]              /* A ← Reg[R2] */
    B ← Reg[IR_11-15]             /* B ← Reg[R1] */
    I ← (IR_16)^16 ## IR_16-31

Hardware Stage 3
    ALUout ← A + I

Hardware Stage 4
    (no operation)

Hardware Stage 5
    Reg[IR_11-15] ← ALUout        /* Reg[R1] ← A + I */
    PC ← NPC
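The five register-transfer steps for addi can be followed in a short Python sketch. It is not a pipeline model: the stages run one after another for a single instruction, the instruction word is passed in already fetched, and the opcode value is a placeholder not taken from the slides. The special case Reg[R0] = 0 is not enforced here.

```python
def sign_extend16(value):
    """I <- (IR_16)^16 ## IR_16-31: sign-extend the low 16 bits."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def execute_addi(ir, regs, pc):
    """Run one I-type ALU instruction through stages 1 to 5; return new PC."""
    npc = pc + 4                          # Stage 1: NPC <- PC + 4 (IR given)
    a = regs[(ir >> 21) & 0x1F]           # Stage 2: A <- Reg[IR_6-10]
    b = regs[(ir >> 16) & 0x1F]           #          B <- Reg[IR_11-15] (unused)
    i = sign_extend16(ir)                 #          I <- sign-extended immediate
    alu_out = (a + i) & 0xFFFFFFFF        # Stage 3: ALUout <- A + I (32-bit wrap)
    # Stage 4: no memory access for an ALU instruction
    regs[(ir >> 16) & 0x1F] = alu_out     # Stage 5: Reg[IR_11-15] <- ALUout
    return npc                            #          PC <- NPC

OPCODE_ADDI = 0b001000                    # placeholder opcode, assumed
ir = (OPCODE_ADDI << 26) | (2 << 21) | (1 << 16) | 5   # addi R1, R2, #5

regs = [0] * 32
regs[2] = 10
pc = execute_addi(ir, regs, pc=0x100)
print(regs[1], hex(pc))  # 15 0x104
```

B is read in stage 2 even though an I-type ALU instruction never uses it; reading both register fields unconditionally is what lets stage 2 be identical for all instruction types.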


Example Type-R ALU Instruction

Instruction    add R1, R2, R3
Operation      Reg[R1] ← Reg[R2] + Reg[R3]

Encoding
    Bits     0-5    6-10     11-15    16-20    21-31
    Field    op     rs1      rs2      rd       funct
    Value    R-R    00010    00011    00001    add

Hardware Stage 1
    IR ← Mem[PC]
    NPC ← PC + 4

Hardware Stage 2
    A ← Reg[IR_6-10]              /* A ← Reg[R2] */
    B ← Reg[IR_11-15]             /* B ← Reg[R3] */
    I ← (IR_16)^16 ## IR_16-31

Hardware Stage 3
    ALUout ← A + B

Hardware Stage 4
    (no operation)

Hardware Stage 5
    Reg[IR_16-20] ← ALUout        /* Reg[R1] ← A + B */
    PC ← NPC


Example Type-I Store Instruction

Instruction    SW 32(R1), R2
Operation      Mem[32+Reg[R1]] ← Reg[R2]

Encoding
    Bits     0-5    6-10     11-15    16-31
    Field    op     rs       rd       immediate
    Value    SW     00001    00010    0000 0000 0010 0000

Hardware Stage 1
    IR ← Mem[PC]
    NPC ← PC + 4

Hardware Stage 2
    A ← Reg[IR_6-10]              /* A ← Reg[R1] */
    B ← Reg[IR_11-15]             /* B ← Reg[R2] */
    I ← (IR_16)^16 ## IR_16-31

Hardware Stage 3
    ALUout ← A + I

Hardware Stage 4
    Mem[ALUout] ← B               /* Mem[A+I] ← Reg[R2] */
    PC ← NPC

Hardware Stage 5
    (no operation)


Example Type‐I Load Instruction

Instruction: LW R2, 32(R1)

Operation: Reg[R2] ← Mem[32+Reg[R1]]

Encoding
  Bits:   0-5    6-10    11-15   16-31
  Field:  op     rs      rd      immediate
  Value:  LW     00001   00010   0000 0000 0010 0000

Hardware Stage 1
  IR ← Mem[PC]
  NPC ← PC + 4

Hardware Stage 2
  A ← Reg[IR6-10]    /* A ← Reg[R1] */
  B ← Reg[IR11-15]   /* B ← Reg[R2] */
  I ← (IR16)^16 ## IR16-31

Hardware Stage 3
  ALUout ← A + I

Hardware Stage 4
  LMD ← Mem[ALUout]   /* LMD ← Mem[A+I] */

Hardware Stage 5
  Reg[IR11-15] ← LMD   /* Reg[R2] ← Mem[A+I] */
  PC ← NPC


Example Type‐I Conditional Branch Instruction

Instruction: beqz R1, 1024

Operation: if (Reg[R1] == 0) PC ← NPC + 1024 else PC ← NPC

Encoding
  Bits:   0-5    6-10    11-15   16-31
  Field:  op     rs      rd      immediate
  Value:  beqz   00001   00000   0000 0100 0000 0000

Hardware Stage 1
  IR ← Mem[PC]
  NPC ← PC + 4

Hardware Stage 2
  A ← Reg[IR6-10]    /* A ← Reg[R1] */
  B ← Reg[IR11-15]   /* B ← Reg[R0] */
  I ← (IR16)^16 ## IR16-31

Hardware Stage 3
  ALUout ← NPC + I
  if (A == 0) cond = 1 else cond = 0

Hardware Stage 4
  if (cond == 1) PC ← ALUout else PC ← NPC

Hardware Stage 5
  (none)
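The five register-transfer examples (addi, add, sw, lw, beqz) can be walked through with a small interpreter. What follows is a minimal sketch under stated assumptions, not the course's simulator: the tuple layout `(kind, field1, field2, field3)`, the `execute` function and the dictionary-based memory are all hypothetical, values are plain Python integers rather than 32-bit words, and instructions arrive already decoded.

```python
KINDS = {"addi", "add", "sw", "lw", "beqz"}


def execute(state, instr):
    """Apply one instruction to state = {'pc': int, 'reg': list, 'mem': dict}."""
    kind, f1, f2, f3 = instr
    if kind not in KINDS:
        raise ValueError(f"unsupported instruction kind: {kind}")
    reg, mem = state["reg"], state["mem"]

    # Stage 1: IR ← Mem[PC] (instr is passed in already fetched), NPC ← PC + 4
    npc = state["pc"] + 4

    # Stage 2: A ← Reg[IR6-10], B ← Reg[IR11-15]
    a, b = reg[f1], reg[f2]

    # Stage 3: ALU result and branch condition
    if kind == "add":
        alu_out = a + b          # f3 is the destination register number
    elif kind == "beqz":
        alu_out = npc + f3       # f3 is the branch offset I
    else:
        alu_out = a + f3         # addi, lw, sw: f3 is the immediate I
    cond = 1 if a == 0 else 0

    # Stage 4: data memory access, or PC selection for a branch
    next_pc = npc
    lmd = None
    if kind == "sw":
        mem[alu_out] = b
    elif kind == "lw":
        lmd = mem.get(alu_out, 0)
    elif kind == "beqz" and cond == 1:
        next_pc = alu_out

    # Stage 5: register write back
    if kind == "addi":
        reg[f2] = alu_out
    elif kind == "add":
        reg[f3] = alu_out
    elif kind == "lw":
        reg[f2] = lmd

    state["pc"] = next_pc


state = {"pc": 0, "reg": [0] * 32, "mem": {}}
state["reg"][2] = 7
program = [
    ("addi", 2, 1, 5),      # addi R1, R2, #5
    ("add", 1, 2, 3),       # add  R3, R1, R2
    ("sw", 1, 2, 32),       # sw   32(R1), R2
    ("lw", 1, 4, 32),       # lw   R4, 32(R1)
    ("beqz", 0, 0, 1024),   # beqz R0, 1024
]
for instr in program:
    execute(state, instr)
print(state["pc"], state["reg"][:5], state["mem"])
```

Every instruction passes through the same five steps; the instruction kind only selects which transfer happens in stages 3 to 5, which mirrors how the slides reuse stages 1 and 2 unchanged across all five examples.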


DLX Hardware Drawing — Version 1

mux (multiplexer) — chooses 1 output from N inputs


Type-I ALU Instruction: steps 1 to 4

[Figure sequence: DLX version 1 datapath highlighted step by step for addi r1, r2, #5 (regs[r1] ← regs[r2] + 5)
  Step 1: PC addresses memory; mem[PC] and PC + 4 are produced
  Step 2: Reg[IR6-10], Reg[IR11-15] and the immediate field IR16-31 are read; NPC is held
  Step 3: the ALU takes A and I and produces A+I; cond and NPC are available
  Step 4: A+I is written to Reg[IR11-15] and NPC becomes the new PC]


Type-R ALU Instruction: steps 1 to 4

[Figure sequence: DLX version 1 datapath highlighted step by step for add r1, r2, r3 (regs[r1] ← regs[r2] + regs[r3])
  Step 1: PC addresses memory; mem[PC] and PC + 4 are produced
  Step 2: Reg[IR6-10], Reg[IR11-15] and the immediate field IR16-31 are read; NPC is held
  Step 3: the ALU takes A and B and produces A+B; cond and NPC are available
  Step 4: A+B is written to Reg[IR16-20] and NPC becomes the new PC]


Type-I Store Instruction: steps 1 to 4

[Figure sequence: DLX version 1 datapath highlighted step by step for sw 32(r1), r2 (mem[32 + regs[r1]] ← regs[r2])
  Step 1: PC addresses memory; mem[PC] and PC + 4 are produced
  Step 2: Reg[IR6-10], Reg[IR11-15] and the immediate field IR16-31 are read; NPC is held
  Step 3: the ALU takes A and I and produces A+I; B is carried forward; cond and NPC are available
  Step 4: A+I addresses data memory, B is written there, and NPC becomes the new PC]


Type-I Load Instruction: steps 1 to 5

[Figure sequence: DLX version 1 datapath highlighted step by step for lw r2, 32(r1) (regs[r2] ← mem[32 + regs[r1]])
  Step 1: PC addresses memory; mem[PC] and PC + 4 are produced
  Step 2: Reg[IR6-10], Reg[IR11-15] and the immediate field IR16-31 are read; NPC is held
  Step 3: the ALU takes A and I and produces A+I; cond and NPC are available
  Step 4: A+I addresses data memory and mem[A+I] is read
  Step 5: mem[A+I] is written to Reg[IR11-15] and NPC becomes the new PC]


Type-I Branch Instruction: steps 1 to 4

[Figure sequence: DLX version 1 datapath highlighted step by step for beqz r1, 1024 (if (regs[r1] == 0) PC ← NPC + I else PC ← NPC)
  Step 1: PC addresses memory; mem[PC] and PC + 4 are produced
  Step 2: Reg[IR6-10], Reg[IR11-15] and the immediate field IR16-31 are read; NPC is held
  Step 3: the ALU takes NPC and I and produces NPC+I; A is tested to set cond
  Step 4: the new PC is NPC or NPC+I, selected by cond]


Performance

Instruction distribution for version 1 based on compilation of SPEC 92

  Type i  | IC_i / IC | CPI_i
  ALU     | 40%       | 4
  Load    | 25%       | 5
  Store   | 15%       | 4
  Branch  | 20%       | 4

$$CPI = \sum_i CPI_i \times \frac{IC_i}{IC} = 4 \times 0.40 + 5 \times 0.25 + 4 \times 0.15 + 4 \times 0.20 = 4.25$$
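The weighted average can be checked numerically. The sketch below only restates the table's instruction mix; the dictionary layout and the function name `average_cpi` are illustrative choices, not part of the slides.

```python
# Instruction mix for DLX version 1 from the table: (fraction of IC, CPI of that type)
MIX = {
    "ALU": (0.40, 4),
    "Load": (0.25, 5),
    "Store": (0.15, 4),
    "Branch": (0.20, 4),
}


def average_cpi(mix):
    """CPI = sum over types of CPI_i * (IC_i / IC)."""
    total_fraction = sum(fraction for fraction, _ in mix.values())
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix.values())


cpi_v1 = average_cpi(MIX)
print(f"CPI for version 1: {cpi_v1:.2f}")
```

Only the Load row differs from 4 cycles, so the result sits a quarter of a cycle above 4: each load contributes one extra cycle and loads are a quarter of the mix.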


Speeding Up DLX


DLX Execution Stages: Version 1

Clock Cycle 1
  I1 enters Instruction Fetch (IF)

Clock Cycle 2
  I1 moves to Instruction Decode (ID)
  Instruction Fetch (IF) holds state fixed

Clock Cycle 3
  I1 moves to Execute (EX)
  Instruction Fetch (IF) holds state fixed
  Instruction Decode (ID) holds state fixed

Clock Cycle 4
  I1 moves to Memory Access (MEM)
  Instruction Fetch (IF) holds state fixed
  Instruction Decode (ID) holds state fixed
  Execute (EX) holds state fixed

Clock Cycle 5
  I1 performs Write Back (WB) using instruction (IR) stored in IF stage
  PC updated and stages IF, ID, EX, MEM are reset


Room for Improvement

DLX based on assembly line
  No central system bus
  Instructions move from execution stage to execution stage

Assembly line permits pipelining
  In each stage, new work begins when old work passes to next stage

[Figure: the five stages in clock cycle order CC1 to CC5: Instruction Fetch (sends Address to Instruction Memory, receives Instruction), Instruction Decode, Execute, Data Access (sends Address to Data Memory, exchanges Data), Write Back]


DLX — Version 2

CC 1
  I1 enters Instruction Fetch (IF)

CC 2
  I1 and its execution state move to Instruction Decode (ID)
  I2 enters Instruction Fetch (IF)

CC 3
  I1 and its execution state move to Execute (EX)
  I2 and its execution state move to Instruction Decode (ID)
  I3 enters Instruction Fetch (IF)

CC 4
  I1 and its execution state move to Memory Access (MEM)
  I2 and its execution state move to Execute (EX)
  I3 and its execution state move to Instruction Decode (ID)
  I4 enters Instruction Fetch (IF)

CC 5
  I1 moves to Write Back (WB)
  I2 and its execution state move to Memory Access (MEM)
  I3 and its execution state move to Execute (EX)
  I4 and its execution state move to Instruction Decode (ID)
  I5 enters Instruction Fetch (IF)


Ideal Instruction Pipelining — Processor View

In any clock cycle (after CC 4)
  5 instructions are being processed at one time
  Each instruction in a different stage of execution

clock cycle   IF   ID   EX   MEM  WB
1             I1
2             I2   I1
3             I3   I2   I1
4             I4   I3   I2   I1
5             I5   I4   I3   I2   I1
6             I6   I5   I4   I3   I2
7             I7   I6   I5   I4   I3
8             I8   I7   I6   I5   I4
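The processor view follows from one rule: with no stalls, instruction number k enters IF in clock cycle k and advances one stage per cycle. A short sketch that regenerates the table from that rule (function names are illustrative):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_occupancy(cycle, n_instructions):
    """Return {stage: instruction number} for one clock cycle of an ideal pipeline."""
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        number = cycle - depth          # instruction k enters IF in cycle k
        if 1 <= number <= n_instructions:
            occupancy[stage] = number
    return occupancy


def format_table(n_cycles, n_instructions):
    """Build the stage/clock-cycle table as text, one row per clock cycle."""
    lines = ["CC  " + " ".join(f"{stage:<4}" for stage in STAGES)]
    for cycle in range(1, n_cycles + 1):
        occupancy = stage_occupancy(cycle, n_instructions)
        cells = [f"I{occupancy[s]}" if s in occupancy else "" for s in STAGES]
        lines.append(f"{cycle:<3} " + " ".join(f"{cell:<4}" for cell in cells))
    return "\n".join(lines)


print(format_table(n_cycles=8, n_instructions=8))
```

From cycle 5 onward every stage is occupied, which is the "5 instructions at one time" condition stated on the slide.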


Ideal Instruction Pipelining — Instruction View

clock cycle   1    2    3    4    5    6    7    8
I1            IF   ID   EX   MEM  WB
I2                 IF   ID   EX   MEM  WB
I3                      IF   ID   EX   MEM  WB
I4                           IF   ID   EX   MEM  WB
I5                                IF   ID   EX   MEM
I6                                     IF   ID   EX
I7                                          IF   ID
I8                                               IF


Average CPI for DLX Pipeline

From diagram
  I1 finishes after N = 5 clock cycles
  I2 finishes after N = 6 clock cycles
  I3 finishes after N = 7 clock cycles

Generally
  IC instructions are finished after N = IC + 4 clock cycles

$$CPI = \frac{\text{clock cycles}}{\text{finished instructions}} = \frac{IC + 4}{IC} = 1 + \frac{4}{IC} \xrightarrow{\; IC \gg 4 \;} 1$$

On average
  One instruction completes on every clock cycle
  CPI is 1 clock cycle per instruction for DLX pipeline

Limitation
  Dependencies between instructions cause waiting conditions
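A quick numerical look at N = IC + 4 shows how fast the average approaches 1. The sketch assumes the ideal five-stage pipeline with no stalls; the names are illustrative.

```python
PIPELINE_DEPTH = 5  # IF, ID, EX, MEM, WB


def total_cycles(instruction_count):
    """Clock cycles until the last of IC instructions leaves WB: N = IC + 4."""
    return instruction_count + (PIPELINE_DEPTH - 1)


def pipeline_cpi(instruction_count):
    """Average CPI = N / IC = 1 + 4 / IC."""
    if instruction_count < 1:
        raise ValueError("need at least one instruction")
    return total_cycles(instruction_count) / instruction_count


for ic in (1, 3, 100, 10_000):
    print(f"IC={ic:>6}  N={total_cycles(ic):>6}  CPI={pipeline_cpi(ic):.4f}")
```

The 4 extra cycles are the one-time cost of filling the pipeline, so their share of the total shrinks as the program gets longer.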


Pipelining: Functional Requirements

Each stage receives a new instruction on every clock cycle
  Cannot hold partial results for all instructions
  Must pass along all intermediate results for every instruction

Example
  IF stage
    Loads instruction to IR
    Finds NPC for next instruction
    Passes IR and NPC (intermediate results) to ID stage
  ID stage
    Stores received IR and NPC for incoming instruction
    Decodes IR to A, B, and I
    Passes IR, NPC, A, B, and I to EX stage

Stage buffers
  Collection of D flip-flops (edge-triggered latches)
  Store intermediate results of each stage at end of clock cycle


Review: Synchronous Transfer

D flip-flop (edge-triggered latch)
  Input D
    Output of some digital system
  Output Q
    Changes only on falling CLK edge
  Trigger: 1-to-0 CLK transition

Clock Cycle N
  CC N begins on CLK N-1
    Input D can change
    No effect on latch
  CC N ends on CLK N
    Latch samples input D
    Stores instantaneous input value
    Forwards stored value to output Q

[Figure: timing diagram of D, CLK and Q over CC N, between edges CLK N-1 and CLK N; a register drawn as n D flip-flops with inputs D0, D1, ..., Dn-1, outputs Q0, Q1, ..., Qn-1, Pr and Cr pins, and a shared CLK]


Stage Buffers

5 execution stages built from
  Combinational logic: output = function (present input)
  Asynchronous memory: output = function (present input, past input)

4 stage buffers (edge-triggered latches) and PC built from
  Synchronous sequential logic
    output = function (present input, past input, external clock)
    Store and forward input on falling edge of CLK

Described as data structure using C notation
  IF/ID: IF/ID.NPC, IF/ID.IR
  ID/EX: ID/EX.NPC, ID/EX.A, ID/EX.B, ID/EX.I, ID/EX.IR
  EX/MEM: EX/MEM.cond, EX/MEM.ALU, EX/MEM.B, EX/MEM.IR
  MEM/WB: MEM/WB.ALU, MEM/WB.LMD, MEM/WB.IR

[Figure: PC, then IF Logic, IF/ID, ID Logic, ID/EX, EX Logic, EX/MEM, MEM Logic, MEM/WB, WB Logic in a chain; PC and all four stage buffers are driven by CLK]
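The "data structure" view of the stage buffers can be made concrete. The slides use C notation; the sketch below uses Python dataclasses instead, purely for illustration. The class names, the `falling_edge` helper and the sample register contents are assumptions, while the field names follow the buffer labels in the drawing.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class IFID:
    NPC: int = 0
    IR: int = 0


@dataclass
class IDEX:
    NPC: int = 0
    A: int = 0
    B: int = 0
    I: int = 0
    IR: int = 0


@dataclass
class EXMEM:
    cond: int = 0
    ALU: int = 0
    B: int = 0
    IR: int = 0


@dataclass
class MEMWB:
    ALU: int = 0
    LMD: int = 0
    IR: int = 0


@dataclass
class StageBuffers:
    if_id: IFID = field(default_factory=IFID)
    id_ex: IDEX = field(default_factory=IDEX)
    ex_mem: EXMEM = field(default_factory=EXMEM)
    mem_wb: MEMWB = field(default_factory=MEMWB)


def falling_edge(seen: StageBuffers) -> StageBuffers:
    """All buffers store, at the same instant, the inputs they 'saw' during the cycle."""
    return copy.deepcopy(seen)


current = StageBuffers(if_id=IFID(NPC=0x04, IR=0xAAAA))   # hypothetical contents
# During the cycle each buffer "sees" inputs computed from the current outputs.
seen = StageBuffers(
    if_id=IFID(NPC=0x08, IR=0xBBBB),                         # next fetched instruction
    id_ex=IDEX(NPC=current.if_id.NPC, IR=current.if_id.IR),  # old IF/ID contents move on
)
current = falling_edge(seen)
print(current)
```

Computing every input from the old buffer contents and only then replacing all buffers together is what keeps an instruction from racing through more than one stage per clock cycle.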


DLX Drawing — version 2

DLXv2


Formal Specification of Version 2

Instruction Fetch (IF)
  PC ← NPC
    New PC for new instruction fetch in every clock cycle
  IF/ID.IR ← Mem[PC]
  IF/ID.NPC ← PC + 4 (no branch)
  IF/ID.NPC ← ALU_OUT (branch taken, special case)

Instruction Decode (ID)
  ID/EX.NPC ← IF/ID.NPC
  ID/EX.A ← Reg[IF/ID.IR6-10]
  ID/EX.B ← Reg[IF/ID.IR11-15]
  ID/EX.I ← (IF/ID.IR16)^16 ## IF/ID.IR16-31
  ID/EX.IR ← IF/ID.IR

Stage Buffers (←)
  "See" inputs during clock cycle
  Sample and store inputs on falling CLK at end of clock cycle

Type   0-5   6-10   11-15   16-20   21-31
R      op    rs1    rs2     rd      function
I      op    rs     rd      immediate (16-31)


Formal Specification of Version 2

Execute (EX)
  EX/MEM.cond ← (ID/EX.A == 0)
  EX/MEM.ALU_OUT ← ID/EX.A function ID/EX.B (R-ALU)
  EX/MEM.ALU_OUT ← ID/EX.A op ID/EX.I (I-ALU, Memory)
  EX/MEM.ALU_OUT ← ID/EX.NPC + ID/EX.I (Branch)
  EX/MEM.B ← ID/EX.B
  EX/MEM.IR ← ID/EX.IR

Memory (MEM)
  MEM/WB.ALU_OUT ← EX/MEM.ALU_OUT
  MEM/WB.LMD ← Mem[EX/MEM.ALU_OUT] (Load)
  Mem[EX/MEM.ALU_OUT] ← EX/MEM.B (Store)
  MEM/WB.IR ← EX/MEM.IR

Write Back (WB)
  Reg[MEM/WB.IR11-15] ← MEM/WB.ALU_OUT (I-ALU)
  Reg[MEM/WB.IR11-15] ← MEM/WB.LMD (Load)
  Reg[MEM/WB.IR16-20] ← MEM/WB.ALU_OUT (R-ALU)

Type   0-5   6-10   11-15   16-20   21-31
R      op    rs1    rs2     rd      function
I      op    rs     rd      immediate (16-31)


Instruction Transfer Timing

[Figure: PC and the stage buffer chain IF/ID, ID/EX, EX/MEM, MEM/WB with their logic blocks, clocked by CLK, with IR1 shown passing from one buffer's IR field to the next]

CLK 0: CC 1 begins
  Memory ← PC(I1)
  IF/ID.IR "sees" Mem[PC(I1)]

CLK 1: CC 2 begins
  IF/ID.IR ← Mem[PC(I1)]
  Memory ← PC(I2)
  ID/EX.IR "sees" Mem[PC(I1)]
  IF/ID.IR "sees" Mem[PC(I2)]

CLK 2: CC 3 begins
  ID/EX.IR ← Mem[PC(I1)]
  IF/ID.IR ← Mem[PC(I2)]
  Memory ← PC(I3)
  EX/MEM.IR "sees" Mem[PC(I1)]
  ID/EX.IR "sees" Mem[PC(I2)]
  IF/ID.IR "sees" Mem[PC(I3)]

CLK 3: CC 4 begins
  EX/MEM.IR ← Mem[PC(I1)]
  ...
  MEM/WB.IR "sees" Mem[PC(I1)]
  ...

CLK 4: CC 5 begins
  MEM/WB.IR ← Mem[PC(I1)]
  Mem[PC(I1)] controls Write Back


Simple 5‐Instruction Program for DLX

Instruction Number   Address   Instruction
I1                   00        ADDI R1, R2, #5
I2                   04        ADD R3, R4, R5
I3                   08        SW 32(R6), R7
I4                   0C        LW R8, 32(R9)
I5                   10        AND R10, R12, R13


Program Execution Table

Columns: IF, ID, EX, MEM, WB. Row CC1 latches on CLK1; row CC2 latches on CLK2.

CC1
  IF:  ADDI R1, R2, #5: IF/ID.IR ← Mem[00]; IF/ID.NPC ← 04

CC2
  IF:  ADD R3, R4, R5: IF/ID.IR ← Mem[04]; IF/ID.NPC ← 08
  ID:  ID/EX.NPC ← 04; ID/EX.A ← R2; ID/EX.B ← R1; ID/EX.I ← 5; ID/EX.IR ← ADDI R1, R2, #5

CC3
  IF:  SW 32(R6), R7: IF/ID.IR ← Mem[08]; IF/ID.NPC ← 0C
  ID:  ID/EX.NPC ← 08; ID/EX.A ← R4; ID/EX.B ← R5; ID/EX.I ← ???; ID/EX.IR ← ADD R3, R4, R5
  EX:  EX/MEM.cond ← (R2 == 0); EX/MEM.ALU ← R2 + 5; EX/MEM.B ← R1; EX/MEM.IR ← ADDI R1, R2, #5

CC4
  IF:  LW R8, 32(R9): IF/ID.IR ← Mem[0C]; IF/ID.NPC ← 10
  ID:  ID/EX.NPC ← 0C; ID/EX.A ← R6; ID/EX.B ← R7; ID/EX.I ← 32; ID/EX.IR ← SW 32(R6), R7
  EX:  EX/MEM.cond ← (R4 == 0); EX/MEM.ALU ← R4 + R5; EX/MEM.B ← R5; EX/MEM.IR ← ADD R3, R4, R5
  MEM: MEM/WB.ALU ← R2 + 5; MEM/WB.IR ← ADDI R1, R2, #5

CC5
  IF:  AND R10, R12, R13: IF/ID.IR ← Mem[10]; IF/ID.NPC ← 14
  ID:  ID/EX.NPC ← 10; ID/EX.A ← R9; ID/EX.B ← R8; ID/EX.I ← 32; ID/EX.IR ← LW R8, 32(R9)
  EX:  EX/MEM.cond ← (R6 == 0); EX/MEM.ALU ← R6 + 32; EX/MEM.B ← R7; EX/MEM.IR ← SW 32(R6), R7
  MEM: MEM/WB.ALU ← R4 + R5; MEM/WB.IR ← ADD R3, R4, R5
  WB:  R1 ← R2 + 5

CC6
  ID:  ID/EX.NPC ← 14; ID/EX.A ← R12; ID/EX.B ← R13; ID/EX.I ← ???; ID/EX.IR ← AND R10, R12, R13
  EX:  EX/MEM.cond ← (R9 == 0); EX/MEM.ALU ← R9 + 32; EX/MEM.B ← R8; EX/MEM.IR ← LW R8, 32(R9)
  MEM: Mem[R6 + 32] ← R7; MEM/WB.ALU ← R6 + 32; MEM/WB.IR ← SW 32(R6), R7
  WB:  R3 ← R4 + R5

CC7
  EX:  EX/MEM.cond ← (R12 == 0); EX/MEM.ALU ← R12 AND R13; EX/MEM.B ← R13; EX/MEM.IR ← AND R10, R12, R13
  MEM: MEM/WB.LMD ← Mem[R9 + 32]; MEM/WB.ALU ← R9 + 32; MEM/WB.IR ← LW R8, 32(R9)

CC8
  MEM: MEM/WB.ALU ← R12 AND R13; MEM/WB.IR ← AND R10, R12, R13
  WB:  R8 ← Mem[R9 + 32]

CC9
  WB:  R10 ← R12 AND R13


First Clock Cycles

After CLK 0
  Memory ← PC = 00 ⇒ IF/ID.IR "sees" Mem[00] and IF/ID.NPC "sees" 04 as inputs

After CLK 1
  Memory ← PC = 04 ⇒ IF/ID.IR "sees" Mem[04] and IF/ID.NPC "sees" 08 as inputs
  IF/ID.IR latches Mem[00] and ID/EX.IR "sees" IF/ID.IR (ADDI R1, R2, #5) as input
  Registers "see" IF/ID.IR and ID/EX.A, B, I "see" R2, R1, 5 as inputs

[Table excerpt: rows CC1 to CC4 and columns IF, ID, EX of the Program Execution Table]


Processor State Just Before CLK 4

Input and Output Data at Stage Buffers in CC 4

DLXv2


Processor State Just After CLK 4

Input and Output Data at Stage Buffers in CC 5

DLXv2


New Technology, New Headaches

Analysis of Pipeline Hazards


Instruction Dependencies: Definitions

Instruction dependencies
  Result of one instruction needed to execute later instruction

Hazard
  Processor runs smoothly but provides wrong answers

Pipeline hazard
  Several instructions in various stages of execution
  Pipeline uses a resource value before update by earlier instruction
  Example
    PC ← NPC on each clock cycle
    Branch instruction requires PC ← NPC + I
    Correct evaluation of NPC + I not available on next clock cycle

Hazard Types
  Structural Hazard: conflict over access to resource
  Data Hazard: instruction result not ready when needed
  Control Hazard: branch address not ready when needed


Dealing with Hazards

Avoid error
  Pause pipeline and wait for resource to be available
  Called wait state or pipeline stall
  Degrades processor performance
    Adds stall clock cycles to instruction execution

Eliminate cause of stall
  Improve implementation based on analysis of stalls
  Main activity of hardware architects

$$CPI = \frac{\text{processing clock cycles (ideal)} + \text{stalled clock cycles}}{\text{completed instructions}} = \frac{N_{ideal} + N_{stall}}{IC} = CPI_{ideal} + CPI_{stall} \xrightarrow{\; IC \to \text{large, on DLX} \;} 1 + CPI_{stall}$$

$$\text{performance degradation} = 1 - \frac{CPI_{ideal}}{CPI_{ideal} + CPI_{stall}} = \frac{CPI_{stall}}{1 + CPI_{stall}}$$
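The two expressions above are easy to tabulate. In this sketch the function names and the sample stall values are illustrative; the only fact taken from the slides is that the ideal DLX pipeline has a CPI of 1.

```python
CPI_IDEAL = 1.0  # ideal DLX pipeline: one instruction completes per clock cycle


def total_cpi(cpi_ideal, cpi_stall):
    """CPI = CPI_ideal + CPI_stall."""
    return cpi_ideal + cpi_stall


def performance_degradation(cpi_ideal, cpi_stall):
    """Fraction of performance lost to stalls: 1 - CPI_ideal / (CPI_ideal + CPI_stall)."""
    return 1.0 - cpi_ideal / (cpi_ideal + cpi_stall)


for cpi_stall in (0.0, 0.1, 0.4, 1.0):   # illustrative stall rates
    cpi = total_cpi(CPI_IDEAL, cpi_stall)
    loss = performance_degradation(CPI_IDEAL, cpi_stall)
    print(f"CPI_stall={cpi_stall:.1f}  CPI={cpi:.1f}  degradation={loss:.1%}")
```

With CPI_ideal = 1 the degradation reduces to CPI_stall / (1 + CPI_stall), so one stall cycle per instruction on average costs half the ideal throughput.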


Structural Hazards

Conflict over access to resource
  No structural hazards in DLX

Typical structural hazard: unified cache hazard
  Instructions and data in same memory device
  Cannot access data and fetch instruction on same clock cycle
  Instruction fetch waits 1 clock cycle for every data memory access
    Loads and Stores

[Figure: the five stages in clock cycle order CC1 to CC5 (Instruction Fetch, Instruction Decode, Execute, Data Access, Write Back), with a single Instruction and Data Memory serving both the fetch path (Address, Instruction) and the data path (Address, Data). Figure note: no DLX version implemented with unified cache]


Stall on Cache Hazard

On CC5 Load Word (LW) instruction blocks Instruction Fetch (IF)
  No instruction is fetched on CC5
  No instruction (NOP) is forwarded to ID on CC6
  NOP = bubble = φ forwarded to EX on CC7, etc.

      IF   ID   EX   MEM  WB
CC1   I1
CC2   LW   I1
CC3   I2   LW   I1
CC4   I3   I2   LW   I1
CC5   φ    I3   I2   LW   I1
CC6   I4   φ    I3   I2   LW
CC7        I4   φ    I3   I2
CC8             I4   φ    I3
CC9                  I4   φ
CC10                      I4

Note: no DLX version implemented with unified cache


Effect of Cache Hazard on CPI

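$$
\begin{aligned}
CPI_{stall} &= \frac{\text{stall cycles}}{\text{instruction}} = \frac{\text{stalls}}{\text{instruction}} \times \frac{\text{stall cycles}}{\text{stall}} \\
&= \sum_{i = \text{type}} \frac{\text{stalls of type } i}{\text{instruction}} \times \frac{\text{stall cycles}}{\text{stall of type } i} \\
&= \sum_{i,j} \frac{\text{instructions of type } j}{\text{instructions}} \times \frac{\text{stalls of type } i}{\text{instruction of type } j} \times \frac{\text{stall cycles}}{\text{stall of type } i} \\
&= \sum_{i} \frac{IC_i}{IC} \times \frac{\text{stalls}}{\text{instruction of type } i} \times \frac{\text{stall cycles}}{\text{stall}} \qquad \text{(instruction } j \text{ only causes stall type } j\text{)}
\end{aligned}
$$

$$
\begin{aligned}
CPI_{stall}^{cache} &= \frac{IC_{load}}{IC} \times \frac{1 \text{ stall}}{\text{load}} \times \frac{1 \text{ stall cycle}}{\text{stall}} + \frac{IC_{store}}{IC} \times \frac{1 \text{ stall}}{\text{store}} \times \frac{1 \text{ stall cycle}}{\text{stall}} \\
&= \frac{IC_{load} + IC_{store}}{IC} \times \frac{1 \text{ stall}}{\text{data memory access}} \times \frac{1 \text{ stall cycle}}{\text{stall}} \\
&= \frac{0.25 \text{ loads} + 0.15 \text{ stores}}{\text{instruction}} \times \frac{1 \text{ stall cycle}}{\text{data memory access}} = 0.40 \; \frac{\text{stall cycles}}{\text{instruction}}
\end{aligned}
$$

$$CPI = CPI_{ideal} + CPI_{stall} = 1 + 0.40 = 1.40$$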
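The unified-cache result can be reproduced from the instruction mix. A small sketch, assuming (as the slide does) one stall per data memory access and one stall cycle per stall; the function and parameter names are illustrative.

```python
def cache_stall_cpi(load_fraction, store_fraction,
                    stalls_per_access=1, cycles_per_stall=1):
    """Stall cycles per instruction caused by loads and stores on a unified cache."""
    return (load_fraction + store_fraction) * stalls_per_access * cycles_per_stall


CPI_IDEAL = 1.0
cpi_stall = cache_stall_cpi(load_fraction=0.25, store_fraction=0.15)
cpi = CPI_IDEAL + cpi_stall
print(f"CPI_stall = {cpi_stall:.2f}, CPI = {cpi:.2f}")
```

Only the instruction types that touch data memory contribute, so the stall term equals the combined load and store fraction of the program.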


Data HazardsInstruction result not ready when needed

Operations performed in the wrong orderClassification named for correct order of operations

Read After Write (RAW)Correct I2 reads register after I1 writes to itHazard I2 reads register before I1 writes to it

I2 uses incorrect valueWrite After Write (WAW)

Correct I2 writes to register after I1 writes to itHazard I2 writes to register before I1 writes to it

Incorrect value stays in register Write After Read (WAR)

Correct I2 writes to register after I1 reads itHazard I2 writes to register before reads I1 it

I1 uses incorrect valueRead After Read (RAR)

No hazard — reads do not affect registers
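The classification above can be expressed as a small check on two instructions in program order. This is an illustrative sketch only; the representation of an instruction as a `(dest, sources)` pair and the function name `classify` are assumptions made for the example, not part of the DLX model.

```python
def classify(i1, i2):
    """Return the dependency types between I1 and a later instruction I2.

    Each instruction is a (dest, sources) pair; dest is None when the
    instruction writes no register.
    """
    dest1, sources1 = i1
    dest2, sources2 = i2
    kinds = set()
    if dest1 is not None and dest1 in sources2:
        kinds.add("RAW")  # I2 must read after I1 writes
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")  # I2 must write after I1 writes
    if dest2 is not None and dest2 in sources1:
        kinds.add("WAR")  # I2 must write after I1 reads
    return kinds

add = ("R1", ["R2", "R3"])   # ADD R1,R2,R3
sub = ("R4", ["R5", "R1"])   # SUB R4,R5,R1
print(classify(add, sub))
```

A dependency becomes a hazard only when the pipeline timing lets the two operations occur in the wrong order.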

Data Hazards in DLXv2

RAW hazards
    DLX registers updated in stage 5
    Next instruction may read register in stage 2
    Possible hazard to be avoided

WAW hazards cannot occur
    DLX writes in uniform order
        Memory updated in MEM
        Registers updated in WB
    All updates performed in order of execution
    I2 cannot perform WB or MEM before I1 performs WB or MEM

WAR hazards cannot occur
    Loads performed in MEM and register reads in ID
    Stores performed in MEM and registers updated in WB
    I2 cannot perform WB or MEM before I1 performs ID or MEM

[Figure: DLX pipeline stages over CC1 to CC5: Instruction Fetch (Instruction Memory: address, instruction), Instruction Decode, Execute, Data Access (Data Memory: address, data), Write Back]

Register-Register RAW Dependencies in DLXv2

Program with register-register dependencies

    I1  ADD R1,R2,R3    I1 has R1 as destination
    I2  SUB R4,R5,R1
    I3  AND R6,R7,R1    I2 to I4 have R1 as source
    I4  OR  R8,R9,R1

        IF    ID    EX    MEM   WB
CC1     ADD
CC2     SUB   ADD
CC3     AND   SUB   ADD
CC4     OR    AND   SUB   ADD
CC5           OR    AND   SUB   ADD
CC6                 OR    AND   SUB
CC7                       OR    AND
CC8                             OR

Bad timing (uncorrected execution)
    I1 updates R1 in WB during CC5
    I2 reads R1 in ID during CC3
    I3 reads R1 in ID during CC4
    I4 reads R1 in ID during CC5

Detailed View of CC5 (Uncorrected) in DLXv2

SUB and AND instructions suffer RAW hazard: they read the wrong value of R1
OR instruction reads correct value of R1

[Figure: pipeline state during CC5: PC, IF logic, IF/ID, ID logic, ID/EX, EX logic, EX/MEM, MEM logic, MEM/WB, WB logic; OR in ID, AND in EX, SUB in MEM, ADD in WB]

START of CC5:
    ID/EX.R1 sees wrong value for OR
    R1 stores ADD result
    EX/MEM.ALU sees wrong AND result
    MEM/WB.ALU sees wrong SUB result

END of CC5:
    ADD result stored in R1
    ID/EX.R1 latches correct value for OR
    EX/MEM.ALU latches wrong AND result
    MEM/WB.ALU latches wrong SUB result

Pipeline Stall to Avoid RAW Hazard in DLXv2

Wait states during CC3 and CC4
    ID/EX freezes internal state on SUB
    IF/ID freezes internal state on AND (cannot enter ID until SUB finishes and moves to EX)
    ID performs NOP (no operation) to avoid reading old value of R1
    ID/EX passes φ (NOP) to EX

Continuation: no hazard in CC5
    WB operation performed at start of clock cycle
    Latching of register values in ID performed at end of clock cycle

        IF    ID    EX    MEM   WB
CC1     ADD
CC2     SUB   ADD
CC3     AND   SUB   ADD
CC4     AND   SUB   φ     ADD
CC5     AND   SUB   φ     φ     ADD
CC6     OR    AND   SUB   φ     φ
CC7           OR    AND   SUB   φ
CC8                 OR    AND   SUB
                          OR    AND
                                OR

The DLX control system must be able to identify all hazards and insert stall cycles when necessary.

Pipeline Stall in Instruction View in DLXv2

Wait states: ID/EX freezes state and passes NOP (no operation) to EX

Clock Cycle       1    2    3    4    5    6    7    8
ADD R1,R2,R3      IF   ID   EX   MEM  WB
SUB R4,R5,R1           IF   ID   ID   ID   EX   MEM  WB
AND R6,R7,R1                IF   IF   IF   ID   EX   MEM
OR  R8,R9,R1                               IF   ID   EX

Performance degradation too large

$$
\begin{aligned}
\mathrm{CPI}_{\text{stall}} &= \frac{\text{stall cycles}}{\text{stall}} \times \frac{\text{stalls}}{\text{instruction type}} \times \frac{\text{instruction types}}{\text{instruction}} \\
&= \frac{2 \text{ stall cycles}}{\text{stall}} \times \frac{0.5 \text{ register dependencies}}{\text{ALU instruction}} \times \frac{0.4 \text{ ALU instructions}}{\text{instruction}} \\
&= 2 \times 0.5 \times 0.4 = 0.4\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.4 \text{ (29\% degradation)}
\end{aligned}
$$

$$
\frac{IC_{\text{ALU}}}{IC} = 40\%
$$
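The stall timing in the instruction view can be reproduced with a short model. This is a sketch under stated assumptions, not the DLX control logic: instructions are `(dest, sources)` pairs, one instruction is fetched per cycle when the pipeline is not stalled, and an instruction may leave ID in the same cycle in which its producer performs WB (write at the start of the cycle, latch at the end). The function name `schedule_no_forwarding` is hypothetical.

```python
def schedule_no_forwarding(program):
    """Return, for each instruction, the clock cycle of its last ID cycle
    in a 5-stage pipeline that stalls in ID until source registers are
    written back (no forwarding)."""
    wb_cycle = {}    # register -> cycle in which its producer performs WB
    last_id = []
    previous_id = 1  # first instruction is fetched in cycle 1, decoded in cycle 2
    for dest, sources in program:
        id_cycle = previous_id + 1          # in-order: follow the previous instruction
        for reg in sources:
            id_cycle = max(id_cycle, wb_cycle.get(reg, 0))
        last_id.append(id_cycle)
        if dest is not None:
            wb_cycle[dest] = id_cycle + 3   # EX, MEM, WB follow ID
        previous_id = id_cycle
    return last_id

program = [
    ("R1", ["R2", "R3"]),  # ADD R1,R2,R3
    ("R4", ["R5", "R1"]),  # SUB R4,R5,R1
    ("R6", ["R7", "R1"]),  # AND R6,R7,R1
    ("R8", ["R9", "R1"]),  # OR  R8,R9,R1
]
id_cycles = schedule_no_forwarding(program)
stall_cycles = id_cycles[-1] - (len(program) + 1)
print(id_cycles, stall_cycles)
```

With this program, SUB leaves ID in cycle 5, so AND and OR are each delayed by the same two stall cycles.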

Forwarding or Bypass (DLX Version 3)

ADD writes ALU result to R1 in CC5
SUB needs R1 for ALU operation in CC4
AND needs R1 for ALU operation in CC5

Trick to prevent stall
    ADD calculates ALU result in CC3
    Allow SUB and AND to read incorrect value in ID
    Provide correct value from EX/MEM.ALU and MEM/WB.ALU directly to EX

[Figure: DLX Version 3 pipeline: Instruction Fetch (Instruction Memory: address, instruction), Instruction Decode, Execute, Data Memory Access (Data Memory: address, data), Write Back]

        IF    ID    EX    MEM   WB
CC1     ADD
CC2     SUB   ADD
CC3     AND   SUB   ADD
CC4     OR    AND   SUB   ADD
CC5           OR    AND   SUB   ADD
CC6                 OR    AND   SUB
CC7                       OR    AND
CC8                             OR


DLX Pipelined Implementation in DLXv3

MUXes in EX choose from NPC, A, B, I, EX/MEM.ALU, MEM/WB.ALU

Forwarding in Instruction View in DLXv3

Processor moves state of ADD instruction from buffer to buffer
SUB needs ALU result in CC4
    ADD provides ALU result from EX/MEM.ALU
AND needs ALU result in CC5
    ADD provides ALU result from MEM/WB.ALU

Clock Cycle       1    2    3    4    5    6
ADD R1,R2,R3      IF   ID   EX   MEM  WB
SUB R4,R5,R1           IF   ID   EX   MEM  WB
AND R6,R7,R1                IF   ID   EX   MEM
OR  R8,R9,R1                     IF   ID   EX

No stall cycles for Register-Register RAW hazard: CPI_stall = 0

Register-Load RAW Dependencies in DLXv3

Program with register-load dependencies

    I1  LW  R1,32(R2)   I1 has R1 as destination
    I2  SUB R4,R5,R1
    I3  AND R6,R7,R1    I2 to I4 have R1 as source
    I4  OR  R8,R9,R1

        IF    ID    EX    MEM   WB
CC1     LW
CC2     SUB   LW
CC3     AND   SUB   LW
CC4     OR    AND   SUB   LW
CC5           OR    AND   SUB   LW
CC6                 OR    AND   SUB
CC7                       OR    AND
CC8                             OR

Bad timing (uncorrected execution)
    I1 updates R1 in WB during CC5
    I2 reads R1 in ID during CC3
    I3 reads R1 in ID during CC4
    I4 reads R1 in ID during CC5

Memory Forwarding or Bypass (Version 4)

LW writes loaded data to R1 in CC5
SUB needs R1 for ALU operation in CC4
AND needs R1 for ALU operation in CC5

Trick to minimize stall
    LW loads data from memory in CC4
    Allow SUB to read incorrect value in ID
    Stall SUB for 1 clock cycle in ID (load performed later than ALU operation)
    Provide correct value from MEM/WB.LMD directly to EX

[Figure: DLX Version 4 pipeline: Instruction Fetch (Instruction Memory: address, instruction), Instruction Decode, Execute, Data Memory Access (Data Memory: address, data), Write Back]

        IF    ID    EX    MEM   WB
CC1     LW
CC2     SUB   LW
CC3     AND   SUB   LW
CC4     AND   SUB   φ     LW
CC5     OR    AND   SUB   φ     LW
CC6           OR    AND   SUB   φ
CC7                 OR    AND   SUB
CC8                       OR    AND
CC9                             OR


DLX Pipelined Implementation in DLXv4

MUXes in EX choose from NPC, A, B, I, EX/MEM.ALU, MEM/WB.ALU, MEM/WB.LMD

Forwarding in Instruction View in DLXv4

Clock Cycle       1    2    3    4    5    6    7
LW  R1,32(R2)     IF   ID   EX   MEM  WB
SUB R4,R5,R1           IF   ID   ID   EX   MEM  WB
AND R6,R7,R1                IF   IF   ID   EX   MEM
OR  R8,R9,R1                          IF   ID   EX

Loaded data used immediately in ALU operation in about 50% of loads

$$
\begin{aligned}
\mathrm{CPI}_{\text{stall}} &= \frac{\text{stall cycles}}{\text{stall}} \times \frac{\text{stalls}}{\text{instruction type}} \times \frac{\text{instruction types}}{\text{instruction}} \\
&= \frac{1 \text{ stall cycle}}{\text{stall}} \times \frac{0.5 \text{ ALU uses loaded data}}{\text{load instruction}} \times \frac{IC_{\text{load}}}{IC} \\
&= 0.50\ \frac{\text{cycles}}{\text{load instruction}} \times 0.25 = 0.125\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.125 \text{ (11\% degradation)}
\end{aligned}
$$

Register-Store RAW Dependencies in DLXv4

Program with register-store dependency

    I1  SUB R1,R5,R4    I1 has R1 as destination
    I2  SW  32(R2),R1   I2 has R1 as source

        IF    ID    EX    MEM   WB
CC1     SUB
CC2     SW    SUB
CC3           SW    SUB
CC4                 SW    SUB
CC5                       SW    SUB
CC6                             SW

Bad timing (uncorrected execution) in DLXv4
    I1 updates R1 in WB during CC5
    I2 reads R1 in ID during CC3

Trick to prevent stall (Version 5)
    SW reads incorrect value in ID
    Provide correct value from MEM/WB.ALU directly to data memory


DLX Pipelined Implementation — Version 5

New MUX in MEM chooses B or MEM/WB.ALU

Compiler Scheduling to Prevent RAW Hazards (DLXv5)

C program code
    I = I + 123;
    J = J - 567;

First pass compilation

Clock Cycle          1  2  3  4  5  6  7  8  9  10 11 12
LW  R2, I            F  D  X  M  W
ADD R2, R2, #123        F  D  D  X  M  W
SW  I, R2                  F  F  D  X  M  W
LW  R3, J                        F  D  X  M  W
SUB R3, R3, #567                    F  D  D  X  M  W
SW  J, R3                              F  F  D  X  M  W

Second pass compilation

Clock Cycle          1  2  3  4  5  6  7  8  9  10
LW  R2, I            F  D  X  M  W
LW  R3, J               F  D  X  M  W
ADD R2, R2, #123           F  D  X  M  W
SW  I, R2                     F  D  X  M  W
SUB R3, R3, #567                 F  D  X  M  W
SW  J, R3                           F  D  X  M  W
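The effect of scheduling on the total cycle count can be estimated with a simplified model. The sketch below is an assumption-laden illustration, not the compiler's algorithm: it charges one stall cycle whenever an ALU instruction immediately follows the load that produces one of its sources, and treats every other dependency as covered by forwarding. The `(op, dest, sources)` tuples and the name `count_cycles` are hypothetical.

```python
def count_cycles(program, stages=5):
    """Total clock cycles for a straight-line program on a 5-stage pipeline
    with forwarding, charging one stall per load followed immediately by an
    ALU instruction that uses the loaded register (simplified model)."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        cur_op, _, cur_sources = current
        if prev_op == "LW" and cur_op != "SW" and prev_dest in cur_sources:
            stalls += 1
    return len(program) + (stages - 1) + stalls

first_pass = [
    ("LW", "R2", []), ("ADD", "R2", ["R2"]), ("SW", None, ["R2"]),
    ("LW", "R3", []), ("SUB", "R3", ["R3"]), ("SW", None, ["R3"]),
]
second_pass = [
    ("LW", "R2", []), ("LW", "R3", []), ("ADD", "R2", ["R2"]),
    ("SW", None, ["R2"]), ("SUB", "R3", ["R3"]), ("SW", None, ["R3"]),
]
print(count_cycles(first_pass), count_cycles(second_pass))
```

Moving the second load between the first load and its consumer removes both load-use stalls without changing the result of the program.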

DLX Control Hazard

On each clock cycle
    PC ← NPC
    New PC for new instruction fetch in every clock cycle

Control hazard
    Incorrect address on branch instructions

Stages of branch execution

CLK   Clock Cycle   Latched state                   Action during CC
0     1             Memory ← PC(I1)                 IF/ID.IR "sees" instruction and PC(I1)
1     2             IF/ID.IR ← branch               Decode of branch instruction, NPC, I
2     3             ID/EX.NPC,I ← NPC,I             Calculate address NPC+I and cond
3     4             EX/MEM.ALU,cond ← ALU, cond     PC "sees" correct address via MUX using cond to choose NPC or NPC+I
4     5             PC ← branch address             IF/ID.IR "sees" correct instruction

Pipeline Flush for Control Hazard in DLXv5

Pipeline flush
    Empty and restart pipeline
    Simplest solution to implement

[Figure: pipeline timing over clock cycles 1 to 9. I1 (BEQZ R1,IT) passes through IF, ID, EX, MEM, WB in cycles 1 to 5. The branch is decoded and the pipeline is flushed: the fall-through instruction fetched in cycle 2 is discarded and cycles 3 and 4 hold bubbles (φ). The PC "sees" the correct address, either Fall-Through (NPC) or Target (NPC+I), and the correct instruction is fetched in cycle 5 and completes in cycle 9.]

Performance Degradation for Pipeline Flush (DLXv5)

Stalled (wasted) cycles

$$
\begin{aligned}
\mathrm{CPI}_{\text{stall}} &= \frac{\text{stall cycles}}{\text{stall}} \times \frac{\text{stalls}}{\text{instruction type}} \times \frac{\text{instruction types}}{\text{instruction}} \\
&= \frac{3 \text{ stall cycles}}{\text{stall}} \times \frac{1 \text{ branch stall}}{\text{branch instruction}} \times \frac{IC_{\text{branch}}}{IC} \\
&= 3\ \frac{\text{cycles}}{\text{branch instruction}} \times 0.20 = 0.60\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.60 \text{ (38\% degradation)}
\end{aligned}
$$

[Figure: pipeline flush timing for BEQZ R1,IT in DLXv5 over clock cycles 1 to 9, with the correct instruction (fall-through or target) fetched in cycle 5]

Improving Branch Performance (1)

Enhancement 1
    Earlier instruction fetch after pipeline flush
    Version 5: PC "sees" correct address in CC4 but fetches in CC5
    Version 6a: PC latches correct address when ready, in CC4
    Special CLK for pipeline flush recovery

$$
\mathrm{CPI}_{\text{stall}} = 2\ \frac{\text{cycles}}{\text{branch instruction}} \times 0.20 = 0.40\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.40 \text{ (29\% degradation)}
$$

[Figure: DLXv6a timing over clock cycles 1 to 4. I1 (BEQZ) in IF, ID, EX, MEM; fall-through (F-T) fetched in cycle 2, bubble (φ) in cycle 3, correct instruction (fall-through or target) fetched in cycle 4.]

Improving Branch Performance (2)

Enhancement 2: dedicated ALU for branch address in ID stage
    Version 6b
    Branch address available in CC3
    PC updates in CC3

$$
\mathrm{CPI}_{\text{stall}} = 1\ \frac{\text{cycle}}{\text{branch instruction}} \times 0.20 = 0.20\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.20 \text{ (17\% degradation)}
$$

[Figure: DLXv6b timing over clock cycles 1 to 3. I1 (BEQZ) in IF, ID, EX; fall-through (F-T) fetched in cycle 2; correct instruction (fall-through or target) fetched in cycle 3.]

Improving Branch Performance (3)

Enhancement 3
    Versions 5 to 6b
        Flush entire pipeline
        Restart with correct branch address
    Version 6c
        Flush entire pipeline on branch taken
        Continue instruction in IF on branch not taken

[Figure: DLXv6c timing over clock cycles 1 to 9 for I1 (BEQZ R1,IT), the fall-through instruction, and the target instruction IT, marked at the point where the branch address and cond are ready. Branch taken (cond = 1 ⇒ PC ← NPC + I): the fall-through fetch is discarded and the target is fetched. Branch not taken (cond = 0 ⇒ PC ← NPC): the fall-through instruction continues from IF through WB.]


DLX Version 6c

Version 6c Branch Processing (1)

CC1
    BEQZ fetched to IF
    PC "sees" PC_F-T = NPC = PC+4
    Points to I_FALL-THROUGH

Version 6c Branch Processing (2)

CC2
    IF fetches I_FALL-THROUGH
    BEQZ advances to ID
    Calculates I_TARG = NPC+I and cond
    PC "sees" NPC = PC_F-T+4
    Points to I_FALL-THROUGH+1

Version 6c Branch Processing (3)

CC3
    IF fetches I_FALL-THROUGH+1
    BEQZ advances to EX
    ID/EX latches NPC+I and cond
    PC "sees" PC_TARG = PC+I
    Points to I_TARG

Version 6c Branch Processing (4)

CC3
    PC receives special CLK
    PC latches PC_TARG = PC+I
    IF fetches I_TARG
    PC "sees" PC_TARG+1 = PC_TARG+4
    Points to I_TARG+1

On CC4
    IF/ID.IR latches I_TARG
    PC latches PC_TARG+1 = PC_TARG+4

Branch Performance of Version 6c

Method called Predict-Not-Taken
    Branch taken: flush entire pipeline
    Branch not taken: continue instruction in IF
    Better performance on not taken (no pipeline stall)
    Ideal method if most branches are not taken

Statistics from SPEC CINT
    Not taken 33%
    Taken 67%

$$
\begin{aligned}
\mathrm{CPI}_{\text{stall}} &= \frac{\text{stall cycles}}{\text{stall}} \times \frac{\text{stalls}}{\text{instruction type}} \times \frac{\text{instruction types}}{\text{instruction}} \\
&= \frac{1 \text{ stall cycle}}{\text{taken branch}} \times \frac{0.67 \text{ taken branches}}{\text{branch instruction}} \times \frac{IC_{\text{branch}}}{IC} \\
&= 1 \times 0.67 \times 0.20 = 0.13\ \frac{\text{cycles}}{\text{instruction}}
\;\Rightarrow\; \mathrm{CPI} = 1.13 \text{ (12\% degradation)}
\end{aligned}
$$
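The branch-stall figures for DLX versions 5 through 6c can be compared side by side. This is a minimal sketch using the slide values (20% branches, 67% taken). The degradation measure `1 - CPI_ideal / CPI` is an assumption inferred from the quoted percentages, and the names `branch_cpi` and `versions` are illustrative.

```python
BRANCH_FRACTION = 0.20   # IC_branch / IC
TAKEN_FRACTION = 0.67    # taken branches per branch (SPEC CINT figure on the slide)

def branch_cpi(stall_cycles, stalls_per_branch, ideal_cpi=1.0):
    """Return (CPI, degradation) for a given branch penalty."""
    cpi_stall = stall_cycles * stalls_per_branch * BRANCH_FRACTION
    cpi = ideal_cpi + cpi_stall
    degradation = 1.0 - ideal_cpi / cpi   # assumed definition, matches slide values
    return cpi, degradation

# (stall cycles per stall, stalls per branch instruction)
versions = {
    "DLXv5":  (3, 1.0),             # flush, fetch correct instruction in CC5
    "DLXv6a": (2, 1.0),             # PC latches correct address in CC4
    "DLXv6b": (1, 1.0),             # branch address computed in ID
    "DLXv6c": (1, TAKEN_FRACTION),  # predict-not-taken: stall only when taken
}
results = {name: branch_cpi(*args) for name, args in versions.items()}
for name, (cpi, degradation) in results.items():
    print(f"{name}: CPI = {cpi:.2f}, degradation = {degradation:.0%}")
```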

DLXv6c Pipeline

[Figure: DLXv6c pipeline stages IF, ID, EX, MEM, WB: Instruction Fetch (Instruction Memory), Instruction Decode, Integer ALU and Floating Point Unit (FPU), Data Memory Access (Data Memory), Write Back]

Forwarding
    ALU result to ALU source
    Memory load to ALU source (with 1 CC stall)
    ALU result to memory store

Other dependencies
    Require stall until Write-Back of intermediate result

DLXv6c Formal Specification (Integer Pipeline) (1)

Instruction Fetch (IF)
    PC        ← PC + 4        if cond = 0
              ← ID/EX.NNPC    if cond = 1
    IF/ID.NPC ← PC + 4        if cond = 0
              ← ID/EX.NNPC    if cond = 1
    IF/ID.IR  ← Mem[PC]

Instruction Decode (ID)
    ID/EX.A    ← Reg[IF/ID.IR[6-10]]
    ID/EX.B    ← Reg[IF/ID.IR[11-15]]
    ID/EX.I    ← (IR[16])^16 ## IF/ID.IR[16-31]
    ID/EX.IR   ← IF/ID.IR
    ID/EX.NNPC ← IF/ID.NPC + (IR[16])^16 ## IF/ID.IR[16-31]
    ID/EX.cond ← (Reg[IF/ID.IR[6-10]] == 0)

Stage Buffers (←)
    Sample and store inputs on falling CLK
    "See" new inputs during clock cycle (between falling CLKs)

Instruction formats (bit positions)

Type   0-5   6-10   11-15   16-31
R      op    rs1    rs2     rd, function
I      op    rs     rd      immediate
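The immediate field in the ID stage is sign-extended: bit 16 of the instruction (the most significant bit of the 16-bit immediate in DLX bit numbering) is replicated 16 times and concatenated with bits 16 to 31. A minimal sketch of that operation follows; the function name `sign_extend_16` is an assumption for illustration.

```python
def sign_extend_16(immediate):
    """Sign-extend a 16-bit immediate to a 32-bit word:
    (sign bit replicated 16 times) ## (16-bit immediate)."""
    immediate &= 0xFFFF
    sign = (immediate >> 15) & 1
    upper = 0xFFFF if sign else 0x0000
    return (upper << 16) | immediate

print(hex(sign_extend_16(0x0400)))  # positive offset is unchanged
print(hex(sign_extend_16(0xFFD8)))  # negative offset gets an all-ones upper half
```

The extended value is what the ID stage adds to IF/ID.NPC to form the branch target ID/EX.NNPC.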

DLXv6c Formal Specification (Integer Pipeline) (2)

Execute (EX)
    EX/MEM.ALU_OUT ← ID/EX.A function ID/EX.B    (R-ALU)
                   ← ID/EX.A op ID/EX.I          (I-ALU, Memory)
    Forwarding: EX/MEM.ALU_OUT or MEM/WB.ALU_OUT or MEM/WB.LMD substituted for A or B
    EX/MEM.B  ← ID/EX.B
    EX/MEM.IR ← ID/EX.IR

Memory (MEM)
    MEM/WB.ALU_OUT ← EX/MEM.ALU_OUT
    MEM/WB.LMD ← Mem[EX/MEM.ALU_OUT]       (Load)
    Mem[EX/MEM.ALU_OUT] ← EX/MEM.B         (Store)
    Forwarding: MEM/WB.ALU_OUT substituted for B
    MEM/WB.IR ← EX/MEM.IR

Write Back (WB)
    Reg[MEM/WB.IR[11-15]] ← MEM/WB.ALU_OUT    (I-ALU)
                          ← MEM/WB.LMD        (Load)
    Reg[MEM/WB.IR[16-20]] ← MEM/WB.ALU_OUT    (R-ALU)

Instruction formats (bit positions)

Type   0-5   6-10   11-15   16-31
R      op    rs1    rs2     rd, function
I      op    rs     rd      immediate

Forwarding ALU to ALU

Clock Cycle        1    2    3    4    5    6    7    8    9
ADD R1, R2, R3     IF   ID   EX   MEM  WB
ADD R4, R1, R5          IF   ID   EX   MEM  WB
ADD R6, R4, R1               IF   ID   EX   MEM  WB
ADD R7, R2, R1                    IF   ID   EX   MEM  WB

Forwarding Load to ALU

Clock Cycle        1    2    3    4    5    6    7    8    9
LW  R1, 8(R2)      IF   ID   EX   MEM  WB
ADD R3, R1, R2          IF   ID   ID   EX   MEM  WB
ADD R4, R3, R1               IF   IF   ID   EX   MEM  WB

Clock Cycle        1    2    3    4    5    6    7    8
LW  R1, 8(R2)      IF   ID   EX   MEM  WB
ADD R4, R4, R1          IF   ID   ID   EX   MEM  WB
ADD R4, R4, R3               IF   IF   ID   EX   MEM  WB

Clock Cycle        1    2    3    4    5    6    7    8
LW  R1, 8(R2)      IF   ID   EX   MEM  WB
ADD R4, R4, R3          IF   ID   EX   MEM  WB
ADD R4, R4, R1               IF   ID   EX   MEM  WB

Forwarding ALU to Store

Clock Cycle        1    2    3    4    5    6    7    8    9
ADD R1, R3, R2     IF   ID   EX   MEM  WB
SW  8(R2), R1           IF   ID   EX   MEM  WB

Clock Cycle        1    2    3    4    5    6    7    8    9
ADD R1, R3, R2     IF   ID   EX   MEM  WB
ADD R4, R5, R6          IF   ID   EX   MEM  WB
SW  8(R2), R1                IF   ID   ID   EX   MEM  WB
SW  10(R4), R1                    IF   IF   ID   EX   MEM  WB

ALU to Branch

Clock Cycle        1    2    3    4    5    6    7    8    9
ADD  R1, R3, R2    IF   ID   EX   MEM  WB
BEQZ R1, targ           IF   ID   ID   ID   EX   MEM  WB

Clock Cycle        1    2    3    4    5    6    7    8    9
ADD  R1, R3, R2    IF   ID   EX   MEM  WB
ADD  R4, R5, R6         IF   ID   EX   MEM  WB
ADD  R7, R8, R9              IF   ID   EX   MEM  WB
BEQZ R1, targ                     IF   ID   EX   MEM  WB

Improvement by Re-Scheduling in DLXv6c

a[i] = a[i] + b[i] - c[i] + d[i]
    a[] = 000 to 3FF
    b[] = 400 to 7FF
    c[] = 800 to BFF
    d[] = C00 to FFF

Re-scheduled code (15 clock cycles; R1 and R4 forwarded)

Instruction          Cycles   Stages
ADDI R1, R0, #400    1-5      F D X M W
SUBI R1, R1, #4      2-6      F D X M W
LW   R2, 0(R1)       3-7      F D X M W
LW   R3, 400(R1)     4-8      F D X M W
LW   R5, 800(R1)     5-9      F D X M W
LW   R6, C00(R1)     6-10     F D X M W
ADD  R4, R2, R3      7-11     F D X M W
SUB  R4, R4, R5      8-12     F D X M W
ADD  R4, R4, R6      9-13     F D X M W
SW   0(R1), R4       10-14    F D X M W
BNEZ R1, FFD8        11-15    F D X M W

Original code (20 clock cycles; R1 forwarded to the loads)

Instruction          Cycles   Stages
ADDI R1, R0, #400    1-5      F D X M W
LW   R2, -4(R1)      2-6      F D X M W
LW   R3, 3FC(R1)     3-7      F D X M W
ADD  R4, R2, R3      4-9      F D D X M W      Forward R3
LW   R2, 7FC(R1)     5-10     F F D X M W
SUB  R4, R4, R2      7-12     F D D X M W      Forward R2
LW   R2, BFC(R1)     8-13     F F D X M W
ADD  R4, R4, R2      10-15    F D D X M W      Forward R2
SW   -4(R1), R4      11-16    F F D X M W
SUBI R1, R1, #4      13-17    F D X M W
BNEZ R1, -40         14-20    F D D D X M W

General Branch Prediction

Branch statistics from SPEC CINT
    Branch not taken 33%
    Branch taken 67%

Most branch instructions
    Used to build loops
    Run more than once

Branch prediction
    Advanced technique
    Not implemented in DLX model
    Used in modern RISC processors and Intel x86 since Pentium

Branch predictor
    Records statistics on branch instructions
        Source address, target address, taken/not-taken
    Predicts branch behavior based on previous behavior

Branch Prediction for DLX Pipeline

[Figure: DLX pipeline stages over CC1 to CC5: Instruction Fetch (Instruction Memory: address, instruction), Instruction Decode, Execute, Data Access (Data Memory: address, data), Write Back]

1. Branch predictor in IF stage
    Identifies branch instruction
        According to source address
    Predicts branch from branch history
        Taken: predicts branch target address
        Not-taken: uses fall-through address

2. Validate branch instruction in ID stage
    Usual calculation:
        Target address
        Condition flag (taken or not-taken)

3. After validation
    Update branch predictor
        Target address
        Branch history (taken/not-taken)
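The three steps (predict in IF, validate in ID, update after validation) can be sketched as a small table keyed by the branch source address. This is an illustrative model only: it assumes a one-entry history (the last outcome) per branch and an initial prediction of not-taken; the class name `BranchPredictor` and its methods are hypothetical, not part of the DLX model.

```python
class BranchPredictor:
    """Last-outcome predictor: remembers, per branch source address,
    whether the branch was taken last time and where it went."""

    def __init__(self):
        self.table = {}  # source address -> (taken last time, target address)

    def predict(self, source, fall_through):
        """Step 1 (IF): return the predicted next fetch address."""
        taken, target = self.table.get(source, (False, None))
        return target if taken else fall_through

    def update(self, source, taken, target):
        """Step 3: record the validated outcome from the ID stage."""
        self.table[source] = (taken, target)

predictor = BranchPredictor()
first = predictor.predict(0x18, fall_through=0x1C)   # unknown branch: fall-through
predictor.update(0x18, taken=True, target=0x08)      # validation found it taken
second = predictor.predict(0x18, fall_through=0x1C)  # now predicts the target
print(hex(first), hex(second))
```

Real predictors typically keep more history per branch; a single remembered outcome is the simplest case consistent with the description above.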

Branch Prediction Performance

Branch taken, first execution: misprediction

[Figure: pipeline timing over clock cycles 1 to 9 for I1 (BEQZ R1,IT), the fall-through instruction I2, and the target instruction IT. The fall-through instruction is fetched (IF) after the branch and discarded; the target instruction then passes through IF, ID, EX, MEM, WB.]

Branch taken, second execution: correct prediction

[Figure: pipeline timing over clock cycles 1 to 9 for I1 (BEQZ R1,IT) followed directly by Target (IT), Target+1 (IT+1), and Target+2 (IT+2), each passing through IF, ID, EX, MEM, WB in consecutive cycles.]

Branch Prediction Performance for Simple Loop

Simple static loop

        ADDI R1, R0, #N     ; N iterations
    L1: ALU Block
        SUBI R1, R1, #1     ; B lines of code
        BNEZ R1, L1
        I_fall-through

$$
\mathrm{CPI}_{\text{stall}}^{\text{branch}} = \frac{2}{N \times B + 2} \xrightarrow{\; N \times B \to \text{large} \;} 0
$$

    ADDI R1, R0, #N               IF ID EX MEM WB
    L1: ALU Block                 IF ID EX MEM WB
    < B-2 lines of ALU code >
    BNEZ R1, L1                   IF ID EX MEM WB      R1 = N-1
    I_fall-through                IF ID φ  φ   φ
    L1: ALU Block                 IF ID EX MEM WB
    < B-2 lines of ALU code >
    BNEZ R1, L1                   IF ID EX MEM WB      R1 = N-2
    L1: ALU Block                 IF ID EX MEM WB
    ...
    < B-2 lines of ALU code >
    BNEZ R1, L1                   IF ID EX MEM WB      R1 = 0
    L1: ALU Block                 IF ID φ  φ   φ
    I_fall-through                IF ID EX MEM WB
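The two mispredictions in the simple loop (first and last iteration) can be confirmed by simulation. The sketch below assumes a last-outcome predictor that starts at not-taken and one stall cycle per misprediction; the function names are illustrative.

```python
def loop_mispredictions(n):
    """Count mispredictions of BNEZ R1, L1 over a loop of n iterations,
    using a last-outcome predictor that initially predicts not-taken."""
    predicted_taken = False
    misses = 0
    for r1 in range(n - 1, -1, -1):   # value of R1 after SUBI in each iteration
        taken = r1 != 0               # BNEZ is taken while R1 is nonzero
        if taken != predicted_taken:
            misses += 1
        predicted_taken = taken
    return misses

def branch_stall_cpi(n, b):
    """Stall cycles per instruction: mispredictions / (N*B loop instructions
    plus the ADDI and the fall-through instruction)."""
    return loop_mispredictions(n) / (n * b + 2)

print(loop_mispredictions(100), branch_stall_cpi(100, 10))
```

For any loop with at least two iterations the count stays at 2, so the stall contribution shrinks as N × B grows.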

More Compiler Optimizations (1)

Common sub-expression elimination
    Compiler encounters instructions
        B = 10*(A/3);
        C = (A/3)/4;
    Calculates (A/3) into register
    Uses register in later calculations

First-pass compilation

    LW   R1,A
    ADDI R2,R0,#3
    DIV  R1,R1,R2
    ADDI R2,R0,#10
    MULT R1,R1,R2
    SW   B,R1
    LW   R1,A
    ADDI R2,R0,#3
    DIV  R1,R1,R2
    ADDI R2,R0,#4
    DIV  R1,R1,R2
    SW   C,R1

Second-pass compilation

    LW   R1,A
    ADDI R2,R0,#3
    DIV  R1,R1,R2
    ADDI R2,R0,#10
    MULT R3,R1,R2
    SW   B,R3
    ADDI R2,R0,#4
    DIV  R3,R1,R2
    SW   C,R3

More Compiler Optimizations (2)

Loop unrolling
    Instead of loop compiler replicates instructions
    Eliminates overhead of testing loop control variable

Inlining
    Procedure call replaced by code of procedure or macro

First-pass compilation

    00  ADDI R2,R0,#0x05
    04  ADDI R1,R0,#0x08
    08  LW   R3,0x1000(R1)
    0C  JAL  10
    10  SW   2000(R1),R3
    14  SUBI R1,R1,#0x04
    18  BNEZ R1,-0x14
    1C  ADDI R2,R0,#3
    20  ADD  R3,R3,R2
    24  JR   R31

Second-pass compilation

    00  ADDI R2,R0,#0x05
    04  LW   R3,0x1008(R0)
    08  ADD  R3,R3,R2
    0C  SW   2008(R1),R3
    10  LW   R3,0x1004(R0)
    14  ADD  R3,R3,R2
    18  SW   2004(R1),R3
    1C  ADDI R2,R0,#3

More Hardware Optimizations

Superscaling
    Run 2 or more pipelines in parallel
    Instructions without dependencies execute in parallel
    Used in most RISC processors and Pentium 1 to 4, Centrino, Core

Dynamic Scheduling
    Processor performs dynamic instruction scheduling
    Same result as compiler scheduling
    Very efficient when combined with superscaling
    Used in IBM mainframes since 1967
    Used in Pentium II to 4, Centrino, and Core processors

Register Aliasing
    Tasks require logical registers (R0, R1, … as defined in ISA)
    Physical registers allocated per task from large register pool
    Multiple tasks use same logical register in parallel

Instruction Predication
    Usual test-and-set instructions (SLT, SGT, SEQ, …) set predication flags
    Instruction can be run or cancelled according to a predicate flag


Computer Arithmetic

Unsigned Integers

Binary representation

$$
\left(K(k)\right)_2 = a_{n-1}\, a_{n-2}\, a_{n-3} \ldots a_1\, a_0, \qquad a_i \in \{0, 1\}
$$

$$
00\ldots0 \le \left(K(k)\right)_2 \le 11\ldots1
$$

Decimal value

$$
k = \left(K(k)\right)_{10} = a_{n-1} \times 2^{n-1} + a_{n-2} \times 2^{n-2} + a_{n-3} \times 2^{n-3} + \ldots + a_1 \times 2^{1} + a_0 \times 2^{0}
$$

$$
\left(K(k_{\min})\right)_2 = 00\ldots0 \;\Rightarrow\; k_{\min} = 0
$$

$$
\left(K(k_{\max})\right)_2 = 11\ldots1 \;\Rightarrow\; \left(K(k_{\max})\right)_2 + 1 = 100\ldots0
$$

$$
k_{\max} + 1 = 1 \times 2^{n} + 0 \times 2^{n-1} + \ldots + 0 \times 2^{0} = 2^{n}
$$

$$
k_{\max} = 2^{n} - 1, \qquad 0 \le k = \left(K(k)\right)_{10} \le 2^{n} - 1
$$

A value k > 2^n − 1 cannot be represented in n bits.
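The positional formula and the range of n-bit unsigned integers can be checked directly. This is a small illustrative sketch; the function name `unsigned_value` and the use of a bit string (most significant bit first) are assumptions for the example.

```python
def unsigned_value(bits):
    """Decimal value of an unsigned binary string a_{n-1} ... a_1 a_0:
    the sum of a_i * 2**i over all bit positions."""
    n = len(bits)
    return sum(int(a) * 2 ** (n - 1 - position) for position, a in enumerate(bits))

n = 8
k_min = unsigned_value("0" * n)
k_max = unsigned_value("1" * n)
print(k_min, k_max)
```

Adding 1 to the all-ones pattern gives 2^n, which needs n + 1 bits, so the largest representable value is 2^n − 1.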

Signed Integers

Binary representation (two's complement)

$$
\left(K(k)\right)_2 = a_{n-1}\, a_{n-2}\, a_{n-3} \ldots a_1\, a_0, \qquad a_i \in \{0, 1\}
$$

Non-negative values

$$
0 \le (k)_{10} \le 2^{n-1} - 1: \qquad k = \left(K(k)\right)_{10}, \quad \left(K(k)\right)_2 = 0\, a_{n-2}\, a_{n-3} \ldots a_1\, a_0
$$

$$
k = 0 \times 2^{n-1} + a_{n-2} \times 2^{n-2} + a_{n-3} \times 2^{n-3} + \ldots + a_1 \times 2^{1} + a_0 \times 2^{0}
$$

Negative values

$$
0 > (k)_{10} \ge -2^{n-1} \;\Rightarrow\; (k)_{10} = -\left[ \left(K(k)\right)'_2 + 1 \right]
$$

$$
\begin{aligned}
\left(K(k)\right)_2 = 11\ldots1 &\;\Rightarrow\; k = -\left[ (11\ldots1)' + 1 \right] = -\left[ 00\ldots0 + 1 \right] = -1 \\
\left(K(k)\right)_2 = 11\ldots0 &\;\Rightarrow\; k = -\left[ (11\ldots0)' + 1 \right] = -\left[ 00\ldots1 + 1 \right] = -2 \\
\left(K(k)\right)_2 = 10\ldots0 &\;\Rightarrow\; k = -\left[ (10\ldots0)' + 1 \right] = -\left[ 01\ldots1 + 1 \right] = -2^{n-1}
\end{aligned}
$$


Negative Signed Numbers

$0 > k \ge -2^{n-1} \Rightarrow k = -(K' + 1)$

$K = a_{n-1} a_{n-2} a_{n-3} \ldots a_1 a_0$

$K' = (1 - a_{n-1}) (1 - a_{n-2}) (1 - a_{n-3}) \ldots (1 - a_1) (1 - a_0)$

$K + K' + 1 = [a_{n-1} + (1 - a_{n-1})] \times 2^{n-1} + [a_{n-2} + (1 - a_{n-2})] \times 2^{n-2} + \ldots + [a_0 + (1 - a_0)] \times 2^0 + 1$

$= (11 \ldots 1)_2 + 1 = (2^n - 1) + 1 = 2^n$

which has no representation in $n$ bits

$(K(k))' + 1 = 2^n - K(k)$, where the $2^n$ term is the overflow bit (ignored for signed)


General Formula for Signed Numbers

$k = (K)_{10} = \begin{cases} (0\, a_{n-2} \ldots a_0)_2, & a_{n-1} = 0 \\ -(K' + 1), & a_{n-1} = 1 \end{cases}$

$= \begin{cases} 0 \times 2^{n-1} + a_{n-2} \times 2^{n-2} + \ldots + a_0, & a_{n-1} = 0 \\ -\left[ (2^{n-1} - 1) - (a_{n-2} \times 2^{n-2} + \ldots + a_0) + 1 \right], & a_{n-1} = 1 \end{cases}$

$= \begin{cases} 0 \times 2^{n-1} + a_{n-2} \times 2^{n-2} + \ldots + a_0, & a_{n-1} = 0 \\ -1 \times 2^{n-1} + a_{n-2} \times 2^{n-2} + \ldots + a_0, & a_{n-1} = 1 \end{cases}$

$= -a_{n-1} \times 2^{n-1} + a_{n-2} \times 2^{n-2} + \ldots + a_0$

Negation in $n$ bits: $(K(k))' + 1 = 2^n - K(k)$
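The general formula can be checked with a short Python sketch. The function names `encode_twos_complement`, `decode_twos_complement`, and `negate` are illustrative and do not come from the slides; bit strings are written most significant bit first.

```python
def encode_twos_complement(k: int, n: int) -> str:
    """Return the n-bit two's complement bit string K for the integer k."""
    if not -(1 << (n - 1)) <= k <= (1 << (n - 1)) - 1:
        raise ValueError(f"{k} cannot be represented in {n} bits")
    # Masking to n bits maps a negative k to 2^n - |k|.
    return format(k & ((1 << n) - 1), f"0{n}b")


def decode_twos_complement(bits: str) -> int:
    """Apply k = -a_{n-1} * 2^(n-1) + a_{n-2} * 2^(n-2) + ... + a_0."""
    n = len(bits)
    value = -int(bits[0]) * (1 << (n - 1))
    for i, bit in enumerate(reversed(bits[1:])):
        value += int(bit) * (1 << i)
    return value


def negate(bits: str) -> str:
    """Form K' + 1 and ignore the overflow bit."""
    n = len(bits)
    mask = (1 << n) - 1
    complement = int(bits, 2) ^ mask      # K' (bitwise complement)
    return format((complement + 1) & mask, f"0{n}b")
```

For example, `decode_twos_complement(negate(encode_twos_complement(5, 8)))` walks through K, then K' + 1, then back to a decimal value.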


Multiplying Unsigned Integers

Long Multiplication

        0 1 1
      × 1 0 1
      -------
        0 1 1
      0 0 0
    0 1 1
    ---------
    0 1 1 1 1

Algorithm

Operands
    $a = a_{n-1} a_{n-2} a_{n-3} \ldots a_0$
    $b = b_{n-1} b_{n-2} b_{n-3} \ldots b_0$

Zero temporary register $P = P_{n-1} P_{n-2} P_{n-3} \ldots P_0 \leftarrow 0$, $c_{out} \leftarrow 0$

$n$ times {
    $c_{out}, P \leftarrow \begin{cases} 0,\ P & \text{if } a_0 = 0 \\ P + b & \text{if } a_0 = 1 \end{cases}$
    shift right bits $[c_{out}\, P_{n-1} P_{n-2} P_{n-3} \ldots P_0][a_{n-1} a_{n-2} a_{n-3} \ldots a_0]$
    to form new $[P_{n-1} P_{n-2} P_{n-3} \ldots P_0][a_{n-1} a_{n-2} a_{n-3} \ldots a_0]$
}

Result is found in bits $P_{n-1} P_{n-2} P_{n-3} \ldots P_0\, a_{n-1} a_{n-2} a_{n-3} \ldots a_0$
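The shift-and-add loop above can be sketched in Python as follows. This is a minimal model under stated assumptions: the function name `multiply_unsigned` is hypothetical, Python integers stand in for the n-bit registers P and a, and `carry` plays the role of c_out.

```python
def multiply_unsigned(a: int, b: int, n: int) -> int:
    """Multiply two n-bit unsigned integers with the shift-and-add loop."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    p = 0                                  # temporary register P, zeroed
    for _ in range(n):
        carry = 0
        if a & 1:                          # a_0 = 1: P <- P + b, keep carry out
            total = p + b
            carry = total >> n
            p = total & mask
        # Shift the 2n+1 bits [c_out P][a] right by one place.
        combined = (carry << (2 * n)) | (p << n) | a
        combined >>= 1
        p = (combined >> n) & mask
        a = combined & mask
    return (p << n) | a                    # result in bits P followed by a
```

With the operands of the long multiplication example, `multiply_unsigned(0b011, 0b101, 3)` gives the 2n-bit product held in P and a.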


Multiplying Signed Integers

Algorithm

Operands (non-negative)
    $a = a_{n-1} a_{n-2} a_{n-3} \ldots a_0$
    $b = b_{n-1} b_{n-2} b_{n-3} \ldots b_0$

Zero temporary register $P = P_{n-1} P_{n-2} P_{n-3} \ldots P_0 \leftarrow 0$

$n$ times {
    $P \leftarrow \begin{cases} P & \text{if } a_0 = 0 \\ P + b & \text{if } a_0 = 1 \end{cases}$
    shift right bits $[P_{n-1}\, P_{n-1} P_{n-2} P_{n-3} \ldots P_0][a_{n-1} a_{n-2} a_{n-3} \ldots a_0]$
    to form new $[P_{n-1} P_{n-2} P_{n-3} \ldots P_0][a_{n-1} a_{n-2} a_{n-3} \ldots a_0]$
}

Result is found in bits $P_{n-1} P_{n-2} P_{n-3} \ldots P_0\, a_{n-1} a_{n-2} a_{n-3} \ldots a_0$

Uses $P_{n-1}$ instead of $c_{out}$


Dividing Unsigned Integers

Long Division

[Worked binary long division example; the digit layout is not recoverable from the extracted text]

Algorithm for a / b

Operands
    $a = a_{n-1} a_{n-2} a_{n-3} \ldots a_0$
    $b = b_{n-1} b_{n-2} b_{n-3} \ldots b_0$

Zero temporary register $P = P_{n-1} P_{n-2} P_{n-3} \ldots P_0 \leftarrow 0$

$n$ times {
    shift bits $[P_{n-1} P_{n-2} P_{n-3} \ldots P_0][a_{n-1} a_{n-2} a_{n-3} \ldots a_0]$ left
    to form $[P_{n-2} \ldots P_0\, a_{n-1}][a_{n-2} \ldots a_0\, 0]$ (discard $P_{n-1}$; new P, new a)
    if $P \ge b$: $\begin{cases} P \leftarrow P - b \\ a_0 \leftarrow 1 \end{cases}$
}

Remainder is in $P$ and Quotient is in $a$
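A minimal Python sketch of the division loop above, assuming a non-zero divisor. The name `divide_unsigned` is illustrative; Python integers model the n-bit registers P and a.

```python
def divide_unsigned(a: int, b: int, n: int) -> tuple[int, int]:
    """Divide n-bit unsigned a by b; return (quotient, remainder)."""
    if b == 0:
        raise ZeroDivisionError("divisor b must be non-zero")
    mask = (1 << n) - 1
    a &= mask
    p = 0                                  # temporary register P, zeroed
    for _ in range(n):
        # Shift [P][a] left one place; the vacated a_0 starts as 0.
        combined = ((p << n) | a) << 1
        p = (combined >> n) & mask         # old P_{n-1} is discarded
        a = combined & mask
        if p >= b:                         # subtract succeeds: quotient bit 1
            p -= b
            a |= 1
    return a, p                            # quotient in a, remainder in P
```

For instance, `divide_unsigned(13, 3, 4)` returns the quotient and remainder as a pair.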


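Floating Point Numbers (IEEE-754 Standard)

Format: | s | e | f |

Single precision
    s: 1 bit
    e: 8 bits
    f: 23 bits

$N = (-1)^s \times (1.f)_2 \times 2^{e - 127}, \quad 1 \le e \le 254, \quad -126 \le e - 127 \le 127$

Double precision
    s: 1 bit
    e: 11 bits
    f: 52 bits

$N = (-1)^s \times (1.f)_2 \times 2^{e - 1023}, \quad 1 \le e \le 2046, \quad -1022 \le e - 1023 \le 1023$

Special values (e = 255 for single precision corresponds to e = 2047 for double precision)

e   | f        | N
0   | 0        | 0
0   | Not zero | $N = (-1)^s \times 0.f \times 2^{-126}$ (single), $N = (-1)^s \times 0.f \times 2^{-1022}$ (double)
255 | 0        | $N = (-1)^s \times \infty$
255 | Not zero | NaN (not a number; result of an invalid operation such as 0/0)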
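The single precision layout can be explored with Python's standard `struct` module. The helper names `decode_single` and `value_from_fields` are illustrative, not part of the slides; the subnormal case uses the IEEE-754 scale factor of 2^-126.

```python
import struct


def decode_single(x: float) -> tuple[int, int, int]:
    """Return the (s, e, f) fields of x stored as an IEEE-754 single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                  # 1 sign bit
    e = (bits >> 23) & 0xFF         # 8 exponent bits (biased by 127)
    f = bits & 0x7FFFFF             # 23 fraction bits
    return s, e, f


def value_from_fields(s: int, e: int, f: int) -> float:
    """Rebuild N from the fields, including the special values."""
    sign = -1.0 if s else 1.0
    if e == 255:                    # infinity or NaN
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                      # zero or subnormal: 0.f * 2^-126
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)
```

For example, `decode_single(-0.15625)` splits the number 1.01 × 2^-3 (binary) into its sign, biased exponent, and fraction fields, and `value_from_fields` reverses the split.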


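Floating Point Operations

Addition

$1.010 \times 2^{-3} + 1.010 \times 2^{-4} = 10100 \times 2^{-7} + 1010 \times 2^{-7}$

      10100
    + 01010
    -------
      11110

$11110 \times 2^{-7} = 1.1110 \times 2^{-3}$

Multiplication

$1.010 \times 2^{-3} \times 1.010 \times 2^{-4} = 10100 \times 2^{-7} \times 1010 \times 2^{-7}$

         10100
       × 01010
    ----------
      11001000

$11001000 \times 2^{-7} \times 2^{-7} = 1.1001000 \times 2^{-7} \rightarrow 1.100 \times 2^{-7}$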
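The two worked examples can be verified exactly with Python's `fractions` module. The helper `binary_fixed` is a hypothetical name introduced only for this check.

```python
from fractions import Fraction


def binary_fixed(text: str) -> Fraction:
    """Parse a binary string such as '1.010' into an exact fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole or "0", 2))
    if frac:
        value += Fraction(int(frac, 2), 2 ** len(frac))
    return value


x = binary_fixed("1.010") * Fraction(1, 2**3)   # 1.010 x 2^-3
y = binary_fixed("1.010") * Fraction(1, 2**4)   # 1.010 x 2^-4

total = x + y        # exact sum, before any rounding
product = x * y      # exact product, before rounding to 1.100 x 2^-7
```

Comparing `total` with `binary_fixed("1.1110") * Fraction(1, 2**3)` and `product` with `binary_fixed("1.1001") * Fraction(1, 2**7)` confirms the unrounded results.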


Floating Point Multiplication

$x = (-1)^{s_1} \times 1.f_1 \times 2^{e_1 - 127}$

$y = (-1)^{s_2} \times 1.f_2 \times 2^{e_2 - 127}$

$x \times y = (-1)^{s_1 + s_2} \times [1.f_1 \times 1.f_2] \times 2^{(e_1 + e_2 - 127) - 127} = (-1)^s \times [1.f_1 \times 1.f_2] \times 2^{e - 127}$

Multiply unsigned numbers

$P_{n-1} P_{n-2} P_{n-3} \ldots P_0\, a_{n-1} a_{n-2} a_{n-3} \ldots a_0 \leftarrow 1.f_1 \times 1.f_2$

Rounding algorithm

If $P_{n-1} = 0$: assemble $X_{n-1} X_{n-2} \ldots X_1 X_0\, r$ from $P_{n-2} P_{n-3} \ldots P_0\, a_{n-1}\, a_{n-2}$

If $P_{n-1} = 1$: assemble $X_{n-1} X_{n-2} \ldots X_1 X_0\, r$ from $P_{n-1} P_{n-2} \ldots P_1 P_0\, a_{n-1}$ and set $e \leftarrow e + 1$

Result: $X_{n-1} . X_{n-2} \ldots X_0 + (2^{-(n-1)} \times r)$, where $r$ is the rounding bit


Floating Point Addition - 1

$x_1 = (-1)^{s_1} \times 1.f_1 \times 2^{e_1 - 127}$
$x_2 = (-1)^{s_2} \times 1.f_2 \times 2^{e_2 - 127}$
$X = x_1 + x_2 = (-1)^s \times 1.f \times 2^{e - 127}$

1a. If $e_2 > e_1$ then $x_1 \leftrightarrow x_2$
1b. $d = e_1 - e_2$
1c. $e = e_1$
1d. $w = 0$
2. If $s_1 \ne s_2$ then $x_2 \leftarrow (x_2)''$ (two's complement) and $w \leftarrow 1$
3a. Construct $P = P_{n-1} . P_{n-2} P_{n-3} \ldots P_0$ from $x_2 = x_{n-1} . x_{n-2} x_{n-3} \ldots x_0$
3b. Shift P right by d places, shifting in w from left
    $P_{n-1} . P_{n-2} \ldots P_{n-d}\, P_{n-1-d} \ldots P_0\; g\; r \leftarrow w . w \ldots w\; P_{n-1} \ldots P_d\; P_{d-1}\; P_{d-2}$
4. Add $P \leftarrow P + 1.f_1$


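Floating Point Addition - 2

$x_1 = (-1)^{s_1} \times 1.f_1 \times 2^{e_1 - 127}$
$x_2 = (-1)^{s_2} \times 1.f_2 \times 2^{e_2 - 127}$
$X = x_1 + x_2 = (-1)^s \times 1.f \times 2^{e - 127}$

5. If $s_1 \ne s_2$ and $P_{n-1} = 1$ and $c_{out} = 0$ then $P \leftarrow (P)''$

6. If $s_1 = s_2$ and $c_{out} = 1$ then shift right
    $P_{n-1} P_{n-2} P_{n-3} \ldots P_1 P_0 \leftarrow 1\, P_{n-1} P_{n-2} P_{n-3} \ldots P_1$
    $r \leftarrow P_0$
    $e \leftarrow e + 1$

   Otherwise shift P left L times until $P_{n-1} = 1$
    $P = 0\, 0\, 0 \ldots 1\, P_s \ldots P_1 P_0 \Rightarrow P \leftarrow 1\, P_s P_{s-1} P_{s-2} \ldots P_0\; g\; r\; 0 \ldots 0$
    $1.f \leftarrow P$
    $e \leftarrow e - L$
    $L = 0 \Rightarrow r \leftarrow g$
    $L = 1 \Rightarrow r$ is rounding bit
    $L > 1 \Rightarrow r \leftarrow 0$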


DLX FP Pipeline

Floating Point Unit (FPU)
    ADD/SUB: 4 pipelined stages
    MULT: 7 pipelined stages
    DIV/SQRT: 24 stages, 15 non-pipelined

[Figure: DLX pipeline. Instruction Fetch (Instruction Memory), Instruction Decode, then in parallel the Integer ALU and the FPU, followed by Data Memory Access (Data Memory) and Write Back. The FPU contains the ADD/SUB unit (stages A1 to A4), the MULT unit (stages M1 to M7), and the DIV/SQRT unit.]


FP Execution

Stage sequences are listed per instruction in program order over clock cycles 1 to 17; a repeated stage letter marks a stall cycle.

Dependency stalls (with forwarding)

    LD    F4,0(R2)      F D X M W
    ADDD  F2,F4,F8      F D D A1 A2 A3 A4 M W
    MULTD F0,F2,F6      F F D D D D M1 M2 M3 M4 M5 M6 M7 M W
    SD    0(R2),F0      F F F F D X X X X X X M W

Pipelined FP ADD

    LD    F4,0(R2)      F D X M W
    ADDD  F6,F8,F10     F D A1 A2 A3 A4 M W
    ADDD  F12,F14,F16   F D A1 A2 A3 A4 M W

Pipelined FP MULT

    LD    F4,0(R2)      F D X M W
    MULTD F2,F4,F8      F D M1 M2 M3 M4 M5 M6 M7 M W
    MULTD F0,F2,F6      F D M1 M2 M3 M4 M5 M6 M7 M W

F = IF   D = ID   X = EX   M = MEM   W = WB


More Hardware → Fewer Steps → Speedup

Ripple Adder
    Adds order by order
    $c_{out}$ → next order as $c_{in}$
    n – 1 propagation delays from order to order

[Figure: ripple adder built from n full adders (FA). Stage i takes inputs $A_i$, $B_i$ and $c_{in}$ and produces $S_i$ and $c_{out}$; the first stage has $c_{in} = 0$; outputs are $S_0, S_1, \ldots, S_{n-2}, S_{n-1}$ and the final $c_{out}$.]

Look Ahead Adder
    Each stage produces

    $s_i = a_i \oplus b_i \oplus c_{i-1}$
    $c_i = a_i b_i + (a_i + b_i)\, c_{i-1} = g_i + p_i\, c_{i-1}$, with $g_i = a_i b_i$ and $p_i = a_i + b_i$

    Calculate n – 1 values for $c_{in}$ in (large) dedicated hardware ($c_{-1}$ is the carry into stage 0)

    $c_0 = g_0 + p_0 c_{-1}$
    $c_1 = g_1 + p_1 c_0 = g_1 + p_1 g_0 + p_1 p_0 c_{-1}$
    $c_2 = g_2 + p_2 c_1 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c_{-1}$
    ...
    $c_k = g_k + p_k g_{k-1} + p_k p_{k-1} g_{k-2} + p_k p_{k-1} p_{k-2} g_{k-3} + \ldots + [p_k p_{k-1} \ldots p_1]\, g_0 + [p_k p_{k-1} \ldots p_0]\, c_{-1}$
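The look-ahead carry expansion can be modelled in Python and compared against ordinary integer addition. This is a software sketch only: the name `lookahead_add` is illustrative, and real hardware evaluates all carries in parallel rather than in a loop.

```python
def lookahead_add(a: int, b: int, n: int, carry_in: int = 0) -> tuple[int, int]:
    """Add two n-bit numbers using expanded generate/propagate carries."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(a_bits, b_bits)]    # generate  g_i = a_i b_i
    p = [x | y for x, y in zip(a_bits, b_bits)]    # propagate p_i = a_i + b_i

    carries = []
    for k in range(n):
        # c_k = g_k + p_k g_{k-1} + ... + (p_k ... p_1) g_0 + (p_k ... p_0) c_in,
        # computed from g and p only, without using c_{k-1}.
        c = 0
        prefix = 1
        for j in range(k, -1, -1):
            c |= prefix & g[j]
            prefix &= p[j]
        c |= prefix & carry_in
        carries.append(c)

    sum_bits = [
        a_bits[i] ^ b_bits[i] ^ (carry_in if i == 0 else carries[i - 1])
        for i in range(n)
    ]
    total = sum(bit << i for i, bit in enumerate(sum_bits))
    return total, carries[-1]                      # n-bit sum and final carry out
```

Each carry depends only on the generate and propagate signals and the initial carry, which is what removes the n – 1 order-to-order propagation delays of the ripple adder.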


Memory and I/O Organization


Principle of Locality

Locality: a small proportion of memory accounts for most run time

Rule of thumb: for 90% of run time, the next instruction/data will come from the 10% of program/data closest to the current instruction

Amdahl's Law: make access to most local memory as fast as possible

[Figure: memory map of B bytes of instruction and data, addresses 00000000 to FFFFFFFF, with the current address A and the local window from A - 5% B to A + 5% B]

[Figure: percentage of memory accounting for 90% of run time for SPEC 92]


Memory Hierarchy

Long Term Storage → Main Memory (RAM) → Cache → Register

Register (Current Data)
    Memory location inside CPU
    Fast access to small amount of information
    Organized by CPU

Cache (Next Few Instructions and Data)
    Memory location in or near CPU
    Fast access to important data and instructions from RAM
    Copy of RAM section

Main Memory (RAM) (Running Programs and Data)
    Memory location outside CPU
    Stores "all" data and instructions of running programs
    Organized by addresses

Long Term Storage (All Files and Data)
    Memory locations outside CPU and RAM
    Stores data and instructions of "all" programs
    Organized by OS


Memory Hierarchy in RISC Workstation/Server

Level                   | Location     | Quantity in System            | Access Time     | Technology
General Registers       | CPU Internal | Typical 512 B, Max < 2 KB     | 0.1 to 0.5 ns   | CMOS latches
L1 cache (Level 1)      | CPU Internal | Typical 32 KB, Max < 64 KB    | 0.5 to 1.0 ns   | CMOS SRAM
L2 cache (Level 2)      | CPU Internal | Typical 2 MB, Max < 8 MB      | 3 to 10 ns      | CMOS SRAM
L3 cache (Level 3)      | CPU Internal | Typical 0, Max < 8 MB         | 10 to 20 ns     | CMOS DRAM
Main Memory             | External     | ~ 4 – 64 GB                   | 10 to 50 ns     | CMOS DRAM
Disk                    | External     | Typical 1 TB, Max ~ 8 TB      | 4 to 20 ms      | Magnetic
Solid State Drive (SSD) | External     | Typical 500 GB, Max ~ 60 TB   | 0.035 to 0.1 ms | Flash


CPU and Memory Hierarchy

[Figure: pipelined CPU with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, the register subsystem and the ALU. Instruction memory is served by the instruction cache and data memory by the data cache. Both caches connect through the cache controller to the L2 cache and, over the external bus and the I/O controller (chipset), to main memory (RAM) and long term storage (disk).]


CPU and Memory Hierarchy - 1

L1 (level 1 cache) holds copy of small section of main memory
    Most recently accessed addresses (memory locations)
    L1 split into physically separate Instruction Cache and Data Cache

CPU accesses L1 cache directly
    If (address in L1 cache) {access performed in 1 clock cycle}
    Else {
        L1 cache accesses cache controller
        If (address in L2 cache) {controller copies contents to L1 from L2}
        Else {controller copies location to L1 from main memory}
    }

[Figure: the CPU (ALU, registers, L1 instructions, L1 data) accesses L1 in 1 CC. L1 sends request/update traffic through the cache controller to L2, main memory, I/O and disk, where access latency >> 1 clock cycle.]


CPU and Memory Hierarchy - 2

CPU accesses disk and I/O devices by memory addressing
    Part of total address space reserved for I/O and storage devices
    Disk write of k bytes: CPU performs k stores to same I/O address
    Disk read of k bytes: CPU performs k loads from same I/O address
    I/O addresses are not copied to cache (marked NON-CACHEABLE)

Virtual Memory
    Total memory space larger than physical memory
    Divided into pages of specific size
    Infrequently used pages moved to "page file" on disk
    Page properties and location specified in page tables
    Virtual address divided into
        Page address: points to page table entry
        Offset: points to byte address in page

[Figure: CPU with L1 instructions and L1 data, cache controller, L2, main memory, I/O and disk]


Memory Organization - 1

n-bit address space

Physical Address = $A_{n-1} A_{n-2} \ldots A_1 A_0$

Can form $2^n$ addresses, from 0 to $(2^n - 1)$

Every byte in RAM has an n-bit address

Processor refers to memory locations by physical RAM addresses

Processor stores memory addresses in n-bit address registers

[Figure: n-bit register in the CPU holding memory addresses; memory drawn as a column of data bytes, one per address, from 00000…000 up to 11111…111]


Memory Organization - 2

Copy memory data to cache
    How much data to copy?
    Which data is in cache?
    How to find addresses in cache?

Copy DATA BLOCK to cache
    Block = B bytes
    Page size (virtual memory) = integer × block size

Sets and Slots
    Cache = S sets
    Set = W slots
    Slot = 1 data block

Copy memory block to:
    Deterministic set
    Any slot in set

[Figure: S sets in cache, numbered 0, 1, …, S – 1; W slots in set 0]


Memory Organization - 3

Address space has n-bit physical address

$N = 2^n$ byte address space

Address space divided (logically) into address BLOCKS (lines)

Block size $B = 2^b$ bytes / block

Blocks in address space = $N / B = 2^{n-b}$

(n-b)-bit Block Number = Int (Address / B)

b-bit byte offset = Address % B = 0, 1, … , B – 1, where $B - 1 = 2^b - 1$

n-bit Physical Address layout:

    | Block Number (n - b bits) | Byte Offset (b bits) |


Memory Organization - 4

Cache organized into $S = 2^s$ Sets

Set Index = 0, 1, … , S – 1, where $S - 1 = 2^s - 1$

Address Block in cache must be copied into specific Set

Set Index (location of block in cache) = Block Number % S

n-bit Physical Address layout:

    | Tag ([n - (s + b)] bits) | Set Index (s bits) | Byte Offset (b bits) |

Tag and Set Index together form the (n - b)-bit Block Number


Memory Organization - 5

W-way associative organization
    Each set contains $W = 2^w$ slots
    One block copied to one slot
    Blocks copied to slots in any convenient order
    TAG written near block content to identify which block is in cache

[n – (s + b)]-bit tag = Int (Block Number / S)

n-bit Physical Address layout:

    | Tag ([n - (s + b)] bits) | Set Index (s bits) | Byte Offset (b bits) |

[Figure: Set 0 (W slots of set 0), Set 1, ..., Set S-1]

Total Cache Size

$\text{total cache size} = S\, \frac{\text{sets}}{\text{cache}} \times W\, \frac{\text{blocks}}{\text{set}} \times B\, \frac{\text{bytes}}{\text{block}} = S \times W \times B\, \frac{\text{bytes}}{\text{cache}}$


Example of Memory Organization - 1

n = 32-bit address
$N = 2^{32}$ bytes = 4,294,967,296 bytes = 4 GB
$B = 16 = 2^4$ bytes per block
b = 4-bit block offset

32-bit Physical Address layout:

    | Block Number (28 bits) | Byte Offset (4 bits) |

Block       | Byte Address
0           | 0, 1, 2, …, 15
1           | 16, 17, 18, …, 31
2           | 32, 33, 34, 35, …
…           | …
268,435,455 | 4,294,967,280, …, 4,294,967,295

$N / B = 2^{32-4}$ blocks = 268,435,456 blocks = 256 Mblocks


Example of Memory Organization - 2

n = 32-bit address
$N = 2^{32}$ bytes = 4,294,967,296 bytes = 4 GB
$B = 2^4$ bytes per block
N / B = 268,435,456 blocks = 256 Mblocks

$S = 256 = 2^8$ sets in cache
s = 8-bit set index = Block Number % 256

32-bit Physical Address layout:

    | Tag (20 bits) | Set Index (8 bits) | Byte Offset (4 bits) |

Tag and Set Index together form the 28-bit Block Number

Set | Blocks that can be Assigned to Set
0   | 0, 256, 512, 768, 1024, 1280, …, 268,435,200
1   | 1, 257, 513, 769, 1025, 1281, …
2   | 2, 258, 514, 770, 1026, 1282, …
…   | …
255 | 255, 511, 767, 1023, 1279, 1535, …, 268,435,455


Example of Memory Organization - 3

n = 32-bit address
$N = 2^{32}$ bytes = 4,294,967,296 bytes = 4 GB
$B = 2^4$ bytes per block

$S = 256 = 2^8$ sets in cache
s = 8-bit set index = Block Number % 256
$W = 4 = 2^2$ way associative
Tag = 32 – (4 + 8) = 20 bits

32-bit Physical Address layout:

    | Tag (20 bits) | Set Index (8 bits) | Byte Offset (4 bits) |

Set | Possible Cache Content (Block Numbers) | Tags
0   | 1024, 8192, 0, 256                     | 4, 32, 0, 1
1   | 257, 1281, 513, 1025                   | 1, 5, 2, 4
2   | …, 258, 514                            | …, 1, 2

Address $20509_{10}$ = 0x0000501D

    Tag = 00000000000000000101 (5) | Set Index = 00000001 (1) | Byte Offset = 1101 (13)

    Block Number = $0000000000000000010100000001_2 = 1281_{10}$

Total Cache Size = 256 sets/cache × 4 blocks/set × 16 bytes/block = 16 KB
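The address split in this example can be reproduced with a few shifts and masks. The sketch below assumes the example's parameters (b = 4 offset bits, s = 8 set-index bits); the function name `split_address` is illustrative.

```python
def split_address(address: int, b: int = 4, s: int = 8) -> tuple[int, int, int, int]:
    """Return (tag, set_index, byte_offset, block_number) for an address."""
    byte_offset = address & ((1 << b) - 1)       # Address % B
    block_number = address >> b                  # Int(Address / B)
    set_index = block_number & ((1 << s) - 1)    # Block Number % S
    tag = block_number >> s                      # Int(Block Number / S)
    return tag, set_index, byte_offset, block_number


# Total cache size for S = 256 sets, W = 4 ways, B = 16 bytes per block.
total_cache_bytes = 256 * 4 * 16
```

Calling `split_address(0x0000501D)` yields the tag, set index, byte offset, and block number of the example address.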


Cache Definitions and Policies

Cache hit
    CPU finds block in cache

Cache miss
    CPU needs block not in cache
    Cache loads block on cache read miss

Write allocate
    Cache loads block on cache write miss

No write allocate
    Write to RAM without loading block on cache write miss

Swapping out cache block
    Need a new block in a full set
    Remove block that is LEAST RECENTLY USED (LRU)
    WRITE BACK: update RAM when block is swapped out
    WRITE THROUGH: update RAM on every write to cache
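Hit, miss, and LRU replacement can be illustrated with a small simulator. This is a sketch under stated assumptions: the class `SetAssociativeCache` is hypothetical, it tracks only which blocks are present (no data, no write policy), and it uses `collections.OrderedDict` to keep each set in least-recently-used order.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Track block residency in a W-way set associative cache with LRU."""

    def __init__(self, num_sets: int = 256, ways: int = 4, block_bytes: int = 16):
        self.num_sets = num_sets
        self.ways = ways
        self.block_bytes = block_bytes
        # One ordered mapping of tags per set; oldest entry is the LRU block.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Access one address; return True on a hit, False on a miss."""
        block = address // self.block_bytes
        index = block % self.num_sets            # deterministic set
        tag = block // self.num_sets
        slots = self.sets[index]
        if tag in slots:
            slots.move_to_end(tag)               # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(slots) >= self.ways:              # full set: swap out LRU block
            slots.popitem(last=False)
        slots[tag] = None                        # load the missing block
        return False
```

For example, with `SetAssociativeCache(num_sets=2, ways=2, block_bytes=16)`, addresses 0, 32 and 64 all map to set 0, so accessing a third distinct block forces the least recently used one out.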


Cache Performance Issues

L1 Cache hit
    CPU reads/writes memory in 1 clock cycle

L1 Cache Miss
    CPU stalls while L1 loads missing block

Miss Rate
    Cache misses per cache accesses (instruction fetch, load, store)
    Depends on cache size and organization

Miss Penalty
    Number of stall cycles while cache loads missing block
    Depends on cache hardware technology and organization

For 1 level unified (not split) cache

$CPI_{stall} = \frac{\text{stall cycles}}{\text{stall}} \times \frac{\text{stalls}}{\text{instruction type}} \times \frac{\text{instruction types}}{\text{instruction}}$

$= \text{miss penalty} \times \text{miss rate} \times \frac{IC + IC_{load} + IC_{store}}{IC}$


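Performance of 1-Level Split Cache

Typical values
    Instruction Miss Penalty = Data Miss Penalty = 50 cycles per stall
    Instruction Miss Rate = 0.5%
    Data Miss Rate = 5.0%
    $IC_{load} / IC$ = 25%
    $IC_{store} / IC$ = 15%

$CPI_{stall} = \frac{\text{instruction stall cycles}}{\text{instruction stall}} \times \frac{\text{instruction stalls}}{\text{instruction access}} \times \frac{\text{instruction accesses}}{\text{instruction}} + \frac{\text{data stall cycles}}{\text{data stall}} \times \frac{\text{data stalls}}{\text{data access}} \times \frac{\text{data accesses}}{\text{instruction}}$

$= \text{instruction miss penalty} \times \text{instruction miss rate} \times \frac{1\ \text{instruction access}}{\text{instruction}} + \text{data miss penalty} \times \text{data miss rate} \times \frac{IC_{load} + IC_{store}}{IC}$

$CPI_{stall} = 50 \times 0.005 \times 1 + 50 \times 0.05 \times 0.40 = 0.25 + 1.00 = 1.25$ stall cycles per instruction ⇒ performance degradation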
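The stall CPI for a one-level split cache is a two-term product sum, which a few lines of Python can evaluate. The function and parameter names below are illustrative; the numbers are the typical values listed on this slide.

```python
def stall_cpi_split_l1(
    instr_miss_penalty: float,
    instr_miss_rate: float,
    data_miss_penalty: float,
    data_miss_rate: float,
    data_accesses_per_instr: float,
) -> float:
    """Stall cycles per instruction for a 1-level split cache."""
    # Every instruction makes exactly one instruction access (the fetch).
    instr_term = instr_miss_penalty * instr_miss_rate * 1.0
    # Only loads and stores make a data access.
    data_term = data_miss_penalty * data_miss_rate * data_accesses_per_instr
    return instr_term + data_term


# Typical values: 50-cycle penalties, 0.5% / 5.0% miss rates,
# loads 25% + stores 15% = 0.40 data accesses per instruction.
cpi_stall = stall_cpi_split_l1(50, 0.005, 50, 0.05, 0.25 + 0.15)
```

The instruction term contributes 0.25 and the data term 1.00 stall cycles per instruction, so data misses dominate even though only 40% of instructions access data.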


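Two-Level Unified Cache - Definitions

Miss Penalties

$P_{L1}$ = Miss Penalty (cycles) from L1 to L2 = $\frac{\text{stall cycles}}{\text{(L1 miss, L2 hit)}}$

$P_{L2}$ = Miss Penalty (cycles) from L2 to Main Memory

$P_{L1} + P_{L2} = \frac{\text{stall cycles}}{\text{(L1 miss, L2 miss)}}$

$IC_A$ = Data Access Instructions (Load/Store)

$IC$ = Total Instructions

Miss Rates

$M_{L1}$ = Miss Rate at L1 = $\frac{\text{L1 miss}}{\text{L1 access}}$

$1 - M_{L1}$ = Hit Rate at L1 = $\frac{\text{L1 hit}}{\text{L1 access}}$

$M_{L2}$ = Miss Rate at L2 = $\frac{\text{L2 miss}}{\text{L2 access}}$

$1 - M_{L2}$ = Hit Rate at L2 = $\frac{\text{L2 hit}}{\text{L2 access}}$

[Figure: CPU → Unified L1 cache ($M_{L1}$) → Unified L2 cache ($M_{L2}$) → Main Memory, with penalty $P_{L1}$ between L1 and L2 and $P_{L2}$ between L2 and main memory]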


Performance of Two-Level Unified Cache

$CPI_{stall} = \frac{\text{stall cycles}}{IC} = \sum_{n\, \in\, \text{stall types}} \frac{\text{stall cycles}}{\text{stall type}_n} \times \frac{\text{stalls of type}_n}{IC}$

$\text{stall types} = \{ \text{(L1 miss, L2 hit)}_i,\ \text{(L1 miss, L2 miss)}_i \}, \quad i$ = instruction, data

CPU Memory Access
    L1 Hit: No Penalty
    L1 Miss: L2 Access, L1 Penalty = L2 Access Time
        L2 Hit
        L2 Miss: L2 Penalty = Main RAM Access Time

[Figure: CPU → Unified L1 cache ($M_{L1}$) → Unified L2 cache ($M_{L2}$) → Main Memory, with penalties $P_{L1}$ and $P_{L2}$]


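Stalls in Two-Level Unified Cache

$CPI_{stall} = \sum_{i\, \in\, \{\text{data, instr}\}} \left[ \frac{\text{stall cycles}}{\text{(L1 miss, L2 hit)}} \times \frac{\text{(L1 miss, L2 hit)}_i}{IC} + \frac{\text{stall cycles}}{\text{(L1 miss, L2 miss)}} \times \frac{\text{(L1 miss, L2 miss)}_i}{IC} \right]$

$= \sum_{i\, \in\, \{\text{data, instr}\}} \left[ \frac{\text{stall cycles}}{\text{(L1 miss, L2 hit)}} \times \frac{\text{(L1 miss, L2 hit)}_i}{\text{L1 access}_i} \times \frac{\text{L1 access}_i}{IC} + \frac{\text{stall cycles}}{\text{(L1 miss, L2 miss)}} \times \frac{\text{(L1 miss, L2 miss)}_i}{\text{L1 access}_i} \times \frac{\text{L1 access}_i}{IC} \right]$

[Figure: CPU → Unified L1 cache ($M_{L1}$) → Unified L2 cache ($M_{L2}$) → Main Memory, with penalties $P_{L1}$ and $P_{L2}$]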


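Assume L1 and L2 Are Statistically Independent

$\frac{\text{(L1 miss, L2 hit)}_i}{\text{L1 access}_i} = \frac{\text{(L1 miss followed by L2 hit)}_i}{\text{L1 access}_i} = \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 access}_i}{\text{L1 miss}_i} \times \frac{\text{L2 hit}_i}{\text{L2 access}_i} = \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times 1 \times \frac{\text{L2 hit}_i}{\text{L2 access}_i}$

$\frac{\text{(L1 miss, L2 miss)}_i}{\text{L1 access}_i} = \frac{\text{(L1 miss followed by L2 miss)}_i}{\text{L1 access}_i} = \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 access}_i}{\text{L1 miss}_i} \times \frac{\text{L2 miss}_i}{\text{L2 access}_i} = \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times 1 \times \frac{\text{L2 miss}_i}{\text{L2 access}_i}$

$CPI_{stall} = \sum_{i\, \in\, \{\text{data, instr}\}} \left[ \frac{\text{stall cycles}}{\text{(L1 miss, L2 hit)}} \times \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 hit}_i}{\text{L2 access}_i} \times \frac{\text{L1 access}_i}{IC} + \frac{\text{stall cycles}}{\text{(L1 miss, L2 miss)}} \times \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 miss}_i}{\text{L2 access}_i} \times \frac{\text{L1 access}_i}{IC} \right]$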


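Simplifying

$CPI_{stall} = \sum_{i\, \in\, \{\text{data, instr}\}} \left[ \frac{\text{stall cycles}}{\text{(L1 miss, L2 hit)}} \times \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 hit}_i}{\text{L2 access}_i} \times \frac{\text{L1 access}_i}{IC} + \frac{\text{stall cycles}}{\text{(L1 miss, L2 miss)}} \times \frac{\text{L1 miss}_i}{\text{L1 access}_i} \times \frac{\text{L2 miss}_i}{\text{L2 access}_i} \times \frac{\text{L1 access}_i}{IC} \right]$

$= P_{L1} \times M_{L1} \times (1 - M_{L2}) \times \frac{IC + IC_A}{IC} + (P_{L1} + P_{L2}) \times M_{L1} \times M_{L2} \times \frac{IC + IC_A}{IC}$

$= \left[ 1 + \frac{IC_A}{IC} \right] \times M_{L1} \times \left[ P_{L1} \times (1 - M_{L2}) + (P_{L1} + P_{L2}) \times M_{L2} \right]$

$= \left[ 1 + \frac{IC_A}{IC} \right] \times M_{L1} \times \left[ P_{L1} - P_{L1} \times M_{L2} + P_{L1} \times M_{L2} + P_{L2} \times M_{L2} \right]$

$= \left[ 1 + \frac{IC_A}{IC} \right] \times M_{L1} \times \left[ P_{L1} + P_{L2} \times M_{L2} \right]$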


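Split L1 Cache with Unified L2 Cache

Unified L1:

$\frac{\text{misses at L1}}{IC} = \frac{\text{misses at L1}}{\text{accesses at L1}} \times \frac{\text{accesses at L1}}{IC} = M_{L1} \times \frac{IC + IC_A}{IC} = \left[ 1 + \frac{IC_A}{IC} \right] \times M_{L1}$

Split L1:

$\sum_{i\, \in\, \{\text{data, instructions}\}} \frac{\text{misses at L1}_i}{\text{accesses at L1}_i} \times \frac{\text{accesses at L1}_i}{IC} = \frac{\text{instruction misses at L1}}{\text{instruction accesses at L1}} \times \frac{IC}{IC} + \frac{\text{data misses at L1}}{\text{data accesses at L1}} \times \frac{IC_A}{IC} = M^{I}_{L1} + \frac{IC_A}{IC} \times M^{D}_{L1}$

$CPI^{stall}_{unified} = \left[ 1 + \frac{IC_A}{IC} \right] \times M_{L1} \times \left[ P_{L1} + P_{L2} \times M_{L2} \right]$

$CPI^{stall}_{split} = \left[ M^{I}_{L1} + \frac{IC_A}{IC} \times M^{D}_{L1} \right] \times \left[ P_{L1} + P_{L2} \times M_{L2} \right]$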


Second Layer Cache (L2) with Split L1

Definitions
    $P_{L1}$ = Miss Penalty (cycles) at L1
    $P_{L2}$ = Miss Penalty (cycles) at L2
    $IC_A$ = Data Access Instructions (Load/Store)
    $IC$ = Total Instructions
    $M^{I}_{L1}$ = Instruction Miss Rate at L1
    $M^{D}_{L1}$ = Data Miss Rate at L1
    $M_{L2}$ = Miss Rate at L2

One layer (L1) of split cache
    Miss Penalty at L1 (to main memory) $P_{L1}$ ~ 50 cycles

    $CPI^{stall}_{1-level} = \left[ M^{I}_{L1} + \frac{IC_A}{IC} \times M^{D}_{L1} \right] \times P_{L1}$

Split L1 cache and unified L2 cache
    Miss Penalty at L1 (to L2) $P_{L1}$ ~ 5 cycles
    Miss Penalty at L2 (to main memory) $P_{L2}$ ~ 45 cycles
    Miss Rate at L2 ~ 1%

    $CPI^{stall}_{2-level} = \left[ M^{I}_{L1} + \frac{IC_A}{IC} \times M^{D}_{L1} \right] \times \left( P_{L1} + P_{L2} \times M_{L2} \right)$

    $P_{L1} + P_{L2} \times M_{L2} = 5 + 45 \times 0.01 = 5.45$

[Figure: CPU → Split L1 cache (instruction cache, data cache) → Unified L2 cache → Main Memory, with penalty $P_{L1}$ between L1 and L2 and $P_{L2}$ between L2 and main memory]
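The effect of adding L2 can be estimated numerically from the two stall formulas on this slide. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the L1 miss rates (0.5% instruction, 5% data) and the 0.40 data accesses per instruction are borrowed from the earlier one-level split cache example rather than given on this slide.

```python
def stall_cpi_one_level(m_instr: float, m_data: float,
                        data_access_fraction: float, p_l1: float) -> float:
    """CPI_stall for split L1 only: [M_I + (IC_A/IC) * M_D] * P_L1."""
    return (m_instr + data_access_fraction * m_data) * p_l1


def stall_cpi_two_level(m_instr: float, m_data: float,
                        data_access_fraction: float,
                        p_l1: float, p_l2: float, m_l2: float) -> float:
    """CPI_stall for split L1 + unified L2:
    [M_I + (IC_A/IC) * M_D] * (P_L1 + P_L2 * M_L2)."""
    return (m_instr + data_access_fraction * m_data) * (p_l1 + p_l2 * m_l2)


# Assumed L1 miss rates and access mix (taken from the earlier example).
one_level = stall_cpi_one_level(0.005, 0.05, 0.40, p_l1=50)
two_level = stall_cpi_two_level(0.005, 0.05, 0.40, p_l1=5, p_l2=45, m_l2=0.01)
```

Under these assumptions the effective miss penalty drops from 50 cycles to 5.45 cycles, and the stall CPI shrinks by the same ratio.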


Issues Affecting Miss Rate

Compulsory miss
    Block not copied to cache until first access to byte in block
    First access to block always misses in cache
    Compulsory miss not affected by cache properties

Capacity miss
    Cache is smaller than main memory
    Some blocks removed from cache to make room for required blocks
    Capacity miss rate lower for larger cache size

Conflict miss
    Block must be copied to specific set
    Some blocks removed from set to make room for required blocks
        Not caused by overall capacity miss
        For example, misses caused by address aliasing
    Conflict miss rate lower when block placement more flexible

Cache Miss Statistics — 1

Miss rate components are given as absolute rate and relative percent (sum = 100% of total miss rate)

Cache size  Associativity  Total miss rate  Compulsory     Capacity       Conflict
1 KB        1-way          0.133            0.002   1%     0.080  60%     0.052  39%
1 KB        2-way          0.105            0.002   2%     0.080  76%     0.023  22%
1 KB        4-way          0.095            0.002   2%     0.080  84%     0.013  14%
1 KB        8-way          0.087            0.002   2%     0.080  92%     0.005   6%
2 KB        1-way          0.098            0.002   2%     0.044  45%     0.052  53%
2 KB        2-way          0.076            0.002   2%     0.044  58%     0.030  39%
2 KB        4-way          0.064            0.002   3%     0.044  69%     0.018  28%
2 KB        8-way          0.054            0.002   4%     0.044  82%     0.008  14%
4 KB        1-way          0.072            0.002   3%     0.031  43%     0.039  54%
4 KB        2-way          0.057            0.002   3%     0.031  55%     0.024  42%
4 KB        4-way          0.049            0.002   4%     0.031  64%     0.016  32%
4 KB        8-way          0.039            0.002   5%     0.031  80%     0.006  15%
8 KB        1-way          0.046            0.002   4%     0.023  51%     0.021  45%
8 KB        2-way          0.038            0.002   5%     0.023  61%     0.013  34%
8 KB        4-way          0.035            0.002   5%     0.023  66%     0.010  28%
8 KB        8-way          0.029            0.002   6%     0.023  79%     0.004  15%

Hennessy and Patterson, figure 5.9

Cache Miss Statistics — 2

Miss rate components are given as absolute rate and relative percent (sum = 100% of total miss rate)

Cache size  Associativity  Total miss rate  Compulsory     Capacity       Conflict
16 KB       1-way          0.029            0.002   7%     0.015  52%     0.012  42%
16 KB       2-way          0.022            0.002   9%     0.015  68%     0.005  23%
16 KB       4-way          0.020            0.002  10%     0.015  74%     0.003  17%
16 KB       8-way          0.018            0.002  10%     0.015  80%     0.002   9%
32 KB       1-way          0.020            0.002  10%     0.010  52%     0.008  38%
32 KB       2-way          0.014            0.002  14%     0.010  74%     0.002  12%
32 KB       4-way          0.013            0.002  15%     0.010  79%     0.001   6%
32 KB       8-way          0.013            0.002  15%     0.010  81%     0.001   4%
64 KB       1-way          0.014            0.002  14%     0.007  50%     0.005  36%
64 KB       2-way          0.010            0.002  20%     0.007  70%     0.001  10%
64 KB       4-way          0.009            0.002  21%     0.007  75%     0.000   3%
64 KB       8-way          0.009            0.002  22%     0.007  78%     0.000   0%
128 KB      1-way          0.010            0.002  20%     0.004  40%     0.004  40%
128 KB      2-way          0.007            0.002  29%     0.004  58%     0.001  14%
128 KB      4-way          0.006            0.002  31%     0.004  61%     0.001   8%
128 KB      8-way          0.006            0.002  31%     0.004  62%     0.000   7%

Total Miss Rate

Total miss rate drops as capacity or associativity increases

[Figure: total miss rate (0 to 0.14) versus cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way and 8-way associativity, with capacity misses shown for comparison]

Conflict Miss Rate

Conflict miss rate drops as associativity increases

[Figure: conflict miss rate (0 to 0.060) versus cache size (1 KB to 128 KB) for 1-way, 2-way, 4-way and 8-way associativity]

Associativity Trade-Off

Block can be anywhere in set
  Finding (or missing) block in set requires searching every tag in set
  Larger associativity ⇒ more blocks per set ⇒ longer search time

n-bit Physical Address
  Field    Tag            Set Index    Byte Offset
  Bits     n - (s + b)    s            b
  Tag and Set Index together form the (n - b)-bit Block Number

For fixed cache capacity = S × W × B
  Larger associativity W
    ⇒ fewer sets S ⇒ smaller set index s
    ⇒ larger tag size n - (s + b)
    ⇒ longer tag search

Small advantage beyond 4-way associativity
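The address layout on this slide (tag, set index, byte offset) can be sketched as a field-extraction routine. The struct and function names are hypothetical, and a 32-bit physical address with power-of-two S and B is assumed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* upper n - (s + b) bits */
    uint32_t set;     /* middle s bits          */
    uint32_t offset;  /* lower b bits           */
} cache_fields;

/* Return log2(count); count must be a power of two. */
static unsigned log2_exact(uint32_t count)
{
    unsigned bits = 0;
    assert(count != 0 && (count & (count - 1u)) == 0);
    while ((count >> bits) != 1u)
        bits++;
    return bits;
}

/* Split a 32-bit address for a cache with `sets` sets (S)
 * and `block_bytes` bytes per block (B). */
cache_fields split_address(uint32_t addr, uint32_t sets, uint32_t block_bytes)
{
    unsigned b = log2_exact(block_bytes);
    unsigned s = log2_exact(sets);
    cache_fields f;

    assert(b + s < 32);
    f.offset = addr & (block_bytes - 1u);
    f.set    = (addr >> b) & (sets - 1u);
    f.tag    = addr >> (b + s);
    return f;
}
```

For the same capacity, halving S (doubling W) moves one bit from the set index into the tag: address 0x02001234 has tag 0x2001 with S = 256, B = 16, and tag 0x4002 with S = 128, B = 16.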

Miss Example — Extreme Locality

Program
  int i, a;
  for (i = 0; i < 4096; i++) {
      a = i;
  }

Compiler assignments
  Register ← i
  Memory ← a

Memory accesses
  4096 write accesses to integer a
    1 compulsory cache miss (write allocate) on i = 0
    4095 cache hits on i > 0
  0 read accesses to a

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{1}{4096} = 2.44 \times 10^{-4}$$

Miss Example — Extreme Non‐Locality

Block size
  16 bytes/block = 4 words/block = 4 array elements

Program
  int i, a[4096];
  for (i = 0; i < 4096; i++) {
      a[i] = i;
  }

Compiler assignments
  Register ← i
  Memory ← a[]

Memory accesses
  4096 write accesses to integer array a[]
    Compulsory cache miss every 4 array elements
    4096/4 = 1024 cache misses (write allocate)
  0 read accesses to a[]

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{1024}{4096} = 0.25$$

Miss Example —Good Locality

Program
  int a[4096], i, j;
  for (i = 0; i < 10; i++) {
      for (j = 0; j < 4096; j++) {
          a[j] = i + j;
      }
  }

Compiler assignments
  Register ← i, j
  Memory ← a[]

Cache parameters
  Capacity = B × S × W = 16 KB = 4 Kwords
  W = 4
  S = 256
  B = 16

Array size
  4K integers = 16 Kbytes = 1024 blocks

Block placement (block number held in each of the 4 slots)
  Set    Slot 3    Slot 2    Slot 1    Slot 0
  0      768       512       256       0
  1      769       513       257       1
  2      770       514       258       2
  ...    ...       ...       ...       ...
  255    1023      767       511       255

Memory accesses
  40960 write accesses to integer array a[]
    i = 0: Compulsory cache miss every 4 elements
    i > 0: Entire array in cache ⇒ cache hits

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{1024}{40960} = 0.025$$

Miss Example — Extreme Non‐Locality

Program
  int a[5120], i, j;
  for (i = 0; i < 10; i++) {
      for (j = 0; j < 5120; j++) {
          a[j] = i + j;
      }
  }

Compiler assignments
  Register ← i, j
  Memory ← a[]

Cache parameters
  Capacity = B × S × W = 16 KB = 4 Kwords
  W = 4
  S = 256
  B = 16

Memory accesses
  51200 write accesses to integer array a[]
    Compulsory cache miss every 4 elements
    LRU: next block needed is never in cache

[Figure: contents of set 0 for i = 0, i = 1, ...; blocks 0, 256, 512, 768 and 1024 compete for the 4 slots, and each newly accessed block (bold = miss) replaces the block that is needed next]

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{12800}{51200} = 0.25$$
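The two sequential-sweep results (miss rate 0.025 for a 4096-element array, 0.25 for a 5120-element array) can be reproduced with a small LRU simulation. This is a sketch under stated assumptions: the array starts at byte address 0, integers are 4 bytes, and the function name is hypothetical. The cache parameters W = 4, S = 256, B = 16 are those of the slides.

```c
#include <assert.h>
#include <string.h>

#define WAYS        4    /* W: slots per set          */
#define SETS        256  /* S: number of sets         */
#define BLOCK_BYTES 16   /* B: bytes per block        */
#define WORD_BYTES  4    /* assumed size of one int   */

/* Count misses for `passes` sequential write sweeps over an
 * array of `elements` integers, using LRU replacement. */
long count_lru_misses(long elements, int passes)
{
    static long tag[SETS][WAYS];       /* block number held in each slot     */
    static long last_use[SETS][WAYS];  /* time of last access, 0 means empty */
    long clock = 0, misses = 0;

    assert(elements > 0 && passes > 0);
    memset(tag, 0, sizeof tag);
    memset(last_use, 0, sizeof last_use);

    for (int pass = 0; pass < passes; pass++) {
        for (long j = 0; j < elements; j++) {
            long block = (j * WORD_BYTES) / BLOCK_BYTES;
            int set = (int)(block % SETS);
            int victim = 0, hit = 0;

            clock++;
            for (int w = 0; w < WAYS; w++) {
                if (last_use[set][w] != 0 && tag[set][w] == block) {
                    last_use[set][w] = clock;   /* hit: refresh LRU time */
                    hit = 1;
                    break;
                }
                if (last_use[set][w] < last_use[set][victim])
                    victim = w;                 /* oldest (or empty) slot */
            }
            if (!hit) {
                misses++;                       /* write allocate on miss */
                tag[set][victim] = block;
                last_use[set][victim] = clock;
            }
        }
    }
    return misses;
}
```

`count_lru_misses(4096, 10)` returns 1024 and `count_lru_misses(5120, 10)` returns 12800, matching the miss counts in the two examples: once five blocks compete cyclically for four slots, LRU always evicts the block that will be needed next.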

Miss Example — Better Locality Using MRU

Program
  int a[5120], i, j;
  for (i = 0; i < 10; i++) {
      for (j = 0; j < 5120; j++) {
          a[j] = i + j;
      }
  }

Compiler assignments
  Register ← i, j
  Memory ← a[]

Cache parameters
  Capacity = B × S × W = 16 KB = 4 Kwords
  W = 4
  S = 256
  B = 16

Array size
  5K integers = 1280 blocks

Block placement (block number held in each of the 4 slots; slot 3 alternates between two blocks)
  Set    Slot 3         Slot 2    Slot 1    Slot 0
  0      768 / 1024     512       256       0
  1      769 / 1025     513       257       1
  2      770 / 1026     514       258       2
  ...    ...            ...       ...       ...
  255    1023 / 1279    767       511       255

Memory accesses with MRU
  51200 write accesses to integer array a[]
    i = 0: Compulsory cache miss every 4 elements
    i > 0: Conflict misses to slot 3 (2 out of 5 accesses)
      j=0…256, 1024…1279

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{1280 + 512 \times 9}{51200} = 0.025 \times \left( 1 + 9 \times \frac{2}{5} \right) = 0.115$$

Miss Example — Improved Locality

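Program
  int a[5120], i, j;
  for (i = 0; i < 10; i++) {
      for (j = 0; j < 4096; j++) {
          a[j] = i + j;
      }
  }
  for (i = 0; i < 10; i++) {
      for (j = 4096; j < 5120; j++) {
          a[j] = i + j;
      }
  }

Compiler assignments
  Register ← i, j
  Memory ← a[]

Cache parameters
  Capacity = B × S × W = 16 KB = 4 Kwords
  W = 4
  S = 256
  B = 16

Array size
  5K integers = 1280 blocks

Block placement during first loop nest (blocks 0 to 1023)
  Set    Slot 3    Slot 2    Slot 1    Slot 0
  0      768       512       256       0
  1      769       513       257       1
  2      770       514       258       2
  ...    ...       ...       ...       ...
  255    1023      767       511       255

Block placement during second loop nest (blocks 1024 to 1279 replace one slot)
  Set    Slot 3    Slot 2    Slot 1    Slot 0
  0      768       512       256       1024
  1      769       513       257       1025
  2      770       514       258       1026
  ...    ...       ...       ...       ...
  255    1023      767       511       1279

Memory accesses
  51200 write accesses to integer array a[]
    i = 0: Compulsory cache miss every 4 elements
    i > 0: Entire array section in cache ⇒ cache hits

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{1024 + 256}{40960 + 10240} = 0.025$$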

Miss Example — Address Aliasing

Program
  int a[512], b[512], c[512], i, j;
  for (i = 0; i < 20; i++) {
      for (j = 0; j < 512; j++) {
          c[j] = a[j] + b[j] + c[j];
      }
  }

Compiler assignments
  Register ← i, j
  Memory (address = 0200AS1S2B, hexadecimal)
    a: 02000000 – 020007FF
    b: 02001000 – 020017FF
    c: 02002000 – 020027FF

Cache parameters
  Capacity = B × S × W = 8 KB = 2 Kwords
  W = 2
  S = 256
  B = 16

Array size
  3 × 512 integers = 6 KB = 384 blocks

[Figure: cache with 2 slots (1 and 0) and sets 0 to FF; all three arrays map to sets 0 to 7F, with a[] and c[] sharing one slot and b[] in the other]

Memory accesses
  20 × 3 × 512 = 30720 read accesses to a[], b[], c[]
  20 × 512 = 10240 write accesses to array c[]
  Set assignment = int(0200AS1S2B/10)%100 = S1S2 (hexadecimal arithmetic)
  i = 0: 3 × 512 / 4 = 384 compulsory misses
  i > 0: 2 × 512 / 4 = 256 conflict misses per iteration for a[] and c[]

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{384 + 256 \times 19}{30720 + 10240} = \frac{5248}{40960} = 0.128$$

Miss Example — Address Aliasing with Larger W

Program
  int a[512], b[512], c[512], i, j;
  for (i = 0; i < 20; i++) {
      for (j = 0; j < 512; j++) {
          c[j] = a[j] + b[j] + c[j];
      }
  }

Compiler assignments
  Register ← i, j
  Memory (address = 0200AS1S2B, hexadecimal)
    a: 02000000 – 020007FF
    b: 02001000 – 020017FF
    c: 02002000 – 020027FF

Cache parameters
  Capacity = B × S × W = 8 KB = 2 Kwords
  W = 4
  S = 128
  B = 16

Array size
  3 × 512 integers = 6 KB = 384 blocks

[Figure: cache with 4 slots (3, 2, 1, 0) and sets 0 to 7F; a[], b[] and c[] each occupy their own slot]

Memory accesses
  20 × 3 × 512 = 30720 read accesses to a[], b[], c[]
  20 × 512 = 10240 write accesses to array c[]
  Set assignment = int(0200AS1S2B/10)%80 = S1S2 (hexadecimal arithmetic)
  i = 0: 3 × 512 / 4 = 384 compulsory misses
  i > 0: All arrays in cache ⇒ cache hits

Miss rate

$$\text{miss rate} = \frac{\text{misses}}{\text{accesses}} = \frac{384}{30720 + 10240} = \frac{384}{40960} = 0.009375$$

Workstation Layout with PCI 

ATA disk controllers
  PATA: parallel ATA
  SATA: serial ATA

[Figure: block diagram showing CPU, Host Bridge, Main Memory, switching fabric, System Controllers, I/O Controllers (Long-Term Storage, User Interface, network), I/O Controller ISA/EISA Bridge and ISA bus]

PCI Services

Boot services
  BIOS (basic input/output system)
  ROM-based software for initiating system boot
Timers
  System timers, counters and real time clocks
Interrupt controllers
  Programmable interrupt control
  IRQ: interrupt requests
  Interrupt messages
Direct Memory Access (DMA)
  Permit devices to access memory without CPU intervention

BIOS

Hardware system is started or reset
  CPU performs self-check
  CPU fetches instruction from address FFFF0h
  Address FFFF0h contains branch instruction
  Target of branch is firmware code located in PCI BIOS ROM
    ROM = Read Only Memory
    Usually E²PROM = Electrically Erasable Programmable ROM
    BIOS = Basic Input/Output System
BIOS locates keyboard, display, boot device
UEFI (Unified Extensible Firmware Interface) system
  BIOS loads UEFI
  Hardware-oriented operating system that runs above firmware
  Performs system management including boot of main OS
Non-UEFI system
  BIOS begins loading OS from boot device

Interrupt Handling

APIC
  Advanced Programmable Interrupt Controller
Local APIC
  Interrupt controller in CPU
  Local interrupt
    INTR + int_number from device
  Interrupt messages
    Structured message
I/O APIC
  Interrupt controller in PCI chipset
  Sends/receives interrupt messages
  Replaces old IRQ system (each device assigned private IRQ)
  All device interrupts defined as IRQ9
  Interrupt message describes external event

[Figure: CPU bus]

Interprocessor Interrupts

CPU can send/receive Interprocessor Interrupt (IPI)
  Used in multiprocessor (MP) systems
  Standard APIC interrupt message syntax
Generating IPI message
  CPU writes to interrupt command register (ICR) in local APIC
  Local APIC issues IPI message on system bus

DMA: Direct Memory Access

Peripheral device accesses memory directly
  No need for CPU to execute transfer instructions
  Used for large data transfers
CPU works concurrently
  Can preempt DMA for cache update

[Figure: block diagram showing CPU, Bus Adaptor, Main Memory, Switching Fabric, System Controllers (Timers, Interrupts, DMA) and I/O Controllers (Long-Term Storage, User Interface, Network)]

DMA Operation

Interrupt mode
  CPU instructions to DMA controller set up transfer
    Start address
    Number of bytes to transfer
  DMA controller
    Acts as master of data path
    Transfers data between RAM and peripheral device
    IRQ at end of transfer
  CPU takes back bus control
PCI arbitration
  PCI device
    Requests control of bus
    Requests read / write memory access
  PCI bridge
    Grants bus control

Cache Coherency and Consistency

Multiple processors share data with main memory
  Each processor copies data blocks to cache
  Differences can develop between caches and/or main memory
Example
  CPU-1 and CPU-2 read X to cache from main memory
  CPU-1 writes to X in cache
  Invalid copies of X in CPU-2 cache and main memory
Consistency
  Copies of data locations are always identical
Coherence
  Reads and writes occur in the correct order
  Easier than consistency

[Figure: CPU-1 and CPU-2, each with L1 cache and L2 cache, sharing Main Memory]

Bus Snooping

Write-back cache policy
  CPU updates private cache
    No update to main memory
  CPU evicts cache block
    Updates main memory
Bus snooping
  On CPU write
    CPU always writes destination addresses on bus (short bus cycle)
    CPU writes data to private cache (not on bus)
  Bus devices monitor all addresses written on bus
    See which CPUs are loading a cache block
    See which CPUs are writing to a cache block
Write synchronization
  Bus arbitration prevents writes to multiple copies of cache block
  Only one CPU places target address on memory bus per bus cycle
  Only one CPU can write to same cache block in one bus cycle

[Figure: CPU 0, CPU 1 and CPU 2, each with Architectural State, Execution Core and Cache, together with Main Memory and I/O Bus / PCI Bridge]

Client/Server Model of I/O

Fast CPU is client for slower I/O services
  Buffer stores client requests and forwards at slower server response rate
Latency
  Time between client request and buffer response
Throughput
  Number of services provided per unit time
Bandwidth
  Maximum data transfer rate of I/O channel (including buffer)
Capacity
  Maximum throughput of server through buffer
  Depends on bandwidth and service rate (server speed)
Utilization
  Request rate as proportion of capacity

[Figure: Client (Processor) sends requests to FIFO Buffer (Queue), which forwards them to Server (Device); responses return to the client]

Buffer Operation

Client requests
  Intermittent high speed requests (bursty)
  Peak client request rate >> Average client request rate
Service responses
  Peak client rate > Server rate > Average client rate
FIFO buffers requests in order of arrival
  Stores requests arriving at higher client request rate
  Forwards requests to server at lower server response rate
  Request forwarding rate = server response rate

[Figure: Client (Processor), FIFO Buffer (Queue) and Server (Device) with request, forward and response paths]

Buffer Latency and Overflow

Minimum latency
  Determined by maximum service response rate
Buffer overflow
  Buffer fill rate = Client Request Rate – Service Response Rate
  Buffer fills continuously for too long ⇒ buffer overflow
Example
  Peak CPU disk read rate = 1 read/cycle = 10^9 read requests/second
  Disk can provide 10^7 responses per second = 100 CPU cycles/read
  CPU sees minimum latency of about 100 CPU cycles
  Buffer can hold 1000 requests
  Continuous requests ⇒ overflow in 1000/(10^9 – 10^7) ~ 10^-6 seconds

[Figure: Client (Processor), FIFO Buffer (Queue) and Server (Device) with request, forward and response paths]
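The overflow estimate in the example follows directly from the buffer fill rate. A minimal sketch, with a hypothetical function name, that reproduces the arithmetic:

```c
#include <assert.h>

/* Time (seconds) until a buffer holding `capacity` requests overflows
 * when requests arrive continuously faster than they are served.
 * Fill rate = request rate - service rate (requests per second). */
double seconds_to_overflow(double capacity,
                           double request_rate, double service_rate)
{
    assert(capacity > 0.0);
    assert(request_rate > service_rate);  /* otherwise the buffer never fills */
    return capacity / (request_rate - service_rate);
}
```

With the slide's values, `seconds_to_overflow(1000.0, 1e9, 1e7)` is about 1.01 × 10^-6 seconds, which is the "~ 10^-6 seconds" quoted in the example.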

Utilization — Latency Trade-Off

Utilization
  Higher client request rate
    More services per second ⇒ Higher utilization
  Server cannot work faster (service rate is fixed)
    More requests are buffered ⇒ longer queue length (higher buffer level)
  Total latency for one request = server latency + queuing time
    More requests in buffer queue ⇒ longer queuing time
  Higher utilization ⇒ Higher total latency
Buffer overflow
  Average request rate > average response rate
  More requests enter buffer than leave
  Buffer level rises
  After long time, buffer overflows

[Figure: latency and buffer level (0 to 20) versus utilization (0 to 0.9)]

Queuing Theory — 1

Assumptions
  Client requests
    Arrive independently (Poisson statistics)
    Have random length (bytes to transfer)
    Average Request Rate in steady state
  Buffer
    Stores requests and forwards in order of arrival (FIFO) at service rate
    Average Buffer Level (stored requests) in steady state
  Server
    Provides services to each request independently (Poisson statistics)
    Average Service Rate in steady state

[Figure: Client (Processor), FIFO Buffer (Queue) and Server (Device) with request, forward and response paths]

Queuing Theory — 2

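Results

$$\text{Utilization} = \frac{\text{Request Rate}}{\text{Service Rate}}$$

$$\text{Latency} = \frac{1}{\text{Service Rate} - \text{Request Rate}} = \frac{1}{\text{Service Rate}} \left( \frac{1}{1 - \text{Utilization}} \right)$$

$$\text{Buffer Level} = \text{Latency} \times \text{Request Rate} = \text{Utilization} \times \left( \frac{1}{1 - \text{Utilization}} \right)$$

[Figure: Client (Processor), FIFO Buffer (Queue) and Server (Device) with request, forward and response paths]

[Figure: latency and buffer level (0 to 20) versus utilization (0 to 0.9)]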
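The three steady-state results (utilization, latency, buffer level) can be evaluated together. The sketch below assumes the single-server model with Poisson arrivals and service described on the previous slide; the struct and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    double utilization;   /* request rate / service rate            */
    double latency;       /* seconds per request, including queuing */
    double buffer_level;  /* average number of requests in system   */
} queue_stats;

/* Rates are in requests per second; the request rate must stay
 * below the service rate or no steady state exists. */
queue_stats queue_steady_state(double request_rate, double service_rate)
{
    queue_stats q;

    assert(service_rate > 0.0);
    assert(request_rate >= 0.0 && request_rate < service_rate);

    q.utilization  = request_rate / service_rate;
    q.latency      = 1.0 / (service_rate - request_rate);
    q.buffer_level = q.latency * request_rate;
    return q;
}
```

For a server handling 10^7 requests per second, raising the request rate from 5 × 10^6 to 9 × 10^6 raises utilization from 0.5 to 0.9, latency from 2 × 10^-7 to 10^-6 seconds, and the average buffer level from 1 to 9 requests, which is the steep rise visible near the right edge of the plot.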

Traffic Shaping

High utilization causes congestion
  Higher packet error rate (noise + collisions)
  Buffer overflow
  Re-transmit lost packets ⇒ more requests ⇒ more collisions
Traffic shaping
  Buffer at client imposes request quotas on client
  Client request rate = maximum transmission rate on network
  Forward rate = actual transmission rate = optimum network rate

[Figure: Client with FIFO Buffer (Queue) forwarding requests over Network to Server, with response path]

[Figure: plot with horizontal axis 0 to 2.6 and vertical axis 0 to 0.4; axis names not recoverable]

Advanced Architectures

General Outlook

Fundamental performance parameters

$$T = CPI \times IC \times \tau$$

$$CPI = CPI_{ideal} + CPI_{stall}$$

Technological limitations
  $\tau \approx (10\ \text{GHz})^{-1} = 10^{-10}$ seconds, reaching physical limit
  $IC$ grows with software complexity
  $CPI_{ideal} = 1$ for integer pipeline
  $CPI_{stall} = CPI^{stall}_{data\ dependency} + CPI^{stall}_{cache\ miss} + CPI^{stall}_{branch\ penalty}$

Areas for possible improvement
  Instruction and thread level parallelism to achieve $CPI_{ideal} < 1$
  Reducing instruction dependency stalls to lower $CPI^{stall}_{data\ dependency}$
  Reducing cache latency to lower $CPI^{stall}_{cache\ miss}$
  Reducing branch stalls to lower $CPI^{stall}_{branch\ penalty}$

Practical Successes

Instruction Level Parallelism (ILP)
  Provide multiple copies of hardware units in each processor
  Begin executing multiple instructions on same clock cycle
  Multiple instructions finish on every clock cycle ⇒ CPI < 1
Reducing instruction dependency stalls
  Compiler rescheduling or dynamic rescheduling (out-of-order execution)
Improving floating point performance
  Parallel process FP instructions
Reducing branch stalls
  Advanced branch prediction
Reducing cache latency
  Processor pre-fetches cache blocks based on address prediction
  Optimization of data structures
Thread Level Parallelism
  Provide multiple complete processor cores
  Divide code into independently executing threads

Types of Parallelism

Pipelining
  Instruction I(n+1) begins before I(n) completes
  1 instruction completes on every clock cycle τ
  R = 1 / τ instructions complete every second
Superscalar
  M > 1 copies of pipeline in parallel
  M instructions start on same clock cycle
  M instructions complete on every clock cycle
Superpipelining
  Divide pipeline into smaller stages (less work per stage)
  Less work ⇒ shorter clock cycle τ' < τ ⇒ higher clock rate R' > R
  R' > R instructions complete every second
Multiprocessor
  N > 1 program sections running on N processors
  Overall program runs in less time

DLX Pipeline

Five pipeline stages
  CC1   CC2   CC3   CC4   CC5
  IF    ID    EX    MEM   WB
  New instruction begins on each clock cycle
  One instruction completes on each clock cycle

  Clock cycle   1    2    3    4    5    6    7    8
  I1            IF   ID   EX   MEM  WB
  I2                 IF   ID   EX   MEM  WB
  I3                      IF   ID   EX   MEM  WB

$$CPI^{pipeline}_{ideal} = \frac{\text{cycles}}{\text{instruction}} = \frac{IC + 4}{IC} \xrightarrow{\ \text{large } IC\ } 1$$

$$\text{Run-Time} = CPI_{ideal} \times IC \times \tau \xrightarrow{\ \text{large } IC\ } IC \times \tau$$
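The limit of the ideal pipeline CPI can be seen numerically. This sketch assumes a stall-free pipeline in which one instruction enters per cycle; the function names are hypothetical, and the 5-stage case gives the IC + 4 cycle count used on the slide.

```c
#include <assert.h>

/* Cycles needed to run `ic` instructions through a stall-free pipeline
 * with `stages` stages: the first instruction takes `stages` cycles,
 * and each later one completes one cycle after its predecessor. */
long pipeline_cycles(long ic, int stages)
{
    assert(ic > 0 && stages > 0);
    return ic + stages - 1;
}

/* Ideal cycles per instruction; approaches 1 as ic grows. */
double pipeline_cpi(long ic, int stages)
{
    return (double)pipeline_cycles(ic, stages) / (double)ic;
}
```

`pipeline_cycles(3, 5)` is 7, matching the three-instruction diagram in which I3 writes back in cycle 7, and `pipeline_cpi(1000000, 5)` is 1.000004.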

Vector Processors

SIMD execution model
  Single Instruction performed in parallel on Multiple Data
Typical applications for vector operations
  Data compression/decompression
  Audio processing
  Graphics processing

DLX Vector Pipeline Example

  CC1   CC2   CC3   CC4   CC5
  IF    ID    EX    MEM   WB

  Instruction          1    2    3    4    5    6    7    8
  p_LW P1, 400(R1)     IF   ID   EX   MEM  WB
  p_LW P2, 800(R1)          IF   ID   EX   MEM  WB
  p_ADD P3, P1, P2               IF   ID   ID   EX   MEM  WB

  p_LW P1, 400(R1)     Load 4 memory words (16 bytes) to register P1
  p_LW P2, 800(R1)     Load 4 memory words (16 bytes) to register P2
  p_ADD P3, P1, P2     Perform 4 word additions in parallel

Examples of Vector Processing

Intel MMX Technology
  Integer operations on 64-bit registers
  8 bytes, 4 words, or 2 dwords
Intel SSE (SSE2/SSE3/AVX) Technology
  Similar to MMX for Floating Point operations

                                    SSE                    SSE2/SSE3/AVX
  Register width                    128 bits = 16 bytes    256 bits = 32 bytes
  single-precision FP operations    4                      8
  double-precision FP operations    2                      4

PowerPC AltiVec vector processor
  Similar to SSE
  128-bit registers
Compiler Support
  Vector instructions part of processor instruction set
  Reasonable compiler supports vectorization

Intel Vectorization Example 

Scalar version
  Loop runs 100 times
  Operates on one array element in each loop iteration

  int i;
  float a[100], red;
  …
  red = 0;
  for (i = 0; i < 100; i++) {
      red += a[i];
  }

Vectorized version
  Loop runs 25 times
  Operates on 4 array elements in each loop iteration

  int i;
  float a[100], red;
  …
  red = 0;
  ASM p_XOR xmm0,xmm0           /* zero 128-bit accumulator */
  for (i = 0; i < 25; i++) {
      ASM p_ADD xmm0,a[4*i]     /* four 32-bit additions */
  }
  ASM h_ADD xmm0                /* add four 32-bit FP registers to one */
  ASM p_MOV red, xmm0           /* move FP sum to memory location */
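The structure of the vectorized loop can be imitated in portable C by keeping four partial sums, one per 32-bit lane of the 128-bit accumulator. This is only an illustration of the data flow, not actual SSE code; the function names are hypothetical, and a compiler may or may not map the second loop onto vector instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reduction: one element per iteration. */
float sum_scalar(const float *a, size_t n)
{
    float red = 0.0f;
    for (size_t i = 0; i < n; i++)
        red += a[i];
    return red;
}

/* Four-lane reduction: each iteration adds 4 consecutive elements
 * into 4 independent partial sums, as the packed add does.
 * n must be a multiple of 4. */
float sum_four_lanes(const float *a, size_t n)
{
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };  /* zeroed accumulator */

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        lane[0] += a[i];
        lane[1] += a[i + 1];
        lane[2] += a[i + 2];
        lane[3] += a[i + 3];
    }
    /* horizontal add: combine the four lanes into one sum */
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For 100 elements the second loop runs 25 times. Because floating-point addition is not associative, the two functions can differ in the last bits for general data; they agree exactly when every partial sum is exactly representable, for example when the array holds small integers.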

Cache Refresh Latency

Reads 1024 × 1024 SSE operands (16-byte operands) = 16 MB
  Reads sequentially, without repeated access to same data
Pentium 4 has 64-byte block size = 4 × 16-byte operands
  Misses in L1 once every 4 accesses

  for (i = 0; i < 1024; i++) {
      for (j = 0; j < 1024; j++) {
          SSE_operation a[i][j];
      }
  }

[Figure: timeline. The pipeline alternates between 4 operations and an idle period; each miss starts a fetch on the I/O bus, and the cache update completes during the miss penalty]

Cache Prefetch

Reads 1024 × 1024 SSE operands (16-byte values) = 16 MB
Prefetching on Pentium 4
  Software prefetch loads 128 bytes (2 cache blocks)
  128 bytes = (128/16) = 8 SSE operands
  Prefetch 8 operands forward
  NOP on prefetch of cache hit

  for (i = 0; i < 1024; i++) {
      for (j = 0; j < 1024; j++) {
          prefetch a[i][j+8];
          SSE_operation a[i][j];
      }
  }

[Figure: timeline. The pipeline runs groups of 4 operations back to back while prefetch fetches on the I/O bus overlap with cache reads]
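On GCC and Clang, the prefetch step can be sketched with the `__builtin_prefetch` compiler builtin. This is an assumption about the toolchain, not something the slides specify; other compilers skip the hint through the `#if` guard. The prefetch distance of 32 floats corresponds to the 128 bytes (8 SSE operands) mentioned above, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PREFETCH_DISTANCE 32  /* floats ahead = 128 bytes, as on the slide */

/* Sum an array while hinting the cache to load data that will be
 * needed PREFETCH_DISTANCE elements later. The hint does not change
 * the result; it only tries to hide miss latency. */
float sum_with_prefetch(const float *a, size_t n)
{
    float sum = 0.0f;

    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__)
        /* stay inside the array when forming the prefetch address */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
#endif
        sum += a[i];
    }
    return sum;
}
```

Issuing a hint on every element is redundant, since one hint per cache block would be enough; the slide's note that a prefetch of a cached block acts as a NOP is what makes the simple form acceptable. Whether it helps at all depends on the hardware prefetcher and should be measured.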

P6 Superscalar with Dynamic Rescheduling

Intel P6 architecture for Intel x86 since Pentium II

Fetch/Decode
  Convert IA-32 instructions to 1 to 6 RISC-type micro-ops per CC
Instruction pool (out-of-order dynamic rescheduling)
  Holds micro-ops until ready for execution
  Scheduler issues micro-ops to parallel execution in ALU, FPU, Load, Store
  Finished micro-ops return to instruction pool with execution results
Retirement (in-order write back)
  Finished micro-ops write in original program order

[Figure: block diagram showing Instruction Memory, Fetch and Decode, Instruction Pool, Execution Units (ALU, ALU, FPU, FPU, Load, Store), Write Back, Registers and Data Memory]

Instruction Scoreboard

Status field assigned to instructions in program listing

    Status  Name       Meaning
    NR      Not Ready  At least one source operand not available
    R       Ready      All source operands available
    X       Executed   Instruction executed, destination operand not available
    F       Finished   Instruction executed, all destination operand(s) available

Instructions executed according to Status fields
    Only instructions marked Ready can be executed

Scheduling policy
    Depends on hardware organization

Update scoreboard after each clock cycle
    Completed instructions marked Finished
    Instructions marked Ready as operands become available

Scoreboard Example for DLX

Instruction status after EX clock cycle (ignoring IF, ID, MEM, WB)

    Instruction       1   2   3   4   5   6   7   8   9   10  11  12
    LW  R2,[X]        X   F   F   F   F   F   F   F   F   F   F   F
    ADD R2,R2,#123    NR  R   F   F   F   F   F   F   F   F   F   F
    SW  [X],R2        NR  NR  R   F   F   F   F   F   F   F   F   F
    LW  R3,[Y]        R   R   R   R   X   F   F   F   F   F   F   F
    SUB R3,R3,#456    NR  NR  NR  NR  NR  R   F   F   F   F   F   F
    SW  [Y],R3        NR  NR  NR  NR  NR  NR  R   F   F   F   F   F
    LW  R4,[Z]        R   R   R   R   R   R   R   R   X   F   F   F
    SUB R4,R4,#789    NR  NR  NR  NR  NR  NR  NR  NR  NR  R   F   F
    SW  [Z],R4        NR  NR  NR  NR  NR  NR  NR  NR  NR  NR  R   F

Status key
    NR  Not Ready
    R   Ready
    X   Executed
    F   Executed and destination operand available

Scheduling rules
    Issue only ready instructions
    Choose instructions in ORIGINAL PROGRAM ORDER

Scoreboard generates inefficient code

Scoreboard Example for Dynamic DLX

Instruction status after EX clock cycle (ignoring IF, ID, MEM, WB)

    Instruction       1   2   3   4   5   6   7   8   9
    LW  R2,[X]        X   F   F   F   F   F   F   F   F
    ADD R2,R2,#123    NR  R   R   F   F   F   F   F   F
    SW  [X],R2        NR  NR  NR  R   R   R   F   F   F
    LW  R3,[Y]        R   X   F   F   F   F   F   F   F
    SUB R3,R3,#456    NR  NR  R   R   F   F   F   F   F
    SW  [Y],R3        NR  NR  NR  NR  R   R   R   F   F
    LW  R4,[Z]        R   R   X   F   F   F   F   F   F
    SUB R4,R4,#789    NR  NR  R   R   R   F   F   F   F
    SW  [Z],R4        NR  NR  NR  NR  NR  R   R   R   F

Status key
    NR  Not Ready
    R   Ready
    X   Executed
    F   Executed and destination operand available

Scheduling rules
    Issue only ready instructions
    Choose instructions in ORDER THEY BECOME READY

Scoreboard generates compiler-rescheduled code

Scoreboard Example for P6

Execution condition after each clock cycle (ignoring fetch, decode, write-back)

    Instruction       Execution Unit   CC1  CC2  CC3  CC4  CC5  CC6
    LW  R2,[X]        Load             F    F    F    F    F
    ADD R2,R2,#123    ALU              R    F    F    F    F
    SW  [X],R2        Store            NR   R    F    F    F
    LW  R3,[Y]        Load             R    F    F    F    F
    SUB R3,R3,#456    ALU              NR   R    F    F    F
    SW  [Y],R3        Store            NR   NR   R    F    F
    LW  R4,[Z]        Load             R    R    F    F    F
    SUB R4,R4,#789    ALU              NR   NR   R    F    F
    SW  [Z],R4        Store            NR   NR   NR   R    F

Status key
    NR  Not Ready
    R   Ready
    F   Finished

Scheduling rules
    Issue only ready instructions
    Among ready instructions, maintain program list order
    Only 1 load and 1 store per CC
    Up to 2 ALU and 2 FPU instructions per CC
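The three scoreboard examples (in-order DLX, dynamic DLX, and P6) differ only in issue policy, load latency, and the number of execution units. The sketch below is a simplified model written for illustration, not the author's scoreboard: the names `cycles_to_finish`, `PROGRAM_ORDER` and `READY_ORDER` are hypothetical, and the load latencies (2 cycles for the DLX examples, 1 cycle for P6) are assumptions read off the status tables.

```c
#include <assert.h>

enum unit   { LOAD, ALU, STORE };
enum policy { PROGRAM_ORDER, READY_ORDER };

struct uop { enum unit unit; int dep; };  /* dep = index of producer, -1 if none */

#define N_UOPS 9

/* Three independent load / ALU / store chains, as in the slides. */
static const struct uop prog[N_UOPS] = {
    { LOAD, -1 }, { ALU, 0 }, { STORE, 1 },
    { LOAD, -1 }, { ALU, 3 }, { STORE, 4 },
    { LOAD, -1 }, { ALU, 6 }, { STORE, 7 },
};

/* Returns the number of execution cycles needed for all nine micro-ops.
   width[u] = micro-ops per cycle on unit class u; issue_max = total per cycle. */
int cycles_to_finish(enum policy pol, int load_latency, int issue_max, const int width[3])
{
    int busy_until[N_UOPS] = { 0 };  /* last cycle before the result is usable; 0 = not issued */
    int issued = 0, cycle = 0;

    assert(load_latency >= 1 && issue_max >= 1);
    assert(width[LOAD] >= 1 && width[ALU] >= 1 && width[STORE] >= 1);

    while (issued < N_UOPS) {
        int used[3] = { 0, 0, 0 };
        cycle++;
        for (int slot = 0; slot < issue_max; slot++) {
            int best = -1, best_since = 0;
            for (int i = 0; i < N_UOPS; i++) {
                if (busy_until[i] != 0)
                    continue;                            /* already issued */
                int d = prog[i].dep;
                int producer_issued = (d < 0) || (busy_until[d] != 0);
                int since = (d < 0) ? 1 : busy_until[d] + 1;  /* first cycle it is Ready */
                int ok = producer_issued && since <= cycle
                         && used[prog[i].unit] < width[prog[i].unit];
                if (ok && (best < 0 || since < best_since)) {
                    best = i;                            /* earliest-ready wins, ties by program order */
                    best_since = since;
                }
                if (pol == PROGRAM_ORDER)
                    break;                               /* only the oldest unissued micro-op may issue */
            }
            if (best < 0)
                break;                                   /* nothing can issue in this cycle */
            busy_until[best] = cycle + (prog[best].unit == LOAD ? load_latency : 1) - 1;
            used[prog[best].unit]++;
            issued++;
        }
    }
    return cycle;
}
```

Under these assumptions the model reproduces the cycle counts of the three tables: 12 cycles for single-issue program order, 9 cycles for single-issue ready order, and 5 cycles with one load unit, two ALUs and one store unit.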

Program Execution in P6

IA-32 instructions
    ADD [X],123
    SUB [Y],567
    SUB [Z],789

IA-32 instructions decoded in 2 CC to RISC micro-ops with register renaming
    LW  R2,[X]
    ADD R2,R2,#123
    SW  [X],R2
    LW  R3,[Y]
    SUB R3,R3,#567
    SW  [Y],R3
    LW  R4,[Z]
    SUB R4,R4,#789
    SW  [Z],R4

Dynamic scheduling

    CC     Load         ALU               Store
    CC1    LW R2,[X]
    CC2    LW R3,[Y]    ADD R2,R2,#123
    CC3    LW R4,[Z]    SUB R3,R3,#567    SW [X],R2
    CC4                 SUB R4,R4,#789    SW [Y],R3
    CC5                                   SW [Z],R4

[Block diagram: Instruction Memory, Fetch and Decode, Instruction Pool, Execution Units (ALU, ALU, FPU, FPU, Load, Store), Write Back, Registers, Data Memory]

Hardware Utilization

Good program efficiency
    Program executes in minimum number of sequential cycles

Low hardware utilization
    Most execution units idle in most clock cycles

Higher ILP ⇒ higher utilization of execution units
    Higher utilization ⇒ larger pool of independent instructions

Speculation: deep branch prediction
    Many instructions executed before program flow determined

Hardware multithreading
    Instructions from different threads are independent

    Unit    CC1          CC2               CC3               CC4               CC5
    ALU     IDLE         ADD R2,R2,#123    SUB R3,R3,#567    SUB R4,R4,#789    IDLE
    ALU     IDLE         IDLE              IDLE              IDLE              IDLE
    FPU     IDLE         IDLE              IDLE              IDLE              IDLE
    FPU     IDLE         IDLE              IDLE              IDLE              IDLE
    Load    LW R2,[X]    LW R3,[Y]         LW R4,[Z]         IDLE              IDLE
    Store   IDLE         IDLE              SW [X],R2         SW [Y],R3         SW [Z],R4
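As a rough worked figure for the utilization in this example, counting each execution unit in each clock cycle as one issue slot (a simplification introduced here, not a measure defined on the slide): 9 micro-ops occupy slots on 6 units over 5 cycles.

$$\text{utilization} = \frac{\text{busy slots}}{\text{units} \times \text{cycles}} = \frac{9}{6 \times 5} = 0.30$$

So about 70% of the issue slots stay idle even though the program finishes in the minimum number of cycles.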

Deep Superpipeline for DLX

Divide each pipeline stage into 2 smaller stages
    Each new stage does half the work in half the time
    New stage finishes in half the time ⇒ Double clock speed

    Clock   1    2    3    4    5    6    7     8     9     10    11    12
    I1      IF1  IF2  ID1  ID2  EX1  EX2  MEM1  MEM2  WB1   WB2
    I2           IF1  IF2  ID1  ID2  EX1  EX2   MEM1  MEM2  WB1   WB2
    I3                IF1  IF2  ID1  ID2  EX1   EX2   MEM1  MEM2  WB1   WB2

$$\text{Double clock speed} \Rightarrow \tau_{superpipeline} = \tfrac{1}{2}\,\tau_{pipeline}$$

$$CPI^{ideal}_{superpipeline} = \frac{IC + 9}{IC} \xrightarrow{\ \text{large } IC\ } 1$$

$$T^{ideal}_{superpipeline} = CPI^{ideal}_{superpipeline} \times IC \times \tau_{superpipeline} \xrightarrow{\ \text{large } IC\ } IC \times \tfrac{1}{2}\,\tau_{pipeline} = \tfrac{1}{2}\,T^{ideal}_{pipeline}$$

Problems with deep superpipeline
    Some instructions cannot be effectively split
    Some operations do not scale in time: faster clock ⇒ more stall cycles
        Cache update, branch penalty, page fault, etc.

Pentium 4 Superpipeline

Pentium III
    10 stage pipeline at clock speed up to about 1.5 GHz

Pentium 4
    20 stage pipeline at clock speed up to about 4.0 GHz

Expect
    1.5 GHz processor ~ 50% faster than same processor at 1.0 GHz

Measurement on SPEC CINT2000
    1.5 GHz Pentium-4 ~ 20% faster than 1.0 GHz Pentium-III

$$S = \frac{CPI^{PIII} \times IC \times \tau^{PIII}}{CPI^{P4} \times IC \times \tau^{P4}} = \frac{CPI^{PIII}}{CPI^{P4}} \times \frac{1/(1.0\ \text{GHz})}{1/(1.5\ \text{GHz})} = 1.2 \;\Rightarrow\; CPI^{P4} = \frac{1.5}{1.2} \times CPI^{PIII} = 1.25 \times CPI^{PIII}$$

$$CPI^{P4}_{ideal} + CPI^{P4}_{stall} = 1.25 \times \left( CPI^{PIII}_{ideal} + CPI^{PIII}_{stall} \right)$$

$$CPI^{P4}_{ideal} = CPI^{PIII}_{ideal} = 1 \;\Rightarrow\; CPI^{P4}_{stall} = 1.25 \times CPI^{PIII}_{stall} + 0.25 \approx 0.5$$
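The arithmetic behind the Pentium 4 comparison can be sketched in a few lines. The function names below are hypothetical, and the sketch assumes, as the slide does, that the instruction count is unchanged and that both processors have the same ideal CPI.

```c
#include <assert.h>

/* CPI_new / CPI_old implied by a measured speedup and a clock ratio.
   From S = (CPI_old / CPI_new) * (f_new / f_old), with IC unchanged. */
double cpi_ratio(double clock_ratio, double measured_speedup)
{
    assert(clock_ratio > 0.0 && measured_speedup > 0.0);
    return clock_ratio / measured_speedup;
}

/* Stall CPI of the new design, assuming both designs share cpi_ideal:
   cpi_ideal + stall_new = ratio * (cpi_ideal + stall_old). */
double stall_cpi_new(double cpi_ideal, double stall_cpi_old, double ratio)
{
    assert(cpi_ideal > 0.0 && stall_cpi_old >= 0.0 && ratio > 0.0);
    return ratio * (cpi_ideal + stall_cpi_old) - cpi_ideal;
}
```

For a clock ratio of 1.5 and a measured speedup of 1.2, `cpi_ratio` gives 1.25. A Pentium III stall CPI of 0.2 (an illustrative value, not taken from the slide) would then give a Pentium 4 stall CPI of 0.5.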

Hyper-Threading

Two copies of architectural state, one execution core
    OS sees two sets of registers: looks like two CPUs
    OS assigns threads to CPU 0 and CPU 1
    CPU 0 and CPU 1 issue instructions to shared execution core

No stall in either thread
    CPU 0 and CPU 1 issue instructions on alternate clock cycles

Stall in one thread
    Other CPU issues instructions on each clock cycle until stall ends

Both CPUs keep working on most clock cycles

Architectural State
    Registers, stack pointers and program counter

Execution Core
    ALU, FPU, vector processors, memory unit

[Block diagram: CPU 0 and CPU 1 each hold an Architectural State and share one Execution Core and Cache; Main Memory; I/O Bus / PCI Bridge]

Expected Improvement from Hyper-Threading

Without hyper-threading

$$CPI_{stall} = P_{stall} \times CPS, \qquad CPS = \text{stall cycles per stall}$$

With hyper-threading

$$CPI^{HT}_{stall} = P[\text{2 simultaneous stalls}] \times CPS \approx \left( P_{stall} \right)^2 \times CPS$$

Speedup

$$S_{HT} = \frac{CPI \times IC \times \tau}{CPI^{HT} \times IC \times \tau} = \frac{1 + CPS \times P_{stall}}{1 + CPS \times \left( P_{stall} \right)^2}$$

Take CPI_stall = 0.5 for Pentium-4, with CPS ≈ 2 cycles per stall

$$P_{stall} = \frac{CPI_{stall}}{CPS} = \frac{0.5}{2} = 0.25 \;\Rightarrow\; S_{HT} = \frac{1 + 2 \times 0.25}{1 + 2 \times (0.25)^2} \approx 1.33$$

Measured improvement: S = 1.22

Ref: Intel, "Hyper-Threading Technology Architecture and Microarchitecture"
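The expected hyper-threading speedup can be evaluated numerically with a short function. This is a sketch of the slide's estimate only: it assumes an ideal CPI of 1 and independent stalls in the two threads, and the name `ht_speedup` is hypothetical.

```c
#include <assert.h>

/* Expected speedup from hyper-threading with ideal CPI = 1.
   cps     = stall cycles per stall
   p_stall = probability that an instruction stalls
   With two threads, the shared core idles only when both stall at once,
   which is approximated by p_stall squared (independent stalls). */
double ht_speedup(double cps, double p_stall)
{
    assert(cps >= 0.0 && p_stall >= 0.0 && p_stall <= 1.0);

    double cpi    = 1.0 + cps * p_stall;            /* without hyper-threading */
    double cpi_ht = 1.0 + cps * p_stall * p_stall;  /* with hyper-threading    */
    return cpi / cpi_ht;
}
```

Calling `ht_speedup(2.0, 0.25)` evaluates 1.5 / 1.125, about 1.33, which is more optimistic than the measured value of 1.22 quoted from Intel.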

Intel Nehalem Micro-Architecture

[Figure: Nehalem micro-architecture block diagram]

Ref: David Kanter, "Inside Nehalem: Intel's Future Processor and System", http://realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT040208182719&mode=print

Amdahl's Equation in Parallel Processing

F_P = fraction of processing that can be performed independently (work can be parallelized)
1 - F_P = fraction of work that cannot be parallelized
N = number of processing units

$$CPI_N = CPI_1 \times (1 - F_P) + CPI_1 \times \frac{F_P}{N}$$

$$S_N = \frac{CPI_1 \times IC \times \tau}{CPI_N \times IC \times \tau} = \frac{CPI_1}{CPI_N} = \frac{1}{(1 - F_P) + \dfrac{F_P}{N}}$$
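Amdahl's equation is easy to evaluate for different processor counts. The function below is a minimal sketch; the name `amdahl_speedup` and its parameter names are chosen here for illustration.

```c
#include <assert.h>

/* Speedup on n processing units when a fraction f_p of the work
   can be parallelized and the rest (1 - f_p) remains serial. */
double amdahl_speedup(double f_p, int n)
{
    assert(f_p >= 0.0 && f_p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - f_p) + f_p / n);
}
```

For example, with half the work parallelizable, two processors give a speedup of about 1.33, and no number of processors can exceed 2.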

MP and HT Performance Enhancements

Speed-up (S) for On Line Transaction Processing (OLTP) Workload

MP Without Hyper Threading

    CPUs    S       S/CPU
    2       1.72    0.85
    4       2.64    0.65

Hyper Threading Without MP

    CPUs    S       S/CPU
    2       1.22    0.60

Rise and Fall of Multiprocessor R&D

[Chart: topics of papers submitted to ISCA, 1973 to 2001, sorted as percent of total]

ISCA: International Symposium on Computer Architecture

Hennessy and Patterson joke that the proper place for multiprocessing in their book is Chapter 11 (a section of US business law on bankruptcy)

Ref: Mark D. Hill and Ravi Rajwar, "The Rise and Fall of Multiprocessor Papers in the International Symposium on Computer Architecture (ISCA)", http://pages.cs.wisc.edu/~markhill/mp2001.html

Basic Interprocess Communication Models

Shared memory system
    Interprocess communication: write/read shared memory location
    Single shared address space
    Sequential consistency enforced by cache snooping
        Bus imposes write / read order
        Cache coherency overhead

Message passing system
    Interprocess communication: send/receive structured messages
        Send / request data
        Provide requested data or status
    Sequential consistency enforced by message content + synchronization
        No snooping or snooping overhead
        Message management contributes overhead

Multiprocessor Shared Memory Multi-Threading

One or more physical microprocessors
    Architectural state
        Registers, including stack pointers and program counter
    Execution core
        Integer ALUs, FPUs, vector processors, memory access

OS assigns a thread to each processor
    Each thread runs independently
    On long stall (page fault) a CPU can switch threads

[Block diagram: CPU 0 and CPU 1, each with its own Architectural State, Execution Core and Cache; shared Main Memory; I/O Bus / PCI Bridge]

Multi-Core Shared Memory Multi-Threading

Multiprocessor system on one physical chip
    Cheaper than multi-microprocessor system

Can be bottleneck at memory bus
    Both processors need to update cache simultaneously
    One processor must wait

[Block diagram: CPU 0 and CPU 1, each with its own Architectural State, Execution Core and L1 Cache; shared L2 Cache; Main Memory; I/O Bus / PCI Bridge]

OpenMP for Shared Memory Systems

Application Program Interface (API) for multiprocessing
    Supports shared memory applications in C/C++ and Fortran
    Provides directives for explicit thread-based parallelization
    Simple programming models on shared memory machines

Fork-Join Model
    Master thread (consumer thread)
        Programs initiate as single thread
        Executes sequentially until parallel construct is encountered
    Fork (producer thread)
        Master thread creates team of parallel threads
        Program statements in parallel construct execute in parallel
    Join
        Team threads complete
        Synchronize and terminate
        Master thread continues
    Nesting
        Forks can be defined within parallel sections

Ref: https://computing.llnl.gov/tutorials/openMP/

General Code Structure

#include <omp.h>

int main(void)
{
    int var1, var2, var3;

    /* Serial code */
    ...

    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */
        ...
        /* All threads join master thread and disband */
    }

    /* Resume serial code */
    ...
}

Variables shared among all threads
    One copy accessed by all threads

Variables private to each thread
    Each thread has private copy

"Hello Worlds" Program

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int nthreads, tid;

    /* Fork team of threads with private variables */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */

    return 0;
}

Parallel For

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 12; i++)
        c[i] = a[i] + b[i];
}

Data decomposition
    12 loop iterations divided among 3 CPU cores
    Each core executes 4 loop iterations in parallel

[Diagram: Master Thread, fork at omp parallel, three parallel for threads (i = 0 to 3, i = 4 to 7, i = 8 to 11), join, Master Thread]
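A self-contained function using the same data decomposition could be written as below. It is a sketch rather than part of the original slides: the combined `parallel for` directive and `schedule(static)` clause are standard OpenMP, while the function name `vector_add` is chosen here. Compiled without OpenMP support (for example without `-fopenmp` in GCC), the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* c[i] = a[i] + b[i]; with OpenMP enabled, the iterations are divided
   into contiguous chunks, one chunk per thread in the team. */
void vector_add(const int *a, const int *b, int *c, int n)
{
    assert(n >= 0);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With n = 12 and a team of 3 threads, a static schedule typically assigns 4 consecutive iterations to each thread, matching the slide's picture.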

Sections

#pragma omp parallel shared(a,b,c,d) private(i)
{
    #pragma omp sections
    {
        #pragma omp section
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        #pragma omp section
        for (i = 0; i < N; i++)
            d[i] = a[i] * b[i];
    } /* end of sections */
} /* end of parallel section */

Functional decomposition
    Enclosed sections of code divided among threads in team

[Diagram: Master Thread, fork at omp parallel, two parallel sections (c[i] = a[i] + b[i] and d[i] = a[i] * b[i]), join, Master Thread]

Message Passing Example: Vector Product

Compute $\sum_{i=0}^{3} a[i] * b[i]$ from data pre-distributed to nodes

    P0               P1               P2               P3
    load Ra, a       load Ra, a       load Ra, a       load Ra, a
    load Rb, b       load Rb, b       load Rb, b       load Rb, b
    Ra ← Ra * Rb     Ra ← Ra * Rb     Ra ← Ra * Rb     Ra ← Ra * Rb
    send P1, Ra      recv P0, Rb      send P3, Ra      recv P2, Rb
                     Ra ← Ra + Rb                      Ra ← Ra + Rb
                     send P3, Ra                       recv P1, Rb
                                                       Ra ← Ra + Rb
                                                       store p, Ra

Message overhead
    Source or destination
    Time of creation

Sequential consistency guaranteed by message overhead
    P3 distinguishes two reads (receives) from P1 and P2 by source address
    No data hazard

Scatter and Gather

Scatter
    Send buffer:          1 2 3 4
    Destination buffers:  Task 0 = 1, Task 1 = 2, Task 2 = 3, Task 3 = 4

Gather
    Send buffers:         Task 0 = A, Task 1 = B, Task 2 = C, Task 3 = D
    Destination buffer:   A B C D

Reduce

Reduce: ADD
    Send buffers:         Task 0 = 1, Task 1 = 2, Task 2 = 3, Task 3 = 4
    Destination buffer:   10
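The data movement of scatter, gather and reduce can be modelled in ordinary C without a message-passing library. The sketch below is a single-process emulation written only to show the semantics: `struct task`, `scatter`, `gather` and `reduce_add` are hypothetical names, and no real messages are exchanged.

```c
#include <assert.h>

#define N_TASKS 4

struct task { int local; };   /* one word of private memory per task */

/* Scatter: element t of the send buffer is delivered to task t. */
void scatter(const int send[N_TASKS], struct task tasks[N_TASKS])
{
    for (int t = 0; t < N_TASKS; t++)
        tasks[t].local = send[t];
}

/* Gather: the value held by task t lands in slot t of the destination buffer. */
void gather(const struct task tasks[N_TASKS], int dest[N_TASKS])
{
    for (int t = 0; t < N_TASKS; t++)
        dest[t] = tasks[t].local;
}

/* Reduce with ADD: all task values are combined into a single result. */
int reduce_add(const struct task tasks[N_TASKS])
{
    int sum = 0;
    for (int t = 0; t < N_TASKS; t++)
        sum += tasks[t].local;
    return sum;
}
```

Scattering the buffer 1 2 3 4 and then reducing with ADD yields 10, as in the slide. In a real message-passing program each loop iteration would correspond to a send or receive between separate processes.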

MPI "Hello World"

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    char message[20];
    int myrank;           /* myrank = this process number */
    MPI_Status status;    /* MPI_Status = receive status and error flags */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* MPI_COMM_WORLD = list of active MPI processes */

    if (myrank == 0) {    /* code for process zero */
        strcpy(message, "Hello, there");
        MPI_Send(message, strlen(message) + 1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
    } else {              /* code for process one */
        MPI_Recv(message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
        printf("received :%s:\n", message);
    }

    MPI_Finalize();
    return 0;
}

Ref: "MPI: A Message-Passing Interface Standard Version 1.3"


Real‐Life RISC

MIPS Architecture

RISC Instruction Set Architecture (ISA)
    Defines registers + instructions

MIPS cores
    Define device-dependent implementation details
    Pipeline organization, I/O organization, control registers, ...

MIPS32
    32-bit RISC ISA
    Basis for DLX

MIPS64
    64-bit RISC ISA
    Binary compatible with MIPS32

Applications
    Typically licensed to OEMs
    Design implemented in embedded systems
    MIPS-based PCs used in China

MIPS32 ISA - 1

Registers
    32-bit integer registers
        R0, R1, ... , R31
        Regs[R0] = 0 (read-only)
    32-bit FP registers
        F0, F1, ... , F31
    Special registers
        HI, LO
        64-bit result of integer multiply
        Quotient + remainder result of integer divide

Instruction formats

    Bits     31-26    25-21   20-16   15-11   10-6   5-0
    Width    6        5       5       5       5      6
    R        opcode   rs      rt      rd      sa     function
    I        opcode   rs      rt      immediate (bits 15-0)
    J        opcode   target (bits 25-0)
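The R and I formats map directly onto shift-and-OR packing. The helpers below are an illustrative sketch with hypothetical names (`encode_r`, `encode_i`). The example values in the note, function code 0x20 for ADD and opcode 0x08 for ADDI, come from the MIPS32 instruction set rather than from the slide.

```c
#include <assert.h>
#include <stdint.h>

/* Pack an R-type instruction: opcode | rs | rt | rd | sa | function. */
uint32_t encode_r(uint32_t opcode, uint32_t rs, uint32_t rt,
                  uint32_t rd, uint32_t sa, uint32_t function)
{
    assert(opcode < 64 && rs < 32 && rt < 32 && rd < 32 && sa < 32 && function < 64);
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (sa << 6) | function;
}

/* Pack an I-type instruction: opcode | rs | rt | 16-bit immediate. */
uint32_t encode_i(uint32_t opcode, uint32_t rs, uint32_t rt, uint16_t immediate)
{
    assert(opcode < 64 && rs < 32 && rt < 32);
    return (opcode << 26) | (rs << 21) | (rt << 16) | immediate;
}
```

For instance, `encode_r(0, 1, 2, 3, 0, 0x20)` packs ADD R3, R1, R2 as 0x00221820, and `encode_i(0x08, 1, 2, 100)` packs ADDI R2, R1, 100 as 0x20220064.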

MIPS32 ISA - 2

Coprocessors
    Logical extensions of basic MIPS ISA
    Accessed via coprocessor read / write instructions

CP0
    System Control Coprocessor, on CPU
    Supports virtual memory system and exception handling
        Translates virtual addresses into physical addresses
        Controls cache subsystem
        Handles switches between kernel / supervisor / user states
        Manages exceptions / diagnostic control / error recovery

CP1
    Interface to FPU

CP2
    Available for device-specific implementations

CP3
    Interface to FPU on MIPS64 and newer MIPS32

MIPS32 ISA - 3

Some MIPS instructions not in DLX

    Category                   Instruction              Description
    Coprocessor Load / Store   LWCz rt, imm(reg)        Load Word to Coprocessor_z, z = 1 or 2
                               SWCz imm(reg), rt        Store Word from Coprocessor_z, z = 1 or 2
    Test+Set                   SLTI rt, rs, imm         Set on Less Than Immediate
    Shift                      ROTR                     Rotate Word Right
                               SLL / SRA                Shift Word Left Logical / Right Arithmetic
    Multiply                   MUL rd, rs, rt           Multiply to GR
                               MULT rs, rt              Multiply to HI_LO
                               MADD rs, rt              Multiply and add to HI_LO
    Branch                     BGTZ / BGEZ              Branch greater / greater or equal zero
                               BLTZ / BLEZ              Branch less / less or equal zero
    Synchronize                SYNC                     Critical section for shared memory
    System                     SYSCALL                  System Call
    Trap                       TEQ / TGE / TNE          Trap if equal / greater or equal / not equal
    Cache                      PREF                     Prefetch
    Extract                    EXT rt, rs, pos, size    rt ← substr(rs,pos=sa,size=rd)

MIPS64 ISA

Registers
    64-bit integer registers
        R0, R1, ... , R31
        Regs[R0] = 0 (read-only)
    32-bit FP registers on 32-bit FPU
    64-bit FP registers on 64-bit FPU
        F0, F1, ... , F31
    Special registers
        HI, LO
        128-bit result of integer multiply
        Quotient + remainder result of integer divide

Instruction formats
    32-bit instruction length, binary compatible with MIPS32
    MIPS32/64 instructions act on lower 32 bits in registers
    MIPS64_double instructions act on full 64 bits in registers
    Memory address = 64-bit pointer (register) + 16-bit immediate

ARM Overview

Microprocessor and microcontroller for embedded systems
    Advanced RISC Machine developed by ARM Limited
    ARM Ltd primarily licenses ISA implementations to developers

Most widely used 32-bit RISC ISA
    Over 50 billion ARM processors used in phones, games, peripherals
    98 percent of mobile phones use ARM

RISC Architectural Features

Data Types
    Byte        8 bits
    Halfword    16 bits (in ARMv4 and higher)
    Word        32 bits

Standard RISC
    Load/store architecture
    Large uniform register file
    Simple addressing modes
    Uniform and fixed-length instruction fields
    Scalar in-order pipeline

Additional ARM architectural features
    Shift + ALU operations
    Auto-increment / auto-decrement addressing modes for loops
    Load and Store Multiple instructions
    Conditional execution of most instructions
        Cancel instructions on certain condition flags
        Replaces control hazards in forward jumps

ARM Versions

    Architecture Version    Processor Family
    ARMv1                   ARM1
    ARMv2                   ARM2, ARM3
    ARMv3                   ARM6, ARM7
    ARMv4                   ARM7TDMI, StrongARM, ARM8, ARM9TDMI
    ARMv5TE, ARMv5TEJ       ARM9E, ARM10E, XScale
    ARMv6                   ARM11
    ARMv7                   Cortex
    ARMv8                   Cortex

Extension Features

    T    Thumb
    D    Debugger
    M    Multiplier (64-bit result)
    I    ICE
    E    DSP Enhancement (implies TDMI)
    J    Jazelle (Java)

Seven Operating Modes

    Mode          Description
    User          Normal (non-privileged) program execution mode
                  No access to protected resources
    FIQ           Supports high speed data transfer or DMA process
                  Entered on high priority (fast) interrupt
    IRQ           General purpose interrupt handling
                  Entered on low priority (normal) interrupt
    Supervisor    Protected mode for operating system
                  Entered on reset and on Software Interrupt
    Abort         Implements virtual memory and/or memory protection
                  Handles memory access violations
    Undef         Supports software emulation of hardware coprocessors
    System        Runs privileged operating system tasks (ARMv4 and above)
                  Accesses user mode registers

Registers

32-bit general purpose registers
    16 architectural registers r0, … , r15
        r11: FP = Frame Pointer
        r12: IP = intra-procedure-call scratch register
        r13: SP = Stack Pointer used by push/pop instructions
        r14: LR = Link Register used to return from function calls
        r15: PC = Program Counter
    31 physical registers
        Multiple copies of r8, … , r14
        Each copy accessible in specific operating mode

32-bit status registers
    Current Program Status Register (CPSR) visible in all modes
        32 bits wide
    5 Saved Program Status Registers (SPSR)
        Privileged modes copy previous CPSR

Modes and Visible Registers

Mode columns: User, FIQ, IRQ, Supervisor, Abort, Undef

    Register                                              Visible in
    r0 to r10                                             all six modes
    r11 (FP), r12 (IP), r13 (SP), r14 (LR), r15 (PC)      all six modes
    CPSR                                                  all six modes
    SPSR                                                  FIQ, IRQ, Supervisor, Abort, Undef (not User)

Instruction Sets

ARM
    32-bit instructions
    Aligned on 32-bit boundaries
        Lowest 2 bits of PC (r15) always 0

Thumb
    16-bit instructions 1-to-1 mapped to 32-bit instructions
    Shortened versions with restricted options and implicit operands
        Example: add dest, src1, src2 becomes add dest, src
    Aligned on 16-bit boundaries
        Lowest bit of PC (r15) always 0
    Set by T = 1 in CPSR

Jazelle
    Executes Java bytecode directly
    ARM reads 4 8-bit instructions per instruction fetch
    Set by J = 1 in CPSR

Conditional Execution

Flag set suffix S
    ALU instructions set CPSR flags
        N: Negative    Z: Zero
        C: Carry       V: oVerflow

Conditional execution suffix
    Execute instructions if flag combination is true

Example: operate-compare-if-else

Usual Execution

        SUB r3, r3, #1
        CMP r3, #0
        BEQ L1              ; branch on 0
        ADD r0, r1, r2      ; skip if r3 = 0
        B   L2              ; jump to L2
    L1: SUB r0, r1, r2      ; skip if r3 != 0
    L2: ADD r4, r5, r6

Conditional Execution

        SUBS  r3, r3, #1
        ADDNE r0, r1, r2
        SUBEQ r0, r1, r2
        ADD   r4, r5, r6
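Both assembly sequences implement the same if-else, which in C could be written as below. This is a reading aid only: the `struct regs` register model and the function name are hypothetical, and the comments map each statement to the conditional-execution version.

```c
#include <assert.h>
#include <stddef.h>

struct regs { int r0, r1, r2, r3, r4, r5, r6; };

/* C equivalent of the operate-compare-if-else example. */
void operate_compare_if_else(struct regs *r)
{
    assert(r != NULL);

    r->r3 = r->r3 - 1;             /* SUBS r3, r3, #1  (sets Z if result is 0) */
    if (r->r3 != 0)
        r->r0 = r->r1 + r->r2;     /* ADDNE r0, r1, r2 */
    else
        r->r0 = r->r1 - r->r2;     /* SUBEQ r0, r1, r2 */
    r->r4 = r->r5 + r->r6;         /* ADD r4, r5, r6   (unconditional) */
}
```

The conditional version needs no branch instructions, so the pipeline never has to redirect fetch for this short if-else.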

Page 395: Introduction to Computer Architecturecs.hac.ac.il/staff/martin/Architecture/00_arch_slides.pdf · 2019-01-31 · Computer Architecture — Hadassah College — Spring 2019 Introduction

10-15Dr. Martin LandReal-Life RISCComputer Architecture — Hadassah College — Spring 2019

Basic Instructions

Transfer
    MOV: Move register-to-register or immediate-to-register
    MVN: Move Not
    LDR, STR: load / store

Branch
    B: Add sign-extended 24-bit immediate, shifted left 2 bits, to PC (r15)
    BL: Branch and store return address (address of BL + 4) in link register (r14)
    Conditional branch: conditional execution of B or BL
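The branch target arithmetic can be sketched in C as follows. This is a hedged illustration: the function names are hypothetical, and the facts that the 24-bit field counts words and that PC reads as the branch address plus 8 are taken from the ARM architecture documentation, not from the slide.

```c
#include <stdint.h>

/* Sign-extend the 24-bit branch field to a 32-bit signed value. */
int32_t sign_extend24(uint32_t imm24) {
    imm24 &= 0x00FFFFFFu;
    if (imm24 & 0x00800000u)
        return (int32_t)imm24 - 0x01000000;   /* negative offset */
    return (int32_t)imm24;
}

/* Target of B / BL located at branch_addr.
   Assumption: PC reads as branch_addr + 8 when the branch executes. */
uint32_t branch_target(uint32_t branch_addr, uint32_t imm24) {
    int32_t byte_offset = sign_extend24(imm24) * 4;   /* words to bytes */
    return branch_addr + 8u + (uint32_t)byte_offset;
}

/* Value BL stores in the link register: address of the next instruction. */
uint32_t link_value(uint32_t branch_addr) {
    return branch_addr + 4u;
}
```

With a field value of 0xFFFFFE (that is, -2 words) the target equals the address of the branch itself, which is the usual encoding of an endless loop.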

ALU
    ADD  Add
    ADC  Add with Carry
    AND  Logical AND
    BIC  Logical Bit Clear
    CMP  Compare
    CMN  Compare Negative
    EOR  Logical XOR
    MUL  Multiply
    MLA  Multiply and Add
    ORR  Logical OR
    SUB  Subtract
    SBC  Subtract with Carry
    RSB  Reverse Subtract
    RSC  Reverse Subtract with Carry
    TST  Test
    TEQ  Test Equivalence
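Several of these mnemonics are less familiar than ADD or SUB. The following C sketch shows the result each one computes on 32-bit values; the function names are hypothetical, flag updates are omitted, and the carry conventions for ADC and SBC follow the ARM architecture documentation rather than the slide.

```c
#include <stdint.h>

/* BIC: clear in rn every bit that is set in op2. */
uint32_t op_bic(uint32_t rn, uint32_t op2) { return rn & ~op2; }

/* MVN: move the bitwise complement of the operand. */
uint32_t op_mvn(uint32_t op2) { return ~op2; }

/* RSB: reverse subtract, operands swapped relative to SUB. */
uint32_t op_rsb(uint32_t rn, uint32_t op2) { return op2 - rn; }

/* ADC: add with carry-in (carry is 0 or 1). */
uint32_t op_adc(uint32_t rn, uint32_t op2, unsigned carry) {
    return rn + op2 + (carry & 1u);
}

/* SBC: subtract with carry; on ARM a clear carry means "borrow". */
uint32_t op_sbc(uint32_t rn, uint32_t op2, unsigned carry) {
    return rn - op2 - (1u - (carry & 1u));
}

/* MLA: multiply and add, keeping the low 32 bits. */
uint32_t op_mla(uint32_t rm, uint32_t rs, uint32_t rn) {
    return rm * rs + rn;
}
```

RSB is useful because only the second ALU operand can be an immediate or a shifted register, so computing, say, 10 minus a register needs the reversed form.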


Shift and Rotate

Legal ALU instruction operands
    32-bit source / destination register contents
    12-bit immediate field (8-bit value rotated right by an even number of bits)
    Shifted operand
        Shifted / rotated 32-bit source register contents
        Number of shifts set by 5-bit immediate or by low byte of a register

Shifts
    LSL: Logical Shift Left
    ASR: Arithmetic Shift Right
    LSR: Logical Shift Right

Rotates
    ROR: Rotate Right
    RRX: Rotate Right Extended (CF into MSB)
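The five shifter operations can be modeled in C as below. This is a sketch with hypothetical function names; shift counts are assumed to be in the range 0 to 31, and the carry-out that the real shifter also produces is modeled only for RRX.

```c
#include <stdint.h>

/* LSL: logical shift left, zeros enter at the right (n in 0..31). */
uint32_t lsl(uint32_t x, unsigned n) { return x << n; }

/* LSR: logical shift right, zeros enter at the left (n in 0..31). */
uint32_t lsr(uint32_t x, unsigned n) { return x >> n; }

/* ASR: arithmetic shift right, copies of the sign bit enter at the left. */
uint32_t asr(uint32_t x, unsigned n) {
    uint32_t r = x >> n;
    if ((x & 0x80000000u) && n > 0)
        r |= ~(0xFFFFFFFFu >> n);        /* fill vacated bits with 1s */
    return r;
}

/* ROR: rotate right, bits leaving at the right re-enter at the left. */
uint32_t ror(uint32_t x, unsigned n) {
    n &= 31u;
    return n ? (x >> n) | (x << (32u - n)) : x;
}

/* RRX: rotate right by one through the carry flag (33-bit rotate). */
uint32_t rrx(uint32_t x, unsigned carry_in, unsigned *carry_out) {
    *carry_out = x & 1u;                           /* old bit 0 becomes CF */
    return (x >> 1) | ((uint32_t)(carry_in & 1u) << 31);   /* CF into MSB */
}
```

ASR preserves the sign of a two's-complement value, so it behaves like signed division by a power of two with rounding toward minus infinity, while LSR treats the value as unsigned.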


VFP Extension

No FP operations in basic instruction set
    Not needed in simple embedded applications
    VFP implements FP ISA extension as optional coprocessor
    Since ARM10

Vector Floating Point (VFP)
    Single precision and double precision floating point computation
        ANSI/IEEE Std 754 compliant
    Single Instruction Multiple Data (SIMD) FP unit
        One FP operation performed in parallel on 256-bit vector
            8 single-precision (4-byte) FP numbers
            4 double-precision (8-byte) FP numbers
    Accesses 32 single precision FP registers (32-bit width)

VFPv3
    Operates on 8 double-precision (8-byte) FP numbers (512-bit vector)
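The element counts quoted above follow directly from the vector width divided by the element size. A minimal C sketch, with hypothetical names and a plain loop standing in for the hardware vector operation:

```c
#include <stddef.h>

enum { VECTOR_BITS = 256 };   /* vector width quoted on the slide */

/* Number of elements of element_bytes each that fit in vector_bits. */
size_t lanes(size_t vector_bits, size_t element_bytes) {
    return vector_bits / (8 * element_bytes);
}

/* One "vector add" on single-precision data: the same operation is applied
   to every element. The loop only models the result, not the timing. */
void vec_add_f32(const float *a, const float *b, float *out) {
    size_t n = lanes(VECTOR_BITS, sizeof(float));
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

With 4-byte floats, lanes(256, 4) is 8; with 8-byte doubles, lanes(256, 8) is 4 and lanes(512, 8) is 8, matching the figures given for VFP and VFPv3.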


VFP Instructions

Transfer
    Load / store FP values into registers from memory
    Transfer / copy 32-bit values between VFP and ARM GP registers
    Conversions between float, double, unsigned / signed integers

FPU
    Add
    Subtract
    Multiply
    Divide
    Square root
    Combined multiply-accumulate
    Compare FP values in registers

VFPv3
    Store FP constant in register
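A transfer between a VFP register and an ARM general-purpose register copies the 32-bit pattern unchanged, whereas a conversion changes the representation while keeping the numeric value as closely as possible. The C sketch below illustrates the difference; the function names are hypothetical and it assumes float is an IEEE 754 single-precision value.

```c
#include <stdint.h>
#include <string.h>

/* Transfer: copy the 32 bits of a float into an integer register image.
   No numeric conversion takes place. */
uint32_t transfer_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Conversion float to signed integer: the value is kept, truncated toward
   zero, and the bit pattern changes completely. */
int32_t convert_to_int(float f) {
    return (int32_t)f;
}

/* Conversion signed integer to float. */
float convert_to_float(int32_t i) {
    return (float)i;
}
```

For the value 1.0f a transfer yields the pattern 0x3F800000, while a conversion yields the integer 1.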


DSP Enhancements

Digital Signal Processing
    Operations on sampled-digitized-encoded analog information
    Typical applications
        Audio / video, speech processing, modems, medical instruments
    Typical algorithms
        D / A, normalization, correlation, convolution, FFT, encoding / decoding
        Real time control

Practical applications
    GSM-AMR (Adaptive Multi-Rate) speech codec in 3G GSM phones
    Servo motor control (HDD/DVD)
    Audio encode/decode (MP3, AAC, WMA)
    MPEG4 decode
    Voice and handwriting recognition


DSP Instructions

Purpose                      Operation                 Instruction
Signed MAC                   16 x 16 + 32 → 32         SMLAxy{cond}
Signed MAC wide              32 x 16 + 32 → 32         SMLAWy{cond}
Signed MAC long              16 x 16 + 64 → 64         SMLALxy{cond}
Signed multiply              16 x 16 → 32              SMULxy{cond}
Signed multiply wide         32 x 16 → 32              SMULWy{cond}
Saturating add               SAT(Rm + Rs)              QADD Rd, Rm, Rs
Saturating add double        SAT(Rm + SAT(Rs x 2))     QDADD Rd, Rm, Rs
Saturating subtract          SAT(Rm - Rs)              QSUB Rd, Rm, Rs
Saturating subtract double   SAT(Rm - SAT(Rs x 2))     QDSUB Rd, Rm, Rs
Count leading zeros          COUNTZ(Rm)                CLZ{cond} Rd, Rm

16: halfword    32: word    64: doubleword

Saturating
    Pin overflow result at max or min
    No modulo arithmetic or report of overflow

MAC
    Multiply-Accumulate (Rd ← R1 x R2 + R3)
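The saturating and multiply-accumulate operations can be modeled in C as follows. This is a sketch, not a description of the hardware: the function names are hypothetical, 64-bit intermediates stand in for the wider internal datapath, and smla16 models only the 16 x 16 + 32 form with wrap-around accumulation.

```c
#include <stdint.h>

/* SAT: pin a wide result at the largest or smallest 32-bit signed value. */
int32_t sat32(int64_t v) {
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* QADD / QSUB: saturating add and subtract. */
int32_t qadd(int32_t rm, int32_t rs) { return sat32((int64_t)rm + rs); }
int32_t qsub(int32_t rm, int32_t rs) { return sat32((int64_t)rm - rs); }

/* QDADD / QDSUB: double rs with saturation, then saturating add / subtract. */
int32_t qdadd(int32_t rm, int32_t rs) {
    return sat32((int64_t)rm + sat32((int64_t)rs * 2));
}
int32_t qdsub(int32_t rm, int32_t rs) {
    return sat32((int64_t)rm - sat32((int64_t)rs * 2));
}

/* CLZ: number of zero bits above the most significant 1 (32 for x = 0). */
unsigned clz32(uint32_t x) {
    unsigned n = 0;
    if (x == 0) return 32;
    while (!(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

/* Signed MAC, 16 x 16 + 32 -> 32. The sum wraps modulo 2^32; converting the
   wrapped sum back to int32_t relies on two's-complement behavior, which is
   implementation-defined in C but holds on common compilers. */
int32_t smla16(int16_t x, int16_t y, int32_t acc) {
    uint32_t sum = (uint32_t)((int32_t)x * y) + (uint32_t)acc;
    return (int32_t)sum;
}
```

Saturation matters in signal processing because a clipped sample is a small error, whereas a wrapped sample jumps from the largest positive to a large negative value.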


Performance Comparisons

ARM9 before DSP enhancements

ARM10 with DSP enhancements

Q15 / Q31: fixed-point integer formats (sign bit plus 15 or 31 fraction bits) used in DSP arithmetic
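As an illustration of Q15 arithmetic, the sketch below multiplies two Q15 fractions. It is a minimal example with hypothetical function names: a Q15 value q represents the real number q / 32768, the product of two Q15 values has 30 fraction bits, and shifting right by 15 returns it to Q15. Right-shifting a negative value is implementation-defined in C; the code assumes the arithmetic shift that common compilers provide.

```c
#include <stdint.h>

/* Real value represented by a Q15 number. */
double q15_to_double(int16_t q) {
    return q / 32768.0;
}

/* Q15 multiply with saturation.
   The only overflow case is (-1.0) x (-1.0) = +1.0, which does not fit. */
int16_t q15_mul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * b;    /* Q30 product */
    p >>= 15;                      /* back to Q15 (assumes arithmetic shift) */
    if (p > INT16_MAX) p = INT16_MAX;
    return (int16_t)p;
}
```

For example, 0x4000 represents 0.5, and q15_mul(0x4000, 0x4000) gives 0x2000, which represents 0.25.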


Apple iPhone 5 Hardware

Apple A6 Application Processor
    Dual ARMv7 cores + 3 GPU
    Hand-optimized layout

Memory
    Hynix 16 GB Flash

Network processors
    Skyworks GSM / GPRS / EDGE module
    Skyworks CDMA module
    Triquint WCDMA / HSUPA / UMTS
    Qualcomm LTE processor
    Murata WiFi module

Interface controllers
    Apple Power Management IC
    Apple Audio CODEC
    Texas Instruments touch screen controller
    STMicroelectronics 3-axis gyro
    STMicroelectronics 3-axis linear accelerometer

Ref: http://www.chipworks.com/blog/recentteardowns/2012/09/20/2467/


WinARM

Cross-compiler
    Develop applications for ARM in C / C++ on Windows platforms
    Extensive documentation

Tools
    GNU GCC compiler
    GNU-Utils for compiler/linker
    ARM header-files
    Sample applications with source-code

Download
    http://www.siwawi.arubi.uni-kl.de/avr_projects/arm_projects

Convert C code to assembly code
    arm-elf-gcc -S filename.c -o filename.asm


ARM Compilation 1-1

C source

    main()
    {
        int x = 0;
        while (x < 10)
        {
            x++;
        }
    }

Assembly source

    main:
        mov   ip, sp
        stmfd sp!, {fp, ip, lr, pc}
        sub   fp, ip, #4
        sub   sp, sp, #4
        mov   r3, #0
        str   r3, [fp, #-16]
        b     .L2
    .L3:
        ldr   r3, [fp, #-16]
        add   r3, r3, #1
        str   r3, [fp, #-16]
    .L2:
        ldr   r3, [fp, #-16]
        cmp   r3, #9
        ble   .L3
        ldmfd sp, {r3, fp, sp, pc}


ARM Compilation 1-2

Building data frame

    mov   ip, sp                  ; ip = sp
    stmfd sp!, {fp, ip, lr, pc}   ; push fp, ip, lr, pc to stack
    sub   fp, ip, #4              ; fp = ip - 4; sp = ip - 16 = fp - 12
    sub   sp, sp, #4              ; sp = fp - 16
    mov   r3, #0
    str   r3, [fp, #-16]          ; x = 0

Stack frame after these instructions (higher addresses at top)

    Address    Contents
    ip         caller's stack top    <- ip
    fp         pc                    <- fp
    fp - 4     lr
    fp - 8     ip
    fp - 12    fp
    fp - 16    x = 0                 <- sp
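The frame layout can be checked by simulating the six instructions on a toy machine. The sketch below is illustrative only: the Cpu struct, the word-indexed memory and the register values used in a test are hypothetical, and the value that a real core stores for pc in stmfd is replaced by whatever the pc field holds.

```c
#include <stdint.h>

enum { STACK_WORDS = 64 };

typedef struct {
    uint32_t sp, fp, ip, lr, pc, r3;
    uint32_t mem[STACK_WORDS];      /* toy memory: byte address / 4 = index */
} Cpu;

static void store(Cpu *c, uint32_t addr, uint32_t value) {
    c->mem[addr / 4] = value;
}

/* The frame-building sequence from the slide. */
void build_frame(Cpu *c) {
    c->ip = c->sp;                  /* mov   ip, sp */

    /* stmfd sp!, {fp, ip, lr, pc}: full descending stack, the lowest
       numbered register (fp = r11) lands at the lowest address. */
    c->sp -= 16;
    store(c, c->sp + 0,  c->fp);
    store(c, c->sp + 4,  c->ip);
    store(c, c->sp + 8,  c->lr);
    store(c, c->sp + 12, c->pc);

    c->fp = c->ip - 4;              /* sub   fp, ip, #4  */
    c->sp -= 4;                     /* sub   sp, sp, #4  */
    c->r3 = 0;                      /* mov   r3, #0      */
    store(c, c->fp - 16, c->r3);    /* str   r3, [fp, #-16] : x = 0 */
}
```

After build_frame, sp equals fp - 16, the saved fp sits at fp - 12, the saved ip at fp - 8, the saved lr at fp - 4, and the saved pc at fp itself.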


ARM Compilation 1-3

Executing loop

        b     .L2                    ; branch to .L2
    .L3:
        ldr   r3, [fp, #-16]         ; r3 ← x
        add   r3, r3, #1             ; r3++
        str   r3, [fp, #-16]         ; x ← r3
    .L2:
        ldr   r3, [fp, #-16]         ; r3 ← x
        cmp   r3, #9                 ; compare r3 , 9
        ble   .L3                    ; jump .L3 if r3 ≤ 9
        ldmfd sp, {r3, fp, sp, pc}   ; restore registers
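The control flow of the compiled loop can be written back in C with explicit labels, which makes two points visible: the test x < 10 is compiled as "r3 ≤ 9", and x is reloaded from and stored back to its stack slot on every pass because the code is unoptimized. The sketch uses hypothetical names; x_slot stands in for the word at [fp, #-16].

```c
#include <stdint.h>

/* Label-for-label C rendering of the loop. Returns the final value of x
   and reports how many times the loop body (.L3) was executed. */
int32_t run_loop(unsigned *body_count) {
    int32_t x_slot;                 /* memory word at [fp, #-16] */
    int32_t r3;
    unsigned count = 0;

    r3 = 0;                         /* mov r3, #0          */
    x_slot = r3;                    /* str r3, [fp, #-16]  */
    goto L2;                        /* b   .L2             */
L3:
    r3 = x_slot;                    /* ldr r3, [fp, #-16]  */
    r3 = r3 + 1;                    /* add r3, r3, #1      */
    x_slot = r3;                    /* str r3, [fp, #-16]  */
    count++;
L2:
    r3 = x_slot;                    /* ldr r3, [fp, #-16]  */
    if (r3 <= 9) goto L3;           /* cmp r3, #9 ; ble .L3 */

    *body_count = count;
    return x_slot;
}
```

Placing the test at the bottom (with an initial jump to it) costs one branch before the first pass but needs only a single conditional branch per iteration.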


ARM Compilation 2-1

C source

    main()
    {
        int x, y;
        for (x = 0; x < 10; x++)
        {
            y = x + 4;
        }
    }

Assembly source

    main:
        mov   ip, sp
        stmfd sp!, {fp, ip, lr, pc}
        sub   fp, ip, #4
        sub   sp, sp, #8             ; 2 integers
        mov   r3, #0
        str   r3, [fp, #-20]
        b     .L2
    .L3:
        ldr   r3, [fp, #-20]
        add   r3, r3, #4
        str   r3, [fp, #-16]
        ldr   r3, [fp, #-20]
        add   r3, r3, #1
        str   r3, [fp, #-20]
    .L2:
        ldr   r3, [fp, #-20]
        cmp   r3, #9
        ble   .L3
        sub   sp, fp, #12
        ldmfd sp, {fp, sp, pc}

Stack frame (higher addresses at top)

    Address    Contents
    ip         caller's stack top    <- ip
    fp         pc                    <- fp
    fp - 4     lr
    fp - 8     ip
    fp - 12    fp
    fp - 16    y
    fp - 20    x = 0                 <- sp


ARM Compilation 2-2

Executing loop

        b     .L2
    .L3:
        ldr   r3, [fp, #-20]         ; r3 ← x
        add   r3, r3, #4             ; r3 ← r3 + 4
        str   r3, [fp, #-16]         ; y ← r3
        ldr   r3, [fp, #-20]         ; r3 ← x
        add   r3, r3, #1             ; r3++
        str   r3, [fp, #-20]         ; x ← r3
    .L2:
        ldr   r3, [fp, #-20]         ; r3 ← x
        cmp   r3, #9                 ; compare r3 , 9
        ble   .L3                    ; jump .L3 if r3 ≤ 9
        sub   sp, fp, #12            ; sp ← fp - 12
        ldmfd sp, {fp, sp, pc}       ; restore registers
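The final two instructions undo the frame: sp is pointed at the saved fp, and a single load-multiple then restores fp, restores sp from the saved ip, and loads pc with the saved lr, which performs the return. The C sketch below models entry and exit on a toy machine to show that the caller's registers come back unchanged; the Machine struct, the function names and the test values are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t sp, fp, ip, lr, pc;
    uint32_t mem[64];               /* toy memory: byte address / 4 = index */
} Machine;

/* Entry sequence: save fp, ip, lr, pc and reserve local_bytes of locals. */
void prologue(Machine *m, uint32_t local_bytes) {
    m->ip = m->sp;                              /* mov   ip, sp */
    m->sp -= 16;                                /* stmfd sp!, {fp, ip, lr, pc} */
    m->mem[(m->sp + 0) / 4]  = m->fp;
    m->mem[(m->sp + 4) / 4]  = m->ip;
    m->mem[(m->sp + 8) / 4]  = m->lr;
    m->mem[(m->sp + 12) / 4] = m->pc;
    m->fp = m->ip - 4;                          /* sub   fp, ip, #4 */
    m->sp -= local_bytes;                       /* sub   sp, sp, #8 */
}

/* Exit sequence from the slide. */
void epilogue(Machine *m) {
    uint32_t base = m->fp - 12;                 /* sub   sp, fp, #12 */
    /* ldmfd sp, {fp, sp, pc}: ascending addresses go to ascending
       register numbers, so the saved ip becomes sp and saved lr becomes pc. */
    uint32_t saved_fp = m->mem[(base + 0) / 4];
    uint32_t saved_ip = m->mem[(base + 4) / 4];
    uint32_t saved_lr = m->mem[(base + 8) / 4];
    m->fp = saved_fp;
    m->sp = saved_ip;
    m->pc = saved_lr;
}
```

The saved pc word at the top of the frame is never reloaded; only the three words below it are used on exit.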