ECE200 – Computer Organization Chapter 2 - The Role of Performance


Page 1: ECE200 – Computer Organization Chapter 2 - The Role of Performance

ECE200 – Computer Organization

Chapter 2 - The Role of Performance

Page 2: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Homework 2

2.1-2.4, 2.10, 2.11, 2.13-2.17, 2.26-2.28, 2.39, 2.41-2.44

Page 3: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Outline for Chapter 2 lectures

How computer systems are generally evaluated

How architects make design tradeoffs

Performance metrics

Combining performance results

Amdahl’s Law

Page 4: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Evaluating computer system performance

A workload is a collection of programs

A user’s workload is the set of programs that they run day in and day out on their computer

Ideally, the user evaluates the performance of their workload on a given machine before deciding whether to purchase it

We common folk don’t get to do this!
Maybe General Motors can…

So how can we as customers make purchase decisions without being able to run our programs on different machines?

Page 5: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Benchmarks

Benchmarks are particular programs chosen to measure the “goodness” (usually performance) of a machine

Benchmark suites attempt to mimic the workloads of particular user communities
Scientific benchmarks, business benchmarks, consumer benchmarks, etc.

Computer manufacturers report performance results for benchmarks to aid users in making machine comparisons
The hope is that most user workloads can be represented well enough by a modest set of benchmark suites

Page 6: ECE200 – Computer Organization Chapter 2 - The Role of Performance

The SPEC benchmarks

SPEC = System Performance Evaluation Cooperative
Established in 1989 by computer companies to create a benchmark set and reporting practices for evaluating CPU and memory system performance

Provides a set of primarily integer benchmarks (SPECint) and a set of primarily floating point benchmarks (SPECfp)

Results reported by companies using SPEC:
Individual benchmark results
A composite integer result (single number)
A composite floating point result (single number)
Throughput results obtained by simultaneously running multiple copies of each individual benchmark
SPEC also has Java, web, and other benchmarks
www.spec.org

Page 7: ECE200 – Computer Organization Chapter 2 - The Role of Performance

The latest version: SPEC CPU2000

Comprised of SPECint2000 and SPECfp2000 benchmarks

SPECint2000 programs:
164.gzip: Data compression utility
175.vpr: FPGA circuit placement and routing
176.gcc: C compiler
181.mcf: Minimum cost network flow solver
186.crafty: Chess program
197.parser: Natural language processing
252.eon: Ray tracing
253.perlbmk: Perl
254.gap: Computational group theory
255.vortex: Object-oriented database
256.bzip2: Data compression utility
300.twolf: Place and route simulator

Page 8: ECE200 – Computer Organization Chapter 2 - The Role of Performance

The latest version: SPEC CPU2000

SPECfp2000 programs:
168.wupwise: Quantum chromodynamics
171.swim: Shallow water modeling
172.mgrid: Multi-grid solver in 3D potential field
173.applu: Parabolic/elliptic partial differential equations
177.mesa: 3D graphics library
178.galgel: Fluid dynamics: analysis of oscillatory instability
179.art: Neural network simulation: adaptive resonance theory
183.equake: Finite element simulation: earthquake modeling
187.facerec: Computer vision: recognizes faces
188.ammp: Computational chemistry
189.lucas: Number theory: primality testing
191.fma3d: Finite-element crash simulation
200.sixtrack: Particle accelerator model
301.apsi: Solves problems regarding temperature, wind, and distribution of pollutants

Page 9: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Benchmarks and the architect

Computer architects developing a new machine:
Want to know how fast their machine will run compared to current offerings
Also want to know the cost/performance benefit when making tradeoffs throughout the design process
Example: if I change the cache size, what will be the relative change in performance?

Standard benchmarks (like SPEC) are used by architects in evaluating design tradeoffs
Real customer applications (like PowerPoint) are also used

Page 10: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Architecting a new machine

Chicken and egg problem: Architects need to compare different design options before having the systems to run the benchmarks on

Solution of long ago: hardware prototyping
Build a system, evaluate it, re-design it, re-evaluate it, and so on
High level of circuit integration makes this nearly impossible today:
Difficult to get at internals of chip
Too costly ($ and time) to re-spin

Page 11: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Simulation and circuit analysis

To evaluate the performance of different design options we need:
The number of clock cycles required to run each benchmark for each design option
The clock frequency of each design option
Execution time = number of clock cycles / clock frequency (a worked example follows at the end of this slide)

Clock frequency is evaluated by circuit designers working in conjunction with the architects

The number of clock cycles is determined by the architects via architectural simulation
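For example (illustrative numbers, not from the original slides): a benchmark that takes 500,000,000 clock cycles on a design option clocked at 1 GHz (CT = 1 ns) has an execution time of 500,000,000 / 1,000,000,000 = 0.5 seconds.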

Page 12: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Architectural simulation

A model that faithfully mimics the operation of the new computer system is written in an HLL (high-level language)

The model executes machine code and collects performance statistics while it runs
Power dissipation may also be evaluated
Functional correctness is not tested at this level (it is tested later, when the HDL design is completed)

Various design parameters can be changed to allow architects to explore how combinations of design options impact performance

Page 13: ECE200 – Computer Organization Chapter 2 - The Role of Performance

An example: SimpleScalar

An architectural simulator written by Todd Austin (currently a Professor at Michigan)

Written in C

Executes MIPS programs among others

Widely used in the architecture community (especially by academics)

Models a high performance CPU, caches, and main memory

Publicly available at www.simplescalar.com

We use this a LOT in our research here
Check out www.ece.rochester.edu/research/acal

Page 14: ECE200 – Computer Organization Chapter 2 - The Role of Performance

SimpleScalar input file (partial)

-fetch:ifqsize 16 # instruction fetch queue size (in insts)

-fetch:mplat 2 # extra branch mis-prediction latency

-fetch:speed 2 # speed of front-end of machine relative to execution core

-bpred comb # branch predictor type {nottaken|taken|perfect|bimod|2lev|comb}

-bpred:bimod 4096 # bimodal predictor config (<table size>)

-bpred:2lev 1 4096 12 1 # 2-level predictor config (<l1size> <l2size> <hist_size> <xor>)

-bpred:comb 4096 # combining predictor config (<meta_table_size>)

-bpred:ras 64 # return address stack size (0 for no return stack)

-bpred:btb 2048 4 # BTB config (<num_sets> <associativity>)

-bpred:spec_update <null> # speculative predictors update in {ID|WB} (default non-spec)

-decode:width 8 # instruction decode B/W (insts/cycle)

-issue:width 4 # instruction issue B/W (insts/cycle)

-issue:inorder false # run pipeline with in-order issue

-issue:wrongpath true # issue instructions down wrong execution paths

-commit:width 4 # instruction commit B/W (insts/cycle)

-ruu:size 64 # register update unit (RUU) size

-lsq:size 16 # load/store queue (LSQ) size

Page 15: ECE200 – Computer Organization Chapter 2 - The Role of Performance

SimpleScalar input file (partial)

-cache:dl1 dl1:256:32:2:r # l1 data cache config, i.e., {<config>|none}

-cache:dl1lat 1 # l1 data cache hit latency (in cycles)

-cache:dl2 ul2:32768:32:4:l # l2 data cache config, i.e., {<config>|none}

-cache:dl2lat 15 # l2 data cache hit latency (in cycles)

-cache:il1 il1:512:32:4:r # l1 inst cache config, i.e., {<config>|dl1|dl2|none}

-cache:il1lat 1 # l1 instruction cache hit latency (in cycles)

-cache:il2 dl2 # l2 instruction cache config, i.e., {<config>|dl2|none}

-cache:il2lat 15 # l2 instruction cache hit latency (in cycles)

-cache:flush false # flush caches on system calls

-cache:icompress false # convert 64-bit inst addresses to 32-bit inst equivalents

-mem:lat 75 2 # memory access latency (<first_chunk> <inter_chunk>)

-mem:width 16 # memory access bus width (in bytes)

-res:ialu 2 # total number of integer ALU's available

-res:imult 2 # total number of integer multiplier/dividers available

-res:memport 2 # total number of memory system ports available (to CPU)

-res:fpalu 2 # total number of floating point ALU's available

-res:fpmult 2 # total number of floating point multiplier/dividers available

ETC

Page 16: ECE200 – Computer Organization Chapter 2 - The Role of Performance

SimpleScalar output file (partial)

sim_num_insn 400000000 # total number of instructions committed

sim_num_refs 211494189 # total number of loads and stores committed

sim_num_loads 152980862 # total number of loads committed

sim_num_stores 58513327.0000 # total number of stores committed

sim_num_branches 6017796 # total number of branches committed

sim_elapsed_time 7735 # total simulation time in seconds

sim_inst_rate 51712.9929 # simulation speed (in insts/sec)

sim_total_insn 432323546 # total number of instructions executed

sim_total_refs 214748856 # total number of loads and stores executed

sim_total_loads 155672898 # total number of loads executed

sim_total_stores 59075958.0000 # total number of stores executed

sim_total_branches 6921320 # total number of branches executed

sim_cycle 564744596 # total simulation time in cycles

sim_IPC 0.7083 # instructions per cycle

sim_CPI 1.4119 # cycles per instruction

sim_exec_BW 0.7655 # total instructions (mis-spec + committed) per cycle

sim_IPB 66.4695 # instruction per branch

Page 17: ECE200 – Computer Organization Chapter 2 - The Role of Performance

SimpleScalar output file (partial)

bpred_comb.lookups 7496023 # total number of bpred lookups

bpred_comb.updates 6017796 # total number of updates

bpred_comb.addr_hits 5386455 # total number of address-predicted hits

bpred_comb.dir_hits 5669839 # total number of direction-predicted hits (includes addr-hits)

bpred_comb.used_bimod 4795297 # total number of bimodal predictions used

bpred_comb.used_2lev 1222499 # total number of 2-level predictions used

bpred_comb.misses 347957 # total number of misses

bpred_comb.jr_hits 106171 # total number of address-predicted hits for JR's

bpred_comb.jr_seen 496372 # total number of JR's seen

bpred_comb.bpred_addr_rate 0.8951 # branch address-prediction rate (i.e., addr-hits/updates)

bpred_comb.bpred_dir_rate 0.9422 # branch direction-prediction rate (i.e., all-hits/updates)

bpred_comb.bpred_jr_rate 0.2139 # JR address-prediction rate (i.e., JR addr-hits/JRs seen)

bpred_comb.retstack_pushes 363292 # total number of address pushed onto ret-addr stack

bpred_comb.retstack_pops 645247 # total number of address popped off of ret-addr stack

Page 18: ECE200 – Computer Organization Chapter 2 - The Role of Performance

SimpleScalar output file (partial)

il1.accesses 446551244.0000 # total number of accesses

il1.hits 415761995 # total number of hits

il1.misses 30789249 # total number of misses

il1.replacements 30787201 # total number of replacements

il1.writebacks 0 # total number of writebacks

il1.invalidations 0 # total number of invalidations

il1.miss_rate 0.0689 # miss rate (i.e., misses/ref)

il1.repl_rate 0.0689 # replacement rate (i.e., repls/ref)

il1.wb_rate 0.0000 # writeback rate (i.e., wrbks/ref)

il1.inv_rate 0.0000 # invalidation rate (i.e., invs/ref)

dl1.accesses 207813682.0000 # total number of accesses

dl1.hits 204118070 # total number of hits

dl1.misses 3695612 # total number of misses

dl1.replacements 3695100 # total number of replacements

dl1.writebacks 1707742 # total number of writebacks

dl1.invalidations 0 # total number of invalidations

dl1.miss_rate 0.0178 # miss rate (i.e., misses/ref)

ETC

Page 19: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Performance metrics

Execution time
Also known as wall clock time, elapsed time, response time
Total time to complete a task
Example: hit RETURN, how long until the answer appears on the screen

Throughput
Also known as bandwidth
Total number of operations (such as instructions, memory requests, programs) completed per unit time (a rate)

Performance improves when execution time is reduced or throughput is increased

Page 20: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Breaking down execution time

Breaking execution time into components allows designers to focus on particular machine levels
I/O operations are often overlapped with the execution of another task on the CPU

I/O system may be designed almost independently of rest of system

CPU time ignores the I/O component of execution time

[Figure: block diagram of the system. The Central Processing Unit fetches instructions from a Level 1 Instruction Cache and operands from a Level 1 Data Cache, both backed by a Level 2 Cache and Main Memory; an Input/Output interconnect attaches the disk, keyboard/mouse, network, etc.]

Page 21: ECE200 – Computer Organization Chapter 2 - The Role of Performance

CPU Time as a performance metric

The time spent by the CPU, caches, and main memory in executing the workload

Ignores any idle time due to I/O activity
Two components:
User CPU Time: the time spent by user programs
System CPU Time: the time spent by the operating system (OS)

Often only User CPU Time is evaluated…
Many standard benchmarks like SPEC have little OS activity
Many architectural simulators do not support the OS
There are exceptions, such as SimOS from Stanford
OS code is often not available to evaluate

Page 22: ECE200 – Computer Organization Chapter 2 - The Role of Performance

CPU Time breakdown

CPU Time = CYCLES x CT = INST x CPI x CT

CYCLES: Total cycles to execute the program

CT: Clock cycle time (clock period) = 1 / clock frequency

INST: Total number of assembly instructions executed

CPI: Average number of clock cycles executed per instruction = total clock cycles / total instructions executed (CYCLES/INST)
Different instruction types (add, divide, etc.) may take different numbers of clock cycles to execute

Page 23: ECE200 – Computer Organization Chapter 2 - The Role of Performance

CPU Time example

CYCLES = 6

INST = 4

CPI = 6/4 = 1.5

CT = 1ns

CPU Time = CYCLES x CT = 6 x 1ns = 6ns

lw $4, 0($2)      2 cycles
lw $6, 4($2)      2 cycles
add $4, $4, $6    1 cycle
sw $4, 0($2)      1 cycle
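As a minimal sketch (not part of the original slides), the same arithmetic can be written out in a few lines of C, using the per-instruction cycle counts listed above:

#include <stdio.h>

int main(void) {
    /* Cycle counts from the slide: lw (2), lw (2), add (1), sw (1) */
    int cycles_per_inst[] = {2, 2, 1, 1};
    int n_inst = 4;

    long long cycles = 0;
    for (int i = 0; i < n_inst; i++)
        cycles += cycles_per_inst[i];

    double cpi   = (double)cycles / n_inst;  /* 6 / 4 = 1.5 */
    double ct_ns = 1.0;                      /* clock cycle time in ns */

    /* Both forms of the CPU Time equation give the same answer: 6 ns */
    printf("CPU Time = CYCLES x CT     = %.1f ns\n", (double)cycles * ct_ns);
    printf("CPU Time = INST x CPI x CT = %.1f ns\n", n_inst * cpi * ct_ns);
    return 0;
}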

Page 24: ECE200 – Computer Organization Chapter 2 - The Role of Performance

What parts of CPU Time (INST, CPI, CT)…

Are influenced by the ISA designer?

Are influenced by the compiler writer?

Are influenced by the microarchitect?

Page 25: ECE200 – Computer Organization Chapter 2 - The Role of Performance

What parts of CPU Time can be ignored if

The programs are already compiled and you are designing the microarchitecture?

The ISA and microarchitecture are fixed and you are developing a compiler?

You are comparing two machines that have different ISAs?

Page 26: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Latency

Latency = number of clock cycles required to do something
Access cache, execute a particular instruction, etc.
Alternate definition: amount of time (ns) to do something

Designers may increase latency in order to decrease CT

Why might the higher latency option increase the total delay?

Why might the higher latency option perform better?

[Figure: a single block of logic with a 1 ns delay has latency = 1 cycle at CT = 1 ns; splitting the same logic into two stages of 0.6 ns and 0.55 ns gives latency = 2 cycles at CT = 0.6 ns]

Page 27: ECE200 – Computer Organization Chapter 2 - The Role of Performance

The CYCLES-CT tradeoff

A feature that improves either CYCLES or CT very often worsens the other

Examples:
Increasing cache size to reduce CYCLES at the expense of CT
CYCLES is reduced because slow main memory is accessed less often
Larger cache operates at a slower speed, may have to increase CT
Increasing machine parallelism to reduce CYCLES at the expense of CT

[Figure: one multiplier handling 1 multiply at a time vs. two multipliers handling 2 multiplies at a time (downsides?)]

Page 28: ECE200 – Computer Organization Chapter 2 - The Role of Performance

The INST-CPI tradeoff

In creating an assembly equivalent to an HLL program, the compiler writer may have several choices that differ in INST and CPI
The best solution may involve more, simpler instructions

Example: multiply by constant 5

Option 1: muli $2, $4, 5     4 cycles
Option 2: sll $2, $4, 2      1 cycle
          add $2, $2, $4     1 cycle

The shift-and-add version uses more instructions (2 vs. 1) but fewer total cycles (2 vs. 4), so it has lower CPU time for the same CT.

Page 29: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Summarizing performance results

Useful to generate a single performance number from multiple benchmark results

For execution time (and its derivatives):
Total the execution time of all the n programs
Can also use the Arithmetic Mean:

AM = (Time_1 + Time_2 + … + Time_n) / n

where Time_i is the execution time of the ith program
The Weighted AM assigns weights to each program:

WAM = Weight_1 x Time_1 + Weight_2 x Time_2 + … + Weight_n x Time_n

where Weight_i is the weight assigned to the ith program
All weights add to 1
Weighting example: choose the weights so that each Weight_i x Time_i term is equalized across programs
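A small illustrative C sketch (not from the slides; the execution times and weights below are made-up numbers) showing how AM and WAM are computed:

#include <stdio.h>

int main(void) {
    /* Hypothetical execution times (seconds) for three benchmark programs */
    double time[]   = {2.0, 4.0, 10.0};
    /* Hypothetical weights, chosen so that they add to 1 */
    double weight[] = {0.5, 0.3, 0.2};
    int n = 3;

    double am = 0.0, wam = 0.0;
    for (int i = 0; i < n; i++) {
        am  += time[i];              /* AM: total the times, then divide by n */
        wam += weight[i] * time[i];  /* WAM: weights already sum to 1 */
    }
    am /= n;

    printf("AM  = %.2f s\n", am);    /* (2 + 4 + 10) / 3 = 5.33 */
    printf("WAM = %.2f s\n", wam);   /* 0.5*2 + 0.3*4 + 0.2*10 = 4.20 */
    return 0;
}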

Page 30: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Summarizing performance results

For performance expressed as a rate, e.g., instructions/sec:
Use the Harmonic Mean:

HM = n / (1/Rate_1 + 1/Rate_2 + … + 1/Rate_n)

where Rate_i is the rate of the ith program
There is also a Weighted HM
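And a corresponding C sketch for the Harmonic Mean (again with made-up rates):

#include <stdio.h>

int main(void) {
    /* Hypothetical instruction rates (MIPS) for three benchmark programs */
    double rate[] = {100.0, 200.0, 400.0};
    int n = 3;

    double sum_recip = 0.0;
    for (int i = 0; i < n; i++)
        sum_recip += 1.0 / rate[i];   /* the HM averages reciprocals of the rates */

    printf("HM = %.1f MIPS\n", n / sum_recip);   /* 3 / 0.0175 = 171.4 */
    return 0;
}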

Page 31: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Amdahl’s Law (very famous)

The law of diminishing returns

Performance improvement of an enhancement is limited by the fraction of time the enhancement is used

execution_time_new = execution_time_old x [(1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced]

where execution_time_old is the execution time without the enhancement, fraction_enhanced is the fraction of the time (NOT the instructions) that can take advantage of the enhancement, and speedup_enhanced is the speedup obtained when using the enhancement

Page 32: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Amdahl’s Law example

Assume multiply operations constitute 20% of the execution time of a benchmark

What is the execution time improvement for a new multiplier that provides a 10 times speedup over the existing multiplier?

execution_time_new = execution_time_old x [(1 - 0.2) + 0.2 / 10] = 0.82 x execution_time_old

speedup = execution_time_old / execution_time_new = 1 / 0.82 = 1.22
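A minimal C sketch (not part of the original slides) that evaluates the same formula and reproduces the 1.22 result:

#include <stdio.h>

/* Amdahl's Law: overall speedup given the fraction of execution time that can
   use an enhancement and the speedup of the enhanced portion */
static double amdahl_speedup(double fraction_enhanced, double speedup_enhanced) {
    double new_time_fraction = (1.0 - fraction_enhanced)
                               + fraction_enhanced / speedup_enhanced;
    return 1.0 / new_time_fraction;
}

int main(void) {
    /* Multiplies are 20% of execution time; the new multiplier is 10x faster */
    printf("speedup = %.2f\n", amdahl_speedup(0.2, 10.0));   /* prints 1.22 */
    return 0;
}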

Page 33: ECE200 – Computer Organization Chapter 2 - The Role of Performance

Questions?