
Workload-Driven Evaluation

CS 258, Spring 99

David E. Culler

Computer Science Division

U.C. Berkeley


Workload-Driven Evaluation

• Evaluating real machines

• Evaluating an architectural idea or trade-offs

=> need good metrics of performance

=> need to pick good workloads

=> need to pay attention to scaling: many factors involved

Working Set Perspective

• Hierarchy of working sets
– At the first-level cache (fully associative, one-word blocks), working sets are inherent to the algorithm
» working set curve for the program
– Traffic from any type of miss can be local or nonlocal (communication)
• At a given level of the hierarchy (to the next farther one)

[Figure: data traffic vs. replication capacity (cache size). As capacity grows past the first working set and then the second, capacity-generated traffic (including conflicts) falls away, leaving cold-start (compulsory) traffic, inherent communication, and other capacity-independent communication.]

Example Application Set

Table 4.1: General Statistics about Application Programs

Application | Input Data Set | Total Instr. (M) | Total FLOPS (M) | Total Refs (M) | Total Reads (M) | Total Writes (M) | Shared Reads (M) | Shared Writes (M) | Barriers | Locks
LU | 512 x 512 matrix, 16 x 16 blocks | 489.52 | 92.20 | 151.07 | 103.09 | 47.99 | 92.79 | 44.74 | 66 | 0
Ocean | 258 x 258 grid, tolerance = 10^-7, 4 time-steps | 376.51 | 101.54 | 99.70 | 81.16 | 18.54 | 76.95 | 16.97 | 364 | 1,296
Barnes-Hut | 16-K particles, theta = 1.0, 3 time-steps | 2,002.74 | 239.24 | 720.13 | 406.84 | 313.29 | 225.04 | 93.23 | 7 | 34,516
Radix | 256-K points, radix = 1,024 | 84.62 | — | 14.19 | 7.81 | 6.38 | 3.61 | 2.18 | 11 | 16
Raytrace | Car scene | 833.35 | — | 290.35 | 210.03 | 80.31 | 161.10 | 22.35 | 0 | 94,456
Radiosity | Room scene | 2,297.19 | — | 769.56 | 486.84 | 282.72 | 249.67 | 21.88 | 10 | 210,485
Multiprog: User | SGI IRIX 5.2, two pmakes + two compress jobs | 1,296.43 | — | 500.22 | 350.42 | 149.80 | — | — | — | —
Multiprog: Kernel | (same run as above) | 668.10 | — | 212.58 | 178.14 | 34.44 | — | — | — | 621,505

For the parallel programs, shared reads and writes simply refer to all nonstack references issued by the application processes; not all such references point to data that is truly shared by multiple processes. The Multiprog workload is not a parallel application, so it does not access shared data. A dash in a table entry means the measurement is not applicable to, or not taken for, that application (e.g., Radix has no floating-point operations). (M) denotes that the measurements in that column are in millions.

Working Sets (P = 16, fully associative, 8-byte blocks)

[Figure: miss rate (%) vs. cache size (1 KB to 1,024 KB) for (a) LU, (b) Ocean, (c) Barnes-Hut, (d) Radiosity, (e) Raytrace, and (f) Radix, with the L1 and L2 (and, for Raytrace, L0) working-set knees marked on each curve.]

Working Sets Change with P (NPB)

• 8-fold reduction in miss rate from 4 to 8 processors

Where the Time Goes: NPB LU-a

[Figure: stacked bars of total time on 4, 8, 16, and 32 processors, broken into Compute, Send, Receive, and Wait components.]

False Sharing Misses: Artifactual Communication

• Different processors update different words in the same block
• Hardware treats it as sharing
– the cache block is the unit of coherence
• The block ping-pongs between caches

[Figure: a contiguous memory layout partitioned among processors P0 through P8; a cache block straddles a partition boundary.]
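To make the figure concrete, here is a minimal sketch of counting how many cache blocks straddle partition boundaries under a contiguous 1-D layout; the array length, element size, and block size are illustrative assumptions, not values from the slide.

```python
# Count cache blocks written by two processors because a contiguous array
# partition boundary falls in the middle of a block (false sharing).

def straddling_blocks(n_elems, n_procs, elem_bytes=8, block_bytes=64):
    per_proc = n_elems // n_procs          # contiguous chunk per processor
    straddles = 0
    for boundary in range(per_proc, per_proc * n_procs, per_proc):
        if (boundary * elem_bytes) % block_bytes != 0:
            straddles += 1                 # boundary lands inside a block
    return straddles

# 9 processors (P0..P8 in the figure) on a made-up 1,026-element row:
print(straddling_blocks(1026, 9), "of 8 boundaries fall mid-block")
```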

Questions in Scaling

• Scaling a machine: can scale power in many ways
– Assume adding identical nodes, each bringing memory
• Problem size: a vector of input parameters, e.g. N = (n, q, t)
– Determines the work done
– Distinct from data set size and memory usage
• Under what constraints should the application be scaled?
– What are the appropriate metrics for performance improvement?
» work is no longer fixed, so time alone is not enough
• How should the application be scaled?

Under What Constraints to Scale?

• Two types of constraints:
– User-oriented, e.g. particles, rows, transactions, I/Os per processor
– Resource-oriented, e.g. memory, time
• Which is more appropriate depends on the application domain
– User-oriented is easier for the user to think about and change
– Resource-oriented is more general, and often more realistic
• Resource-oriented scaling models:
– Problem constrained (PC)
– Memory constrained (MC)
– Time constrained (TC)
• (TPC: transactions, users, and terminals scale with “computing power”)
• Growth under MC and TC may be hard to predict

Problem Constrained Scaling

• User wants to solve the same problem, only faster
– Video compression
– Computer graphics
– VLSI routing
• But limited when evaluating larger machines

Speedup_PC(p) = Time(1) / Time(p)

Time Constrained Scaling

• Execution time is kept fixed as the system scales
– User has a fixed time to use the machine or wait for a result
• Performance = Work/Time as usual, and time is fixed, so

Speedup_TC(p) = Work(p) / Work(1)

• How to measure work?
– Execution time on a single processor? (thrashing problems)
– Should be easy to measure, ideally analytical and intuitive
– Should scale linearly with sequential complexity
» or ideal speedup will not be linear in p (e.g. number of rows in a matrix program)
– If an intuitive application measure cannot be found, as is often true, measure execution time with an ideal memory system on a uniprocessor (e.g. pixie)

Memory Constrained Scaling

• Scale so memory usage per processor stays fixed
• Scaled speedup: Time(1) / Time(p) for the scaled-up problem
– Hard to measure Time(1), and inappropriate

Speedup_MC(p) = (Work(p) / Time(p)) x (Time(1) / Work(1)) = Increase in Work / Increase in Time

• Can lead to large increases in execution time
– If work grows faster than linearly in memory usage
– e.g. matrix factorization
» a 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor; with 1,000 processors, can run a 320K-by-320K matrix, but ideal parallel time grows to 32 hours!
» with 10,000 processors, 100 hours ...
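A minimal sketch of the arithmetic behind this example, assuming O(n^3) factorization work, 8-byte matrix elements (so memory is 8n^2 bytes), and ideal parallel time = serial time x work growth / p:

```python
# Memory-constrained (MC) scaling arithmetic for dense matrix factorization.
# Baseline from the slide: n = 10,000 takes 800 MB and 1 hour on 1 processor.

def mc_scaled(n1, t1_hours, p):
    """Scale n so memory per processor stays fixed; return scaled n and time."""
    n_p = n1 * p ** 0.5                   # memory grows by p => n grows by sqrt(p)
    work_growth = (n_p / n1) ** 3         # O(n^3) factorization work
    return n_p, t1_hours * work_growth / p

for p in (1, 1_000, 10_000):
    n_p, t_p = mc_scaled(10_000, 1.0, p)
    mem_gb = 8 * n_p ** 2 / 1e9           # total memory footprint in GB
    print(f"p={p:6d}  n={n_p:9.0f}  mem={mem_gb:7.1f} GB  ideal time={t_p:6.1f} h")
# p=1 reproduces 0.8 GB / 1 hour; p=1,000 gives n ~ 316K (the slide's ~320K)
# and ~32 hours; p=10,000 gives ~100 hours.
```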


Scaling Summary

• Under any scaling rule, relative structure of the problem changes with P

– PC scaling: per-processor portion gets smaller

– MC & TC scaling: total problem gets larger

• Need to understand hardware/software interactions with scale

• For given problem, there is often a natural scaling rule

– example: equal error scaling

Types of Workloads

– Kernels: matrix factorization, FFT, depth-first tree search
– Complete applications: ocean simulation, crew scheduling, database
– Multiprogrammed workloads

[Spectrum: multiprogrammed workloads → applications → kernels → microbenchmarks. Toward the left: realistic and complex; higher-level interactions are what really matter. Toward the right: easier to understand, controlled, repeatable; good for basic machine characteristics.]

Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.

NOW Ultra 170 vs. Enterprise 5000

• Workstation UPA
– crossbar
• SMP
– switch between UltraSPARC coherence protocol (MOESI) and bus protocol (MSI)

[Figure: NOW node: UltraSPARC with L2 cache and 64 MB memory on the UPA; a Myricom Lanai NIC (37.5 MHz processor, 256 KB SRAM, 3 DMA units) on the 25 MHz S-bus; 160 MB/s bidirectional links to 8-port wormhole switches. Enterprise 5000: Gigaplane bus (256-bit data, 41-bit address, 83 MHz); 4 processing cards, each with 2 Ultra1 CPUs (512 KB L2) and 2 x 64 MB DRAM banks; 3 I/O cards, each with 2 x 64-bit, 25 MHz SBuses; multiple Myricom Lanai network interface cards; 2 FiberChannel, 100bT, SCSI.]

Microbenchmarks

• Memory access latency (512 KB L2, 64 B blocks)
– Enterprise 5000: 51 cycles; Ultra 170: 44 cycles
– other L2: 84 cycles
• Memory copy bandwidth
– Enterprise 5000: 184 MB/s; Ultra 170: 168 MB/s
• Arithmetic, floating point, graphics, ...

Coverage: Stressing Features

• Easy to mislead with workloads
– Choose those with features for which the machine is good, avoid others
• Some features of interest:
– Compute vs. memory vs. communication vs. I/O bound
– Working set size and spatial locality
– Local memory and communication bandwidth needs
– Importance of communication latency
– Fine-grained or coarse-grained
» data access, communication, task size
– Synchronization patterns and granularity
– Contention
– Communication patterns
• Choose workloads that cover a range of properties

Coverage: Levels of Optimization

• Many ways in which an application can be suboptimal
– Algorithmic, e.g. assignment, blocking
– Data structuring, e.g. 2-d or 4-d arrays for SAS grid problem
– Data layout, distribution and alignment, even if properly structured
– Orchestration
» contention
» long versus short messages
» synchronization frequency and cost, ...
– Also, random problems with “unimportant” data structures
• Optimizing applications takes work
– Many practical applications may not be very well optimized
• May examine selected different levels to test robustness of system

Concurrency

• Should have enough to utilize the processors
– If load imbalance dominates, there may not be much the machine can do
– (Still, useful to know what kinds of workloads/configurations don't have enough concurrency)
• Algorithmic speedup: useful measure of concurrency/imbalance
– Speedup (under scaling model) assuming all memory/communication operations take zero time
– Ignores memory system; measures imbalance and extra work
– Uses the PRAM machine model (Parallel Random Access Machine)
» unrealistic, but widely used for theoretical algorithm development
• At the least, should isolate performance limitations due to program characteristics a machine cannot do much about (concurrency) from those it can.

Workload/Benchmark Suites

• Numerical Aerodynamic Simulation (NAS)
– Originally pencil-and-paper benchmarks
• SPLASH/SPLASH-2
– Shared address space parallel programs
• ParkBench
– Message-passing parallel programs
• ScaLAPACK
– Message-passing kernels
• TPC
– Transaction processing
• SPEC-HPC
• . . .


Evaluating a Fixed-size Machine

• Many critical characteristics depend on problem size

– Inherent application characteristics

» concurrency and load balance (generally improve with problem size)

» communication to computation ratio (generally improve)

» working sets and spatial locality (generally worsen and improve, resp.)

– Interactions with machine organizational parameters

– Nature of the major bottleneck: comm., imbalance, local access...

• Insufficient to use a single problem size

• Need to choose problem sizes appropriately
– Understanding of workloads will help

Our Problem Today

• Evaluate architectural alternatives
– protocols, block size
• Fix machine size and characteristics
• Pick problems and problem sizes

Steps in Choosing Problem Sizes

1. Appeal to higher powers
– May know that users care only about a few problem sizes
– But not generally applicable
2. Determine the range of useful sizes
– Below it, performance is bad or the time distribution across phases is unrealistic
– Above it, execution time or memory usage is too large
3. Use understanding of inherent characteristics
– Communication-to-computation ratio, load balance...
– For the grid solver, perhaps at least 32-by-32 points per processor
» that is a 40 MB/s c-to-c ratio with a 200 MHz processor
– No need to go below 5 MB/s (a subgrid larger than 256-by-256 per processor) from this perspective, or a 2K-by-2K grid overall
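A minimal sketch of the bandwidth arithmetic behind these two endpoints. The 5 cycles per grid point and 8-byte words are assumptions chosen to reproduce the slide's 40 MB/s and 5 MB/s figures, with each sweep exchanging the four edges of an n x n subgrid:

```python
# Communication-to-computation bandwidth demand for a nearest-neighbor
# grid solver, per processor.

def comm_bandwidth(n, mhz=200, cycles_per_point=5, bytes_per_word=8):
    """Per-processor communication bandwidth demand, in MB/s."""
    comm_bytes = 4 * n * bytes_per_word                 # perimeter traffic per sweep
    comp_secs = cycles_per_point * n * n / (mhz * 1e6)  # compute time per sweep
    return comm_bytes / comp_secs / 1e6

for n in (32, 64, 128, 256):
    print(f"{n:4d} x {n:<4d} subgrid: {comm_bandwidth(n):5.1f} MB/s")
# 32x32 -> 40 MB/s and 256x256 -> 5 MB/s, matching the slide's endpoints.
```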


Steps in Choosing Problem Sizes

• Variation of characteristics with problem size usually smooth

– So, for inherent comm. and load balance, pick some sizes along range

• Interactions of locality with architecture often have thresholds (knees)

– Greatly affect characteristics like local traffic, artifactual comm.

– May require problem sizes to be added

» to ensure both sides of a knee are captured

– But also help prune the design space

Choosing Problem Sizes (contd.)

4. Use temporal locality and working sets
– Whether a working set fits or not dramatically changes local traffic and artifactual communication
» e.g. Raytrace's working sets are nonlocal, Ocean's are local
– Choose problem sizes on both sides of a knee if realistic
» critical to understand the growth rate of working sets
– Also try to pick one very large size (exercises TLB misses etc.)
– Solver: the first working set (2 subrows) usually fits; the second (full partition) may or may not
» it doesn't for the largest size (2K), so add a 4K-by-4K grid
» add 16K as a large size, so grid sizes are now 256, 1K, 2K, 4K, 16K (in each dimension)

[Figure: (a) miss ratio vs. cache size, with knees at working sets WS1, WS2, and WS3; (b) percentage of the working set that fits in a cache of size C vs. problem size, for problems 1 through 4.]

Multiprocessor Simulation

• Simulation runs on a uniprocessor (can be parallelized too)
– Simulated processes are interleaved on the processor
• Two parts to a simulator:
– Reference generator: plays the role of the simulated processors
» and schedules simulated processes based on simulated time
– Simulator of the extended memory hierarchy
» simulates operations (references, commands) issued by the reference generator
• Coupling or information flow between the two parts varies
– Trace-driven simulation: from generator to simulator
– Execution-driven simulation: in both directions (more accurate)
• Simulator keeps track of simulated time and detailed statistics

Execution-driven Simulation

• Memory hierarchy simulator returns simulated time information to the reference generator, which uses it to schedule simulated processes

[Figure: processes P1...Pp with caches $1...$p and memories Mem1...Memp; the reference generator feeds the memory and interconnect simulator (including the network), and timing information flows back.]
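A minimal sketch of this feedback loop, assuming simulated processes can be modeled as Python generators that yield memory references; the toy two-level timing model and all names are illustrative, not from any real simulator:

```python
import heapq

BLOCK = 64  # bytes; granularity of the toy cache model

def process(pid, n):
    """A fake 'processor': walk a private array, one reference per step."""
    for i in range(n):
        yield ("read", pid * 4096 + i * 8)      # (op, address)

def memory_latency(op, addr, cache):
    """Toy memory hierarchy: 1 cycle on a hit, 50 on a miss."""
    block = addr // BLOCK
    if block in cache:
        return 1
    cache.add(block)
    return 50

def simulate(nprocs=4, nrefs=10):
    caches = [set() for _ in range(nprocs)]
    procs = [process(p, nrefs) for p in range(nprocs)]
    # Always run the process that is earliest in simulated time: this is the
    # execution-driven coupling, since latencies feed back into scheduling.
    ready = [(0, p) for p in range(nprocs)]
    heapq.heapify(ready)
    while ready:
        now, p = heapq.heappop(ready)
        try:
            op, addr = next(procs[p])
        except StopIteration:
            continue                             # this process has finished
        lat = memory_latency(op, addr, caches[p])
        heapq.heappush(ready, (now + lat, p))    # reschedule at its new time

simulate()
```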

Difficulties in Simulation-based Evaluation

• Cost of simulation (in time and memory)
– cannot simulate the problem/machine sizes we care about
– have to use scaled-down problem and machine sizes
» how to scale down and stay representative?
• Huge design space
– application parameters (as before)
– machine parameters (depending on generality of evaluation context)
» number of processors
» cache/replication size
» associativity
» granularities of allocation, transfer, coherence
» communication parameters (latency, bandwidth, occupancies)
– cost of simulation makes it all the more critical to prune the space

Choosing Parameters

• Problem size and number of processors
– Use inherent-characteristics considerations as discussed earlier
– For example, a low c-to-c ratio will not allow block transfer to help much
• Cache/replication size
– Choose based on knowledge of the working set curve
– Choosing cache sizes for a given problem and machine size is analogous to choosing problem sizes for a given cache and machine size, discussed earlier
– Whether or not the working set fits affects block transfer benefits greatly
» if data are local, not fitting makes communication relatively less important
» if nonlocal, not fitting can increase artifactual comm., so block transfer has more opportunity
– Sharp knees in the working set curve can help prune the space (a sketch follows)
» knees can be determined by analysis or by very simple simulation
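As an illustration of "very simple simulation", here is a minimal sketch that locates a knee as the largest relative miss-rate drop between successive cache sizes; the sample curve below is fabricated for illustration:

```python
# Find the "knee" of a working-set curve: the cache size just after the
# sharpest relative drop in miss rate.

def find_knee(sizes_kb, miss_rates):
    """Return the cache size immediately after the largest relative drop."""
    best_i, best_drop = 0, 0.0
    for i in range(1, len(miss_rates)):
        drop = (miss_rates[i - 1] - miss_rates[i]) / max(miss_rates[i - 1], 1e-12)
        if drop > best_drop:
            best_i, best_drop = i, drop
    return sizes_kb[best_i]

sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]    # KB
rates = [40, 38, 36, 20, 19, 18, 17, 16, 4, 3.8, 3.7]    # % (made-up curve)
print("knee near", find_knee(sizes, rates), "KB")
# -> 256 KB: the second working set in this made-up curve.
```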

Our Cache Sizes (16 x 1 MB, 16 x 64 KB)

[Figure: the working-set curves again: miss rate (%) vs. cache size (1 KB to 1,024 KB) for (a) LU, (b) Ocean, (c) Barnes-Hut, (d) Radiosity, (e) Raytrace, and (f) Radix, with L1 and L2 working sets marked.]

Focus on Protocol Trade-offs

• Methodology:
– Use SPLASH-2 and the multiprogrammed workload (a la Ch. 4)
– Choose cache parameters per the earlier methodology
» default: 1 MB, 4-way cache, 64-byte blocks, 16 processors; a 64 KB cache for some runs
– Focus on frequencies, not end performance, for now
» transcends architectural details, but is not what we are really after
– Use an idealized memory performance model to avoid changes of reference interleaving across processors with machine parameters
» cheap simulation: no need to model contention
– Run the program on a parallel machine simulator
» collect a trace of cache state transitions
» analyze properties of the transitions

Bandwidth per Transition

Bus Transaction | Address/Cmd (bytes) | Data (bytes)
BusRd | 6 | 64
BusRdX | 6 | 64
BusWB | 6 | 64
BusUpgd | 6 | —

Ocean data cache state-transition frequencies (per 1,000 references; rows = from state, columns = to state):

from\to | NP | I | E | S | M
NP | 0 | 0 | 1.25 | 0.96 | 0.001
I | 0.64 | 0 | 0 | 1.87 | 0.001
E | 0.20 | 0 | 14.00 | 0.0 | 2.24
S | 0.42 | 2.50 | 0 | 134.72 | 2.24
M | 2.63 | 0.00 | 0 | 2.30 | 843.57

[Figure: MESI state diagram annotated with the processor/bus action pairs on each transition, e.g. PrRd/BusRd(S), PrWr/BusRdX, PrRd/—, PrWr/—, BusRd/Flush, BusRdX/Flush.]
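A minimal sketch of turning this frequency matrix into bus traffic. The mapping from state transitions to bus transactions below is a simplified assumption (every entry into E/S/M from NP or I is charged a full block read, S -> M an upgrade, M -> NP/I a writeback); the real protocol accounting is more detailed:

```python
ADDR, DATA = 6, 64  # bytes per transaction, from the table above

# Ocean frequency matrix, transitions per 1,000 references: F[from][to]
F = {
    "NP": {"NP": 0, "I": 0, "E": 1.25, "S": 0.96, "M": 0.001},
    "I":  {"NP": 0.64, "I": 0, "E": 0, "S": 1.87, "M": 0.001},
    "E":  {"NP": 0.20, "I": 0, "E": 14.00, "S": 0.0, "M": 2.24},
    "S":  {"NP": 0.42, "I": 2.50, "E": 0, "S": 134.72, "M": 2.24},
    "M":  {"NP": 2.63, "I": 0.00, "E": 0, "S": 2.30, "M": 843.57},
}

def bytes_for(src, dst):
    """Assumed bus cost of one transition (simplified mapping)."""
    cost = 0
    if src in ("NP", "I") and dst in ("E", "S", "M"):
        cost += ADDR + DATA        # BusRd or BusRdX brings the block in
    elif src == "S" and dst == "M":
        cost += ADDR               # BusUpgd: address only (cheap, per the slide)
    if src == "M" and dst in ("NP", "I"):
        cost += ADDR + DATA        # dirty block leaves the cache: BusWB
    return cost

traffic = sum(F[s][d] * bytes_for(s, d) for s in F for d in F[s])
print(f"~{traffic:.0f} bytes of bus traffic per 1,000 references")
```

Note how the E -> M transition costs nothing here: it is a silent local upgrade, which is why the slide later observes that E -> M transitions being infrequent matters little.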

Bandwidth Trade-off

[Figure: bus traffic (MB/s, up to ~200) under the Illinois (Ill), three-state (3St), and three-state read-exclusive (3St-RdEx) protocols for Barnes, LU, Ocean, Radiosity, Radix, and Raytrace, split into address/command and data bus traffic; a second chart (up to ~45 MB/s) shows the multiprogrammed workload's Appl-Code, Appl-Data, OS-Code, and OS-Data components.]

• E -> M transitions are infrequent; BusUpgrade is cheap
• 1 MB cache, 200 MIPS / 200 MFLOPS processor

Smaller (64 KB) Caches

[Figure: bus traffic (MB/s, up to ~400) under the Ill, 3St, and 3St-RdEx protocols for Ocean, Radix, and Raytrace with 64 KB caches, split into address/command and data bus traffic.]

Cache Block Size

• Trade-offs in uniprocessors with increasing block size
– reduced cold misses (due to spatial locality)
– increased transfer time
– increased conflict misses (fewer sets)
• Additional concerns in multiprocessors
– parallel programs have less spatial locality
– parallel programs have sharing
– false sharing
– bus contention
• Need to classify misses to understand the impact
– cold misses
– capacity/conflict misses
– true sharing misses
» one processor writes words in a block, invalidating the block in another processor's cache, which that processor later reads
– false sharing misses

Miss Classification

First reference to the memory block by this processor?
• Yes: first access system-wide?
– Yes: 1. pure-cold
– No: written before?
» No: 2. cold
» Yes: modified word(s) accessed during lifetime? No: 3. cold-false-sharing; Yes: 4. cold-true-sharing
• No: what eliminated the last copy?
– Invalidation: is the old copy (state = invalid) still in the cache?
» Yes: modified word(s) accessed during lifetime? No: 7. pure-false-sharing; Yes: 8. pure-true-sharing
» No: modified word(s) accessed during lifetime? No: 5. inval-cap-false-sharing; Yes: 6. inval-cap-true-sharing
– Replacement: has the block been modified since the replacement?
» Yes: modified word(s) accessed during lifetime? No: 9. cap-inval-false-sharing; Yes: 10. cap-inval-true-sharing
» No: modified word(s) accessed during lifetime? No: 11. pure-capacity; Yes: 12. capacity-true-sharing

"Modified word(s) accessed during lifetime" means an access to word(s) within the block that have been modified since the last "essential" (4, 6, 8, 10, 12) miss to this block by this processor.
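The tree translates directly into code. A minimal sketch, where each boolean argument stands for one question in the tree; a real simulator would maintain the per-block, per-processor state needed to answer them:

```python
def classify_miss(first_ref_by_pe, first_ref_systemwide, written_before,
                  eliminated_by_invalidation, old_invalid_copy_present,
                  modified_since_replacement, modified_words_accessed):
    if first_ref_by_pe:
        if first_ref_systemwide:
            return "1. pure-cold"
        if not written_before:
            return "2. cold"
        return ("4. cold-true-sharing" if modified_words_accessed
                else "3. cold-false-sharing")
    if eliminated_by_invalidation:
        if old_invalid_copy_present:   # block never left; only its state changed
            return ("8. pure-true-sharing" if modified_words_accessed
                    else "7. pure-false-sharing")
        return ("6. inval-cap-true-sharing" if modified_words_accessed
                else "5. inval-cap-false-sharing")
    # last copy was eliminated by replacement
    if modified_since_replacement:
        return ("10. cap-inval-true-sharing" if modified_words_accessed
                else "9. cap-inval-false-sharing")
    return ("12. capacity-true-sharing" if modified_words_accessed
            else "11. pure-capacity")

# Example: a previously seen block, invalidated by another writer, still in
# the cache in state I, whose modified words are then read:
print(classify_miss(False, False, True, True, True, False, True))
# -> "8. pure-true-sharing", an essential miss
```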

Breakdown of Miss Rates with Block Size

[Figure: miss rate vs. block size (8 to 256 bytes) for Ocean, Radix, and Raytrace (miss rates up to ~0.12) and for Barnes, LU, and Radiosity (up to ~0.006), each bar broken into upgrade (UPGMR), false sharing (FSMR), true sharing (TSMR), capacity (CAPMR), and cold (COLDMR) components.]

Breakdown (cont.)

[Figure: miss-rate breakdown vs. block size (64 to 256 bytes) for the multiprogrammed workload's Appl-Code, Appl-Data, OS-Code, and OS-Data components, split into UPGMR, FSMR, TSMR, CAPMR, and COLDMR. 1 MB cache.]

Breakdown with 64 KB Caches

[Figure: miss-rate breakdown vs. block size (8 to 256 bytes) for Ocean, Radix, and Raytrace with 64 KB caches (miss rates up to ~0.3), split into the same five components.]

Traffic

[Figure: traffic vs. block size, split into address/command and data bus traffic: bytes/instruction for Barnes, Radiosity, and Raytrace (up to ~0.18); bytes/FLOP for LU and Ocean (up to ~1.8); and bytes/instruction for Radix (up to ~4.5).]

Traffic with 64 KB Caches

[Figure: traffic vs. block size with 64 KB caches, split into address/command and data bus traffic: bytes/instruction for Radix and Raytrace (up to ~14) and bytes/FLOP for Ocean (up to ~6).]

Traffic: SimOS, 1 MB Caches

[Figure: traffic (bytes/instruction, up to ~0.6) vs. block size (64 to 256 bytes) for Appl-Code, Appl-Data, OS-Code, and OS-Data, split into address/command and data bus traffic.]

Making Large Blocks More Effective

• Software
– Improve spatial locality by better data structuring (more later)
– Compiler techniques
• Hardware
– Retain the granularity of transfer but reduce the granularity of coherence
» use subblocks: same tag but different state bits (sketched below)
» one subblock may be valid while another is invalid or dirty
– Reduce both granularities, but prefetch more blocks on a miss
– Proposals for adjustable cache size
– More subtle: delay propagation of invalidations and perform them all at once
» but this can change the consistency model: discussed later in the course
– Use update instead of invalidate protocols to reduce the false sharing effect
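A minimal sketch of the subblock idea: one tag covers the whole (transfer-granularity) block, but each subblock keeps its own coherence state, so a remote write need only invalidate one subblock. The sizes and MSI states are illustrative assumptions:

```python
from dataclasses import dataclass, field

SUBBLOCKS = 4  # e.g. a 64-byte block tracked as 4 x 16-byte subblocks

@dataclass
class CacheLine:
    tag: int
    # One coherence state (here MSI) per subblock, under the shared tag.
    state: list = field(default_factory=lambda: ["I"] * SUBBLOCKS)

    def read_hit(self, sub: int) -> bool:
        """A read hits only if that particular subblock is valid."""
        return self.state[sub] in ("S", "M")

    def remote_write(self, sub: int):
        """Another processor wrote this subblock: invalidate just it."""
        self.state[sub] = "I"

line = CacheLine(tag=0x1234, state=["S"] * SUBBLOCKS)
line.remote_write(1)                       # false sharing touches one subblock...
print(line.read_hit(0), line.read_hit(1))  # ...the others stay readable: True False
```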

Update versus Invalidate

• Much debate over the years: the trade-off depends on sharing patterns
• Intuition:
– If the processors that used the data continue to use it, and writes between uses are few, update should do better
» e.g. producer-consumer pattern
– If those that used the data are unlikely to use it again, or there are many writes between reads, updates are not good
» "pack rat" phenomenon, particularly bad under process migration
» useless updates where only the last one will be used
• Can construct scenarios where one or the other is much better
• Can combine them in hybrid schemes (see text)
– e.g. competitive: observe patterns at runtime and change protocol (sketched below)
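A minimal sketch of the competitive idea, assuming a cached copy invalidates itself after K updates arrive with no intervening local access; K and the structure are illustrative, not the text's exact scheme:

```python
K = 3  # updates tolerated before switching to invalidate-like behavior

class Copy:
    def __init__(self):
        self.valid, self.unused_updates = True, 0

    def local_access(self) -> bool:
        self.unused_updates = 0        # the data is being used: keep updating
        return self.valid

    def remote_update(self, value):
        if not self.valid:
            return                     # already invalid: no update traffic
        self.unused_updates += 1
        if self.unused_updates > K:    # pack-rat copy: stop receiving updates
            self.valid = False

c = Copy()
for v in range(5):
    c.remote_update(v)
print(c.valid)  # False: the copy invalidated itself after K useless updates
```

This bounds the useless update traffic per copy at K transactions, while a copy that is actively read keeps getting the producer-consumer benefit of updates.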

Update vs. Invalidate: Miss Rates

– Lots of coherence misses: updates help
– Lots of capacity misses: updates hurt (data is kept in the cache uselessly)
– Updates seem to help, but this ignores upgrade and update traffic

[Figure: miss rates (%) broken into cold, capacity, true sharing, and false sharing: LU, Ocean, and Raytrace under invalidate (inv), update (upd), and mixed (mix) protocols, up to ~0.6%; Radix, up to ~2.5%.]

Upgrade and Update Rates (Traffic)

– Update traffic is substantial
– The main cause is multiple writes by a processor before a read by another
» many bus transactions, versus one in the invalidation case
» could delay updates or use merging
– Overall, the trend is away from update-based protocols as the default
» bandwidth, complexity, the trend toward large blocks, pack rat under process migration
– Will see later that updates have greater problems for scalable systems

[Figure: upgrade/update rates (%): LU, Ocean, and Raytrace under inv, mix, and upd, up to ~2.5%; Radix, up to ~8%.]

Summary

• An FSM describes the cache coherence algorithm
– many underlying design choices
– prove coherence and consistency
• Evaluation must be based on a sound understanding of workloads
– drive the factors you want to study
– representative workloads
– scaling factors
• Use workload-driven evaluation to resolve architectural questions