
Using Criticality to Attack Performance Bottlenecks

Brian Fields, UC-Berkeley

(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Bottleneck Analysis

Bottleneck Analysis: determining the performance effect of an event on execution time.

An event could be:
• an instruction's execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.

Why is Bottleneck Analysis Important?

Bottleneck Analysis Applications

Run-time Optimization
• Resource arbitration: e.g., how to schedule memory accesses?
• Effective speculation: e.g., which branches to predicate?
• Dynamic reconfiguration: e.g., when to enable hyperthreading?
• Energy efficiency: e.g., when to throttle frequency?

Design Decisions
• Overcoming technology constraints: e.g., how to mitigate the effect of long wire latencies?

Programmer Performance Tuning
• Where have the cycles gone? e.g., which cache misses should be prefetched?

Why is Bottleneck Analysis Hard?

Current state-of-art: event counts

Exe. time = (CPU cycles + Mem. cycles) × Clock cycle time
where: Mem. cycles = Number of cache misses × Miss penalty

[Figure: miss #1 (100 cycles) and miss #2 (100 cycles) overlap in time: 2 misses but only 1 miss penalty]
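A toy arithmetic sketch of the slide's point (all cycle counts hypothetical): with two overlapping misses, the event-count model double-charges the miss penalty.

```python
# Hypothetical cycle counts showing why event counts mislead under
# parallelism: two overlapping 100-cycle misses contribute roughly one
# miss penalty, not two.
cpu_cycles   = 1000
miss_penalty = 100

event_count_model = cpu_cycles + 2 * miss_penalty   # model predicts 1200 cycles
overlapped_actual = cpu_cycles + 1 * miss_penalty   # overlap yields ~1100 cycles

print(event_count_model, overlapped_actual)
```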

Parallelism in systems complicates performance understanding

Parallelism

• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

• Two parallel cache misses

• Two parallel threads

Criticality Challenges

• Cost: How much speedup is possible from optimizing an event?
• Slack: How much can an event be "slowed down" before increasing execution time?
• Interactions: When do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware

Our Approach

Our Approach: Criticality

Critical events affect execution time; non-critical events do not.

Bottleneck Analysis: determining the performance effect of an event on execution time.

Defining criticality

Need performance sensitivity:
• slowing down a "critical" event should slow down the entire program
• speeding up a "noncritical" event should leave execution time unchanged

[Standard Waterfall Diagram: time axis, cycles 1-15; each instruction (R5 = 0; R3 = 0; R1 = #array + R3; R6 = ld[R1]; R3 = R3 + 1; R5 = R6 + R5; cmp R6, 0; bf L1; R5 = R5 + 100; R0 = R5; Ret R0) passes through Fetch (F), Execute (E), and Commit (C)]

[Same waterfall diagram annotated with dependence edges; edge types: Fetch BW, ROB, Data Dep, Branch Misp. (MISP)]

[Waterfall diagram with edge weights added: each dependence edge carries its latency in cycles]

Convert to Graph

[The annotated waterfall becomes a weighted dependence graph: each instruction contributes F, E, and C nodes, connected by the weighted edges above]
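To make the model concrete, here is a minimal sketch (not the authors' tool) of the graph conversion and a longest-path pass; the instructions, edges, and weights below are illustrative.

```python
# Each dynamic instruction i becomes F/E/C nodes; weighted edges model
# fetch bandwidth, execution latency, data dependences, and in-order
# commit. The critical path is the longest path through this DAG.
from collections import defaultdict

edges = defaultdict(list)                  # u -> [(v, latency)]

def add(u, v, w):
    edges[u].append((v, w))

N = 3
for i in range(N):
    add(('F', i), ('E', i), 1)             # fetch -> execute
    add(('E', i), ('C', i), 1)             # execute -> commit
    if i > 0:
        add(('F', i - 1), ('F', i), 1)     # in-order fetch (bandwidth)
        add(('C', i - 1), ('C', i), 1)     # in-order commit
add(('E', 0), ('E', 1), 3)                 # a data dependence through a 3-cycle op

def critical_path(edges):
    nodes = set(edges) | {v for es in edges.values() for v, _ in es}
    indeg = {n: 0 for n in nodes}
    for es in edges.values():
        for v, _ in es:
            indeg[v] += 1
    dist = {n: 0 for n in nodes}           # longest-path arrival time
    pred = {n: None for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:                           # relax in topological order
        u = ready.pop()
        for v, w in edges.get(u, []):
            if dist[u] + w > dist[v]:
                dist[v], pred[v] = dist[u] + w, u
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    end = max(dist, key=dist.get)          # typically the last commit node
    path, n = [], end
    while n is not None:
        path.append(n)
        n = pred[n]
    return path[::-1], dist[end]

path, length = critical_path(edges)
print(length, path)
```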


Smaller graph instance

[Five-instruction graph with F, E, C nodes; one edge is non-critical ("but how much slack?"), and a critical icache miss edge is flagged ("but how costly?")]

Add "hidden" constraints

[The same graph with hidden machine constraints added as extra edges. With the constraints in place, both questions can be answered from the graph: Slack = 13 – 7 = 6 cycles; Cost = 13 – 7 = 6 cycles]

Slack "sharing"

[Two edges each show Slack = 6 cycles: we can delay one edge by 6 cycles, but not both!]

Machine Imbalance

[Plot: Percent of Dynamic Instructions (0-100) vs. Number of Cycles of Slack (0-100) for perl, with "apportioned" and "global" curves. ~80% of insts have at least 5 cycles of apportioned slack]

Criticality Challenges

• Cost: How much speedup is possible from optimizing an event?
• Slack: How much can an event be "slowed down" before increasing execution time?
• Interactions: When do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware

Simple criticality not always enough

Sometimes events have nearly equal criticality: e.g., miss #1 (99 cycles) in parallel with miss #2 (100 cycles).

Want to know:
• how critical is each event?
• how far from critical is each event?

Actually, even that is not enough.

Actually, even that is not enough

Our solution: measure interactions

Two parallel cache misses

miss #1 (99)

miss #2 (100)Cost(miss #1) = 0

Cost(miss #2) = 1

Cost({miss #1, miss #2}) = 100

Aggregate cost > Sum of individual costs Parallel interaction100 0 +

1icost = aggregate cost – sum of individual costs

= 100 – 0 – 1 = 99

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost: parallel interaction (miss #1 and miss #2 overlap)
2. Zero icost: independent (miss #1 ... miss #2)
3. Negative icost: ?

Negative icost

Two serial cache misses (data dependent), miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with an ALU latency chain of 110 cycles:

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90

Negative icost: a serial interaction.

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost: parallel interaction
2. Zero icost: independent
3. Negative icost: serial interaction (e.g., the ALU latency chain spanning miss #1 and miss #2)

Other events involved in serial interactions: branch mispredicts, fetch BW, load-replay traps, LSQ stalls.

Why care about serial interactions?

[Example: ALU latency (110 cycles) in parallel with serial misses #1 and #2 (100 cycles each)]

Reason #1: We are over-optimizing! Prefetching miss #2 doesn't help if miss #1 is already prefetched (but the overhead still costs us).

Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
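The icost bookkeeping on the two worked examples above fits in a few lines; this sketch just re-runs the slide's numbers.

```python
# icost = aggregate cost - sum of individual costs, where cost(S) is the
# cycles saved by idealizing all events in set S at once.
def icost(cost_a, cost_b, cost_both):
    return cost_both - cost_a - cost_b

print(icost(0, 1, 100))    #  99 > 0: parallel interaction
print(icost(90, 90, 90))   # -90 < 0: serial interaction
```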

Icost Case Study: Deep pipelines

Looking for serial interactions! The focus is the Dcache (DL1), whose access latency grows (here from 1 to 4 cycles) as the pipeline deepens.


Icost Breakdown (6 wide, 64-entry window)

              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0      6.0     15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss      0.1      0.0      0.4
...            ...      ...      ...
Total         100.0    100.0    100.0

Icost Case Study: Deep pipelines

[Graph fragment for instructions i1-i6 with F, E, C nodes; DL1 access edges (latency 4) and window edges highlighted, with per-edge latencies]

Criticality Challenges

• Cost: How much speedup is possible from optimizing an event?
• Slack: How much can an event be "slowed down" before increasing execution time?
• Interactions: When do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware

Exploit in Hardware

• Criticality Analyzer: online, fast feedback; limited to critical/not critical
• Replacement for Performance Counters: requires offline analysis; constructs entire graph

Only last-arriving edges can be critical

Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last
arrives last ⇏ critical (a dependence that resolved early can still arrive last)

Determining last-arrive edges

Observe events within the machine:

last_arrive[F] = EF if branch misp.; CF if ROB stall; FF otherwise
last_arrive[E] = FE if data ready on fetch; EE otherwise (observe the arrival order of the operands)
last_arrive[C] = EC if the commit pointer is delayed; CC otherwise
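A minimal sketch of these rules as code; the observed signals are hypothetical field names, not a real hardware interface.

```python
# Classify the last-arrive in-edge for each node of one instruction.
def last_arrive_F(obs):
    if obs['branch_mispredict']:   return 'EF'   # fetch redirected by execute
    if obs['rob_stall']:           return 'CF'   # fetch waits for commit to free ROB
    return 'FF'                                  # in-order fetch bandwidth

def last_arrive_E(obs):
    if obs['data_ready_on_fetch']: return 'FE'   # no operand arrived after fetch
    return 'EE'                                  # last-arriving operand's producer

def last_arrive_C(obs):
    if obs['commit_ptr_delayed']:  return 'EC'   # commit waits on own execute
    return 'CC'                                  # in-order commit

print(last_arrive_F({'branch_mispredict': False, 'rob_stall': True}))  # CF
```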

The last-arrive rule: the critical path (CP) consists only of "last-arrive" edges.

Prune the graph: only last-arrive edges need to go in the graph; no other edges could be on the CP.

…and we've found the critical path! Backward-propagate along last-arrive edges, starting from the newest node. The CP is found by observing only last-arrive edges, but this still requires constructing the entire graph.
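Assuming each node records its single last-arriving in-edge, the backward walk can be sketched as follows (node encoding illustrative):

```python
# Recover the CP by walking backward from the newest commit node along
# last-arrive edges. `last_arrive` maps node -> last-arriving predecessor
# (None at the start of the program).
def backward_critical_path(last_arrive, newest_commit):
    path, node = [], newest_commit
    while node is not None:
        path.append(node)
        node = last_arrive.get(node)      # unique last-arriving predecessor
    return path[::-1]                     # oldest-first critical path

la = {('C', 2): ('C', 1), ('C', 1): ('E', 1), ('E', 1): ('E', 0),
      ('E', 0): ('F', 0), ('F', 0): None}
print(backward_critical_path(la, ('C', 2)))
```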

Step 2. Reducing storage requirements

CP is a "long" chain of last-arrive edges: the longer a given chain of last-arrive edges, the more likely it is part of the CP.

Algorithm: find sufficiently long last-arrive chains
1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
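A sketch of this token heuristic, under an assumed helper last_arrive_successors(u) that yields the consumers for which u's edge arrived last:

```python
# Plant a token at node n, push it forward only along last-arrive edges,
# and declare n critical if the token is still alive after `horizon` steps.
def token_survives(n, last_arrive_successors, horizon=500):
    frontier = {n}                         # 1. plant token at node n
    for _ in range(horizon):               # 2./3. propagate several hundred cycles
        frontier = {v for u in frontier for v in last_arrive_successors(u)}
        if not frontier:
            return False                   # token died: n assumed non-critical
    return True                            # 4. token alive: n assumed critical
```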

Online Criticality Detection

[Forward-propagate the token along last-arrive edges: tokens planted on non-critical nodes "die"; if the token survives, the node is critical]

Putting it all together

[Block diagram: last-arrive edges (producer retired instr) from the OOO core feed the token-passing analyzer, which trains a PC-indexed CP prediction table (training path); the prediction path answers "E-critical?" queries by PC]

Results

Performance (Speed)
• Scheduling in clustered machines: 10% speedup
• Selective value prediction
• Deferred scheduling (Crowe et al.): 11% speedup
• Heterogeneous cache (Rakvic et al.): 17% speedup

Energy
• Non-uniform machine, fast and slow pipelines: ~25% less energy
• Instruction queue resizing (Sasanka et al.)
• Multiple frequency scaling (Semeraro et al.): 19% less energy with 3% less performance
• Selective pre-execution (Petric et al.)

Exploit in Hardware

• Criticality Analyzer: online, fast feedback; limited to critical/not critical
• Replacement for Performance Counters: requires offline analysis; constructs entire graph

Profiling goal

Goal: construct the graph over many dynamic instructions
Constraint: can only sample sparsely

Genome sequencing analogy

["Shotgun" genome sequencing: a DNA strand is sampled as many short fragments, and overlaps among the samples are found to reassemble the whole strand]

Mapping "shotgun" to our situation: across many dynamic instructions, the profiler collects small sample fragments annotated with events (icache miss, dcache miss, branch misp., no event) and pieces them together by their overlaps.

Profiler hardware requirements

[Figure: collected sample fragments are compared until a match is found ("Match!"), allowing them to be pieced together]

Sources of error

Error Source                           Gcc      Parser   Twolf
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Errors in graph construction           5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Total                                  12.2 %   14.0 %   8.9 %

Conclusion: Grand Challenges

• Cost: How much speedup is possible from optimizing an event?
• Slack: How much can an event be "slowed down" before increasing execution time?
• Interactions: When do multiple events need to be optimized simultaneously? When do we have a choice?

Addressed by: modeling, the token-passing analyzer, parallel interactions, serial interactions, and shotgun profiling.

Conclusion: Bottleneck Analysis Applications

Run-time Optimization
• Effective speculation: selective value prediction
• Resource arbitration: scheduling and steering in clustered processors
• Dynamic reconfiguration: resize instruction window
• Energy efficiency: non-uniform machines

Design Decisions
• Overcoming technology constraints: helped cope with high-latency dcache

Programmer Performance Tuning
• Where have the cycles gone? Measured cost of cache misses/branch mispredicts

Outline

Simple Criticality
• Definition (ISCA '01)
• Detection (ISCA '01)
• Application (ISCA '01-'02)

Advanced Criticality
• Interpretation (MICRO '03): what types of interactions are possible?
• Hardware Support (MICRO '03, TACO '04): enhancement to performance counters






Backup Slides

Related Work

Criticality Prior Work

Critical-Path Method, PERT charts
• Developed for the Navy's "Polaris" project (1957)
• Used as a project management tool
• Simple critical-path, slack concepts

"Attribution" Heuristics
• Rosenblum et al. (SOSP-1995), and many others
• Marks instruction at head of ROB as critical, etc.
• Empirically, has limited accuracy
• Does not account for interactions between events

Related Work: Microprocessor Criticality

• Latency tolerance analysis: Srinivasan and Lebeck (MICRO-1998)
• Heuristics-driven criticality predictors: Tune et al. (HPCA-2001); Srinivasan et al. (ISCA-2001)
• "Local" slack detector: Casmira and Grunwald (Kool Chips Workshop, 2000)
• ProfileMe with pair-wise sampling: Dean et al. (MICRO-1997)

Unresolved Issues

Alternative I: Addressing Unresolved Issues

Modeling and Measurement
• What resources can we model effectively? (difficulty with mutual-exclusion-type resources, e.g., ALUs)
• Efficient algorithms
• Release tool for measuring cost/slack

Hardware
• Detailed design for criticality analyzer
• Shotgun profiler simplifications: gradual path from counters

Optimization
• Explore heuristics for exploiting interactions

Alternative II: Chip-Multiprocessors

Design Decisions
• Should each core support out-of-order execution?
• Should SMT be supported?
• How many processors are useful?
• What is the effect of inter-processor latency?

Programmer Performance Tuning: parallelizing applications
• What makes a good division into threads?
• How can we find them automatically, or at least help programmers to find them?

Unresolved issues: Modeling and Measurement

• What resources can we model effectively? Difficulty with mutual-exclusion-type resources (ALUs); in other words, unanticipated side effects.

[Original Execution: graph for the sequence 1. ld r2, [Mem] (cache miss); 2. add r3 ← r2 + 1; 3. ld r4, [Mem] (cache miss); 4. add r6 ← r4 + 1, with F, E, C nodes and 10-cycle miss edges; no contention]

[Altered Execution, to compute the cost of inst #3's cache miss: with that miss now a cache hit, the two adds contend for the adder; the resulting contention edge ("should not be here") yields an incorrect critical path. There was no contention in the original execution]

Unresolved issues: Modeling and Measurement (cont.)

• How should processor policies be modeled? (relationship to the icost definition)
• Efficient algorithms for measuring icosts (pairs of events, etc.)
• Release tool for measuring cost/slack

Unresolved issues: Hardware and Optimization

Hardware
• Detailed design for criticality analyzer: help to convince industry-types to build it
• Shotgun profiler simplifications: gradual path from counters

Optimization
• Explore icost optimization heuristics: icosts are difficult to interpret

Validation

Validation: can we trust our model?

Run two simulations:
• Reduce CP latencies: expect "big" speedup
• Reduce non-CP latencies: expect no speedup

[Plot: Speedup per Cycle Reduced (0-1) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa, comparing "Reducing CP Latencies" vs. "Reducing non-CP Latencies"]

Validation: two steps

1. Increase latencies of instructions by their apportioned slack, for three apportioning strategies: 1) latency+1; 2) 5 cycles to as many instructions as possible; 3) 12 cycles to as many loads as possible
2. Compare to baseline (no delays inserted)

[Plot: Percent of Execution Time for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for baseline, latency + 1, 12 cycles to loads, five cycles. Worst case: inaccuracy of 0.6%]

Slack Measurements

Three slack variants

Local slack: # cycles latency can be increased without delaying any subsequent instructions
Global slack: # cycles latency can be increased without delaying the last instruction in the program
Apportioned slack: distribute global slack among instructions using an apportioning strategy
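A sketch of local vs. global slack on the DAG model, assuming a topological node order `topo` (program end last), `edges` as u -> [(v, w)], and `earliest` holding longest-path arrival times:

```python
# Standard critical-path-method slack: latest[] is the latest firing time
# that does not delay the program's end.
def slacks(topo, edges, earliest):
    end_time = earliest[topo[-1]]
    latest = {n: end_time for n in topo}
    for u in reversed(topo):
        for v, w in edges.get(u, []):
            latest[u] = min(latest[u], latest[v] - w)
    local, glob = {}, {}
    for u in topo:
        for v, w in edges.get(u, []):
            local[(u, v)] = earliest[v] - (earliest[u] + w)   # don't delay v
            glob[(u, v)]  = latest[v]  - (earliest[u] + w)    # don't delay program
    return local, glob

# Apportioned slack would then split each edge's global slack with the
# rest of its path, so the per-instruction shares are safe together.
```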

Slack measurements

[Plot: Percent of Dynamic Instructions vs. Number of Cycles of Slack (perl), "local" curve. ~21% of insts have at least 5 cycles of local slack]

[Same plot, adding the "global" curve: ~90% of insts have at least 5 cycles of global slack]

[Same plot, adding the "apportioned" curve: ~80% of insts have at least 5 cycles of apportioned slack]

A large amount of exploitable slack exists.

Application-centered Slack Measurements

Load slack: can we tolerate a long-latency L1 hit?

• design: wire-constrained machine, e.g. Grid
• non-uniformity: multi-latency L1
• apportioning strategy: apportion ALL slack to load instructions

Apportion all slack to loads

[Plot: Percent of Dynamic Loads vs. Number of Cycles of Slack on Load Instructions; curves for gcc, perl, gzip. Most loads can tolerate an L2 cache hit]

Multi-speed ALUs: can we tolerate ALUs running at half frequency?

• design: fast/slow ALUs
• non-uniformity: multi-latency execution, bypass
• apportioning strategy: give slack equal to original latency + 1

Latency+1 apportioning

[Plot: Percent of Dynamic Instructions for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average. Most instructions can tolerate doubling their latency]

Slack Locality and Prediction

Predicting slack

Two steps to PC-indexed, history-based prediction:
1. Measure slack of a dynamic instruction
2. Store in array indexed by PC of static instruction

Two requirements:
1. Locality of slack
2. Ability to measure slack of a dynamic instruction
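A minimal sketch of such a predictor; the table size, hash, and min-update policy are illustrative choices, not the paper's exact design.

```python
# PC-indexed, history-based slack predictor: a small direct-mapped table
# keeps the minimum slack observed for each static instruction.
class SlackPredictor:
    def __init__(self, entries=4096):
        self.entries = entries
        self.table = [None] * entries

    def train(self, pc, measured_slack):
        i = pc % self.entries
        old = self.table[i]
        self.table[i] = measured_slack if old is None else min(old, measured_slack)

    def predict(self, pc, threshold=5):
        s = self.table[pc % self.entries]
        return s is not None and s >= threshold   # ">= threshold cycles of slack"
```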

Locality of slack

[Plot: Percent of (weighted) static instructions for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; "ideal" compared with predictors capturing 90%, 95%, and 100% of instances]

A PC-indexed, history-based predictor can capture most of the available slack.

Slack Detector: delay and observe (effective for a hardware predictor)

Problem #1: iterating repeatedly over the same dynamic instruction
Solution: only sample each dynamic instruction once

Problem #2: determining if overall execution time increased
Solution: check if the delay made the instruction critical

Slack Detector

Goal: determine whether an instruction has n cycles of slack
1. Delay the instruction by n cycles
2. Check if critical (via the critical-path analyzer)
3. No: the instruction has n cycles of slack
4. Yes: the instruction does not have n cycles of slack
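A sketch of delay-and-observe as a single test; `delay_once` and `became_critical` stand in for assumed hardware hooks.

```python
# To test whether a sampled dynamic instruction has n cycles of slack,
# delay it by n and ask the online criticality analyzer whether the
# delay made it critical.
def has_n_cycles_of_slack(inst, n, delay_once, became_critical):
    delay_once(inst, n)               # sample each dynamic instruction once
    return not became_critical(inst)  # still non-critical => >= n cycles of slack
```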

Slack Application

Fast/slow cluster microarchitecture

[Diagram: Fetch + Rename steers instructions to a fast 3-wide cluster or a slow 3-wide cluster (each with a window (WIN), registers, and ALUs), connected by a bypass bus to the data cache]

Aggressive non-uniform design:
• Higher execution latencies
• Increased (cross-domain) bypass latency
• Decreased effective issue bandwidth

Saves ~37% core power.

Picking bins for the slack predictor

Two decisions:
1. Steer to fast/slow cluster
2. Schedule with high/low priority within a cluster

Use an implicit slack predictor with four bins (sketched below):
1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority

Slack-based policies

[Plot: Normalized IPC for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for 2 fast high-power clusters, the slack-based policy, and reg-dep steering]

10% better performance from hiding non-uniformities.

CMP case study

Multithreaded Execution Case Study

Two questions:
• How should a program be divided into threads? What makes a good cutpoint? How can we find them automatically, or at least help programmers find them?
• What should a multiple-core design look like? Should each core support out-of-order execution? Should SMT be supported? How many processors are useful? What is the effect of inter-processor latency?

Parallelizing an application

Why parallelize a single-thread application?
• Legacy code, large code bases
• Difficult-to-parallelize apps: interpreted code, kernels of operating systems
• Desire to use better programming languages: Scheme, Java instead of C/C++

Parallelizing an application

Simplifying assumption: the program binary is unchanged.

Simplified problem statement: given a program of length L, find a cutpoint that divides the program into two threads that provides maximum speedup.

Must consider: data dependences, execution latencies, control dependences, proper load balancing.

Parallelizing an application

Naive solution: try every possible cutpoint (sketched below).

Our solution: efficiently determine the effect of every possible cutpoint by modeling execution before and after every cut.
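A sketch of the naive baseline only (the deck's solution does this more efficiently): rebuild the graph with the two threads' in-order constraints severed at instruction k and re-measure the longest path. `build_graph` and `graph_length` are assumed helpers around the DAG model sketched earlier.

```python
# Brute-force cutpoint evaluation over a window of n_insts instructions.
def best_cutpoint(build_graph, graph_length, n_insts):
    base = graph_length(build_graph(cut=None))   # uncut execution time
    best = (None, 0)
    for k in range(1, n_insts):                  # try every cutpoint
        gain = base - graph_length(build_graph(cut=k))
        if gain > best[1]:
            best = (k, gain)
    return best                                  # (cutpoint, cycles saved)
```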

Solution

[Graph: F, E, C node chains with edge weights, from the first instruction to the last instruction; a cut is modeled at "start"]

Parallelizing an application

Considerations:
• Synchronization overhead: add latency to EE edges
• Synchronization may involve turning EE edges into EF edges
• Scheduling of threads: additional CF edges

Challenges:
• State behavior (one thread spread over multiple processors): caches, branch predictor
• Control behavior: limits where cutpoints can be made

Parallelizing an application

More general problem: divide a program into N threads (NP-complete).

Icost can help: icost(p1, p2) << 0 implies p1 and p2 are redundant; action: move p1 and p2 further apart.

Preliminary Results

Experimental Setup
• Simulator, based loosely on SimpleScalar
• Alpha SpecInt binaries

Procedure
1. Assume execution trace is known
2. Look at each 1k run of instructions
3. Test every possible cutpoint using 1k graphs

Dynamic Cutpoints

[Plot: Cost Distribution of Dynamic Cutpoints; Cumulative Pct. of Cutpoints vs. Execution time reduction (cycles) for bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr]

Only 20% of cuts yield benefits of > 20 cycles.

Usefulness of cost-based policy

[Plot: Speedup % from parallelizing programs for a two-processor system, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr; fixed-interval vs. simple cost-based cutpoint selection]

Static Cutpoints

[Plot: Cost Distribution of Static Cutpoints; Cumulative Pct. of Instructions vs. Avg. per-dynamic-instance Cost of Static Instructions for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr]

Up to 60% of cuts yield benefits of > 20 cycles.

Future Avenues of Research

• Map cutpoints back to actual code: compare automatically generated cutpoints to human-generated ones; see what the performance gains are in a simulator, as opposed to just on the graph
• Look at the effect of synchronization operations: what additional overhead do they introduce?
• Deal with state and control problems: might need some technique outside of the graph


CMP design study

What we can do:
• Try out many configurations quickly: dramatic changes in architecture are often only small changes in the graph
• Identify bottlenecks, especially interactions

CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure:
• model execution with different configurations: adjust CD edges
• compute breakdowns: notice resources/events interacting with CD edges

CMP design study: Out-of-orderness

[The same F/E/C execution graph as in the Solution slide, with CD (window) edges adjusted]

CMP design study: Out-of-orderness

Results summary
• Single-core: performance taps out at 256 entries
• CMP: performance gains up through 1024 entries; some benchmarks see gains up to 16k entries

Why more beneficial? Use breakdowns to find out...

CMP design study: Out-of-orderness

Components of window cost:
• cache misses holding up retirement?
• long strands of data dependencies?
• predictable control flow?

Icost breakdowns give quantitative and qualitative answers.

CMP design study: Out-of-orderness

cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0

[Stacked-bar illustration of window cost (0-100%): the ALU and cache-miss components shown as independent, as a parallel interaction, and as a serial interaction, with an "equal" split marked]

Summary of Preliminary Results

icost(window, ALU operations) << 0
• primarily communication between processors
• window often stalled waiting for data

Implications
• larger window may be overkill
• need a cheap non-blocking solution, e.g., continual-flow pipelines

CMP design study: SMT?

Benefits
• reduced thread start-up latency
• reduced communication costs

How we could help
• distribution of thread lengths
• breakdowns to understand the effect of communication

[Timeline sketch: start times of threads #1 and #2 with and without SMT]

CMP design study: How many processors?

CMP design study: Other Questions

What is the effect of inter-processor communication latency? Understand hidden vs. exposed communication.

Allocating processors to programs: methodology for the O/S to better assign programs to processors.

Waterfall To Graph Story

[Build sequence, repeating the earlier example in full: Standard Waterfall Diagram → Annotated with Dependence Edges (Fetch BW, Data Dep, ROB, Branch Misp.) → Edge Weights Added → Convert to Graph → Find Critical Path → Add Non-last-arriving Edges → Graph Alterations (branch misprediction made correct)]

Token-passing analyzer

Step 1. Observing

Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last
arrives last ⇏ critical (a dependence that resolved early can still arrive last)

Determining last-arrive edges: observe events within the machine, using the last-arrive rules for the F, E, and C nodes given earlier.

Last-arrive edges: a CPU stethoscope

[Figure: the CPU observed through its last-arrive edges (EC, EE, FE, CF, FF, EF, CC)]

[Figure: the execution graph reduced to last-arrive edges, first with weights, then with latencies removed: explicit weights are not needed]


Step 2. Efficient analysis: find sufficiently long last-arrive chains with the token-passing algorithm described earlier (plant a token into a node n, propagate it forward only along last-arrive edges, check for it after several hundred cycles, and assume n critical if the token is alive).

Token-passing example
1. Plant token
2. Propagate token
3. Is token alive?
4. Yes: train critical

Found the CP without constructing the entire graph.

Implementation: a small SRAM array

[Diagram: a token queue, written with the last-arrive producer node (inst id, type) and read against the committed (inst id, type)]

Size of SRAM: 3 bits × ROB size < 200 bytes. Simply replicate for additional tokens.


Scheduling and Steering

Case Study #1: Clustered architectures

[Diagram: instructions are steered into per-cluster issue windows, then scheduled]

Configurations compared:
1. Current state of art (Base)
2. Base + CP Scheduling
3. Base + CP Scheduling + CP Steering

Current State of the Art

[Plot: Normalized IPC for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; unclustered vs. 2-cluster vs. 4-cluster at constant issue width and clock frequency]

Avg. clustering penalty for 4 clusters: 19%.

CP Optimizations: Base + CP Scheduling

[Plot: Normalized IPC, same benchmarks and configurations]

CP Optimizations: Base + CP Scheduling + CP Steering

[Plot: Normalized IPC, same benchmarks and configurations]

Avg. clustering penalty reduced from 19% to 6%.

Token-passing Vs. Heuristics

Local Vs. Global Analysis

[Plot: Speedup for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; oldest-uncommitted, oldest-unissued, and token-passing policies]

Previous CP predictors made local, resource-sensitive predictions (HPCA 01, ISCA 01); CP exploitation seems to require global analysis.

Icost case study

Icost Case Study: Deep pipelines

Deep pipelines cause long latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, ...

But we can often mitigate them indirectly. Assume a 4-cycle DL1 access; how to mitigate? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!



Vortex Breakdowns, enlarging the window

              64       128      256
DL1           25.8     8.9      3.9
DL1+window    -24.5    -7.7     -2.6
DL1+bw        15.5     16.7     13.2
DL1+bmisp     -0.3     -0.6     -0.8
DL1+dmiss     -1.4     -2.1     -2.8
DL1+alu       -4.7     -2.5     -0.4
DL1+imiss      0.4      0.5      0.3
...            ...      ...      ...
Total         100.0    80.8     75.0

Shotgun Profiling


Offline Profiler Algorithm

[If a detailed sample's context matches a position in the long sample, then it is pieced into the long sample]

Design issues: identify the microexecution context
• Choosing signature bits
• Determining PCs (for better detailed-sample matching)

[Figure: a long sample with a start PC and offsets 12, 16, 20, 24, 56, 60, ...; for a branch, encode the taken/not-taken bit in the signature]
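A sketch of the matching step, assuming each sample record carries a start PC and a branch-outcome bit string as its signature (the record layout is illustrative, not the paper's):

```python
# Anchor each detailed sample into the long sample wherever its context
# signature (start PC + branch taken/not-taken bits) matches.
def match_samples(long_sample, detailed_samples):
    by_sig = {}
    for d in detailed_samples:
        by_sig.setdefault((d['start_pc'], d['branch_bits']), []).append(d)
    placed = []
    for pos, frag in enumerate(long_sample):
        sig = (frag['start_pc'], frag['branch_bits'])
        for d in by_sig.get(sig, []):
            placed.append((pos, d))        # "Match!": anchor the detail here
    return placed
```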

Sources of error

Error Source                           Gcc      Parser   Twolf
Building graph fragments               5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Total                                  12.2 %   14.0 %   8.9 %

Icost vs. Sensitivity Study

Compare Icost and Sensitivity Study

Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.

[Graph fragment for instructions i1-i6 with F, E, C nodes and DL1 access edges]

[Plot: Speedup vs. ROB size (64-256) for DL1 latencies 1, 2, 3, 4, 5, and 10 cycles]

Compare Icost and Sensitivity Study

Sensitivity Study Advantages
• More information: e.g., concave or convex curves

Interaction Cost Advantages
• Easy (automatic) interpretation: sign and magnitude have well-defined meanings
• Concise communication: "DL1 and ROB interact serially"

Outline

• Definition (ISCA '01): what does it mean for an event to be critical?
• Detection (ISCA '01): how can we determine what events are critical?
• Interpretation (MICRO '03, TACO '04): what does it mean for two events to interact?
• Application (ISCA '01-'02, TACO '04): how can we exploit criticality in hardware?

Our solution: measure interactions

Two parallel cache misses (each 100 cycles):

Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 0): a parallel interaction.

icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100



Criticality Analyzer (ISCA ‘01)

Goal
• Detect criticality of dynamic instructions

Procedure
1. Observe last-arriving edges (uses simple rules)
2. Propagate a token forward along last-arriving edges (at worst, a read-modify-write sequence to a small array)
3. If the token dies, non-critical; otherwise, critical

Slack Analyzer (ISCA ‘02)

Goal
• Detect likely slack of static instructions

Procedure
1. Delay the instruction by n cycles
2. Check if critical (via the critical-path analyzer): No, the instruction has n cycles of slack; Yes, it does not

Shotgun Profiling (TACO ‘04)

Goal
• Create representative graph fragments

Procedure
• Enhance ProfileMe counters with context
• Use context to piece together counter samples
