Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)


Post on 27-Dec-2015


TRANSCRIPT

Page 1: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Using Criticality to Attack Performance Bottlenecks

Brian Fields, UC-Berkeley

(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Page 2:

Bottleneck Analysis

Bottleneck Analysis: Determining the performance effect of an event on execution time

An event could be:
• an instruction’s execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.

Page 3:

Why is Bottleneck Analysis Important?

Page 4:

Bottleneck Analysis Applications

Run-time Optimization
• Resource arbitration
  • e.g., how to schedule memory accesses?
• Effective speculation
  • e.g., which branches to predicate?
• Dynamic reconfiguration
  • e.g., when to enable hyperthreading?
• Energy efficiency
  • e.g., when to throttle frequency?

Design Decisions
• Overcoming technology constraints
  • e.g., how to mitigate the effect of long wire latencies?

Programmer Performance Tuning
• Where have the cycles gone?
  • e.g., which cache misses should be prefetched?

Page 5:

Why is Bottleneck Analysis Hard?

Page 6:

Current state-of-art

Event counts:
Exe. time = (CPU cycles + Mem. cycles) × Clock cycle time
where:
Mem. cycles = Number of cache misses × Miss penalty

miss 1 (100 cycles)
miss 2 (100 cycles)

2 misses but only 1 miss penalty
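The gap between the event-count formula and what actually happens with two overlapping misses can be sketched in a few lines of Python. This is only an illustration: the function names and the assumption that both misses are issued in the same cycle are not from the slides.

```python
def event_count_mem_cycles(num_misses, miss_penalty):
    """Event-count model: every miss is charged a full miss penalty."""
    return num_misses * miss_penalty


def overlapped_mem_cycles(misses):
    """Cycles during which at least one miss is outstanding.

    Each miss is a (start_cycle, latency) pair. Cycles shared by
    overlapping misses are charged only once.
    """
    busy = 0
    covered_until = None
    for start, latency in sorted(misses):
        end = start + latency
        if covered_until is None or start >= covered_until:
            # No overlap with earlier misses: charge the full latency.
            busy += latency
            covered_until = end
        elif end > covered_until:
            # Partial overlap: charge only the uncovered tail.
            busy += end - covered_until
            covered_until = end
    return busy


# Two 100-cycle misses issued in the same cycle (assumed timing).
parallel_misses = [(0, 100), (0, 100)]
counted = event_count_mem_cycles(len(parallel_misses), 100)
exposed = overlapped_mem_cycles(parallel_misses)
print(counted, exposed)
```

Under these assumptions the event-count model charges two penalties while only one is exposed, which is the mismatch the slide points at.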

Page 7:

Parallelism in systems complicates performance understanding

Parallelism

• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

• Two parallel cache misses

• Two parallel threads

Page 8:

Criticality Challenges

• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware

Page 9:

Our Approach

Page 10:

Our Approach: Criticality

Critical events affect execution time, non-critical do not

Bottleneck Analysis: Determining the performance effect of an event on execution time

Page 11:

Defining criticality

Need Performance Sensitivity

• slowing down a “critical” event should slow down the entire program

• speeding up a “noncritical” event should leave execution time unchanged

Page 12:

Standard Waterfall Diagram

[Figure: waterfall diagram over cycles 1-15. Each instruction below is drawn as a row with F, E, and C stages.]

R5 = 0
R3 = 0
R1 = #array + R3
R6 = ld[R1]
R3 = R3 + 1
R5 = R6 + R5
cmp R6, 0
bf L1
R5 = R5 + 100
R0 = R5
Ret R0

Page 13:

Annotated with Dependence Edges

[Figure: the waterfall diagram from Page 12 with dependence edges drawn between stages; a branch mispredict is marked (MISP).]

Page 14:

Annotated with Dependence Edges

[Figure: the waterfall diagram from Page 12 with each dependence edge labeled by type.]

Edge types: Fetch BW, ROB, Data Dep, Branch Misp.

Page 15:

Edge Weights Added

[Figure: the waterfall diagram from Page 12 with a numeric weight on each dependence edge.]

Page 16:

Convert to Graph

[Figure: the waterfall diagram from Page 12 redrawn as a dependence graph. Each instruction contributes F, E, and C nodes, connected by weighted dependence edges.]

Page 18:

Smaller graph instance

[Figure: a smaller dependence graph with F, E, and C nodes and weighted edges.]

Callouts:
• Non-critical, but how much slack?
• Critical Icache miss, but how costly?

Page 19:

Add “hidden” constraints

[Figure: the smaller dependence graph with additional edges for the “hidden” constraints.]

Callouts:
• Non-critical, but how much slack?
• Critical Icache miss, but how costly?

Page 20:

Add “hidden” constraints

[Figure: the smaller dependence graph with “hidden” constraint edges; the two callouts are now answered.]

Non-critical edge: Slack = 13 – 7 = 6 cycles
Critical Icache miss: Cost = 13 – 7 = 6 cycles

Page 21:

Slack “sharing”

[Figure: the smaller dependence graph with two edges each marked “Slack = 6 cycles”.]

Can delay one edge by 6 cycles, but not both!
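The slack, cost, and slack-sharing ideas can be reproduced on a tiny weighted graph. The sketch below is an illustration under assumptions: the four-edge graph, the node names, and the choice to idealize an edge by setting its weight to 0 are all hypothetical, chosen so that the critical path is 13 cycles and the competing path is 7 cycles, as in the slides.

```python
from collections import defaultdict


def longest_from(edges, source):
    """Longest-path distance from `source` to every node of a DAG."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    dist = {n: float("-inf") for n in nodes}
    dist[source] = 0
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:  # Kahn-style topological traversal
        u = ready.pop()
        for v, w in succ[u]:
            dist[v] = max(dist[v], dist[u] + w)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return dist


def edge_slack(edges, i, source, sink):
    """Critical-path length minus the longest path through edge i."""
    fwd = longest_from(edges, source)
    bwd = longest_from([(v, u, w) for u, v, w in edges], sink)
    u, v, w = edges[i]
    return fwd[sink] - (fwd[u] + w + bwd[v])


def edge_cost(edges, i, source, sink, ideal=0):
    """Speedup obtained by idealizing edge i (weight replaced by `ideal`)."""
    u, v, _ = edges[i]
    idealized = edges[:i] + [(u, v, ideal)] + edges[i + 1:]
    return longest_from(edges, source)[sink] - longest_from(idealized, source)[sink]


# Hypothetical graph: a 13-cycle path through a 10-cycle miss, and a 7-cycle path.
edges = [("src", "a", 10), ("a", "sink", 3), ("src", "b", 4), ("b", "sink", 3)]
slack_b_in = edge_slack(edges, 2, "src", "sink")
slack_b_out = edge_slack(edges, 3, "src", "sink")
miss_cost = edge_cost(edges, 0, "src", "sink")

# Slack "sharing": delay both edges of the short path by their full slack.
both_delayed = edges[:2] + [("src", "b", 4 + 6), ("b", "sink", 3 + 6)]
print(slack_b_in, slack_b_out, miss_cost, longest_from(both_delayed, "src")["sink"])
```

Each edge on the short path reports 6 cycles of slack on its own, but spending both at once lengthens execution, which is why slack has to be apportioned rather than read off per edge.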

Page 22:

Machine Imbalance

[Figure: percent of dynamic instructions (y-axis, 0-100) versus number of cycles of slack (x-axis, 0-100) for perl, with one curve for apportioned slack and one for global slack.]

~80% of insts have at least 5 cycles of apportioned slack

Page 23:

Criticality Challenges

• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware

Page 24:

Simple criticality not always enough

Sometimes events have nearly equal criticality

miss #1 (99)
miss #2 (100)

Want to know
• how critical is each event?
• how far from critical is each event?

Actually, even that is not enough

Page 25:

Our solution: measure interactions

Two parallel cache misses

miss #1 (99)
miss #2 (100)

Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100

Aggregate cost > Sum of individual costs (100 > 0 + 1) ⇒ Parallel interaction

icost = aggregate cost – sum of individual costs
      = 100 – 0 – 1 = 99
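The arithmetic for the two parallel misses can be checked with a small Python sketch. The execution-time model here is an assumption made for illustration (the misses overlap completely, and idealizing a miss means giving it zero latency); the function names are hypothetical.

```python
def exec_time(latencies):
    """Assumed toy model: fully overlapped misses, so time is the longest one."""
    return max(latencies, default=0)


def cost(latencies, removed):
    """Speedup from idealizing (zeroing) the misses whose indices are in `removed`."""
    idealized = [0 if i in removed else lat for i, lat in enumerate(latencies)]
    return exec_time(latencies) - exec_time(idealized)


def icost(latencies, a, b):
    """icost = aggregate cost minus the sum of the individual costs."""
    return cost(latencies, {a, b}) - cost(latencies, {a}) - cost(latencies, {b})


misses = [99, 100]  # miss #1 and miss #2
print(cost(misses, {0}), cost(misses, {1}), cost(misses, {0, 1}), icost(misses, 0, 1))
```

The positive result says that neither miss is worth much alone, but removing both together recovers nearly the whole latency.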

Page 26:

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
   [Figure: miss #1 and miss #2 drawn in parallel]
2. Zero icost ⇒ ?

Page 27:

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
   [Figure: miss #1 and miss #2 drawn in parallel]
2. Zero icost ⇒ independent
   [Figure: miss #1 . . . miss #2, far apart]
3. Negative icost ⇒ ?

Page 28:

Negative icost

Two serial cache misses (data dependent)

miss #1 (100)

miss #2 (100)

Cost(miss #1) = ?

ALU latency (110 cycles)

Page 29:

Negative icost

Two serial cache misses (data dependent)

miss #1 (100)
miss #2 (100)
ALU latency (110 cycles)

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs
      = 90 – 90 – 90 = -90

Negative icost ⇒ serial interaction
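The serial case can be worked the same way. The sketch below assumes a toy model in which the two dependent misses run back to back while an independent 110-cycle ALU chain runs alongside them; the model, the dictionary keys, and the `classify` helper are illustrative assumptions, not part of the slides.

```python
def serial_exec_time(lat):
    """Assumed toy model: two dependent misses in series, overlapped with
    an independent ALU chain."""
    return max(lat["miss1"] + lat["miss2"], lat["alu"])


def cost(exec_time, lat, events):
    """Speedup from idealizing (zeroing) the named events."""
    idealized = {name: (0 if name in events else cycles) for name, cycles in lat.items()}
    return exec_time(lat) - exec_time(idealized)


def icost(exec_time, lat, a, b):
    """icost = aggregate cost minus the sum of the individual costs."""
    return (cost(exec_time, lat, {a, b})
            - cost(exec_time, lat, {a})
            - cost(exec_time, lat, {b}))


def classify(value):
    """Map the sign of an icost to the interaction type named on the slides."""
    if value > 0:
        return "parallel interaction"
    if value < 0:
        return "serial interaction"
    return "independent"


lat = {"miss1": 100, "miss2": 100, "alu": 110}
value = icost(serial_exec_time, lat, "miss1", "miss2")
print(cost(serial_exec_time, lat, {"miss1"}), value, classify(value))
```

Removing either miss already exposes the ALU chain, so removing the second one as well buys nothing more; that is what the negative value records.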

Page 30:

Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
   [Figure: miss #1 and miss #2 drawn in parallel]
2. Zero icost ⇒ independent
   [Figure: miss #1 . . . miss #2, far apart]
3. Negative icost ⇒ serial interaction
   [Figure: miss #1 followed by miss #2, alongside ALU latency]

Other labels on the slide: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall

Page 31:

Why care about serial interactions?

ALU latency (110 cycles)
miss #1 (100)
miss #2 (100)

Reason #1: We are over-optimizing!
Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize
Prefetching miss #2 has the same effect as prefetching miss #1

Page 32:

Icost Case Study: Deep pipelines

Looking for serial interactions!

Dcache (DL1)

1 4

Page 36:

Icost Breakdown (6 wide, 64-entry window)

              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0     6.0      15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss     0.1      0.0      0.4
...           ...      ...      ...
Total         100.0    100.0    100.0

Page 37:

Icost Case Study: Deep pipelines

[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and weighted edges; edges are labeled “DL1 access” and “window edge”.]

Page 43:

Criticality Challenges

• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware

Page 44:

Exploit in Hardware

• Criticality Analyzer
  • Online, fast-feedback
  • Limited to critical/not critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs entire graph

Page 45:

Only last-arriving edges can be critical

Observation: R1 ← R2 + R3

If dependence into R2 is on critical path, then value of R2 arrived last.

critical ⇒ arrives last
arrives last ⇏ critical

[Figure: an E node with incoming R2 and R3 dependences; the R3 dependence is resolved early.]

Page 46:

Determining last-arrive edges

Observe events within the machine

last_arrive[F] =
  EF if branch misp.
  CF if ROB stall
  FF otherwise

last_arrive[E] =
  FE if data ready on fetch
  EE observe arrival order of operands

last_arrive[C] =
  EC if commit pointer is delayed
  CC otherwise
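The three rules can be written down as a small classifier. This is a sketch only: the `InstrObservation` record and its field names are hypothetical stand-ins for whatever signals the hardware actually observes, and the fallback to FE when no producer is known is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstrObservation:
    """Hypothetical per-instruction signals; the field names are assumptions."""
    fetched_after_mispredict: bool        # fetch waited on a mispredicted branch
    fetch_stalled_on_rob: bool            # fetch waited for a free ROB entry
    data_ready_on_fetch: bool             # all operands were ready at fetch
    last_operand_producer: Optional[int]  # producer whose value arrived last
    commit_pointer_delayed: bool          # commit waited on this instruction


def last_arrive_edges(obs: InstrObservation) -> dict:
    """Pick the last-arriving incoming edge type for the F, E, and C nodes."""
    if obs.fetched_after_mispredict:
        f_edge = "EF"
    elif obs.fetch_stalled_on_rob:
        f_edge = "CF"
    else:
        f_edge = "FF"

    # Assumed fallback: with no known producer, treat the fetch edge as last.
    if obs.data_ready_on_fetch or obs.last_operand_producer is None:
        e_edge = "FE"
    else:
        e_edge = "EE"

    c_edge = "EC" if obs.commit_pointer_delayed else "CC"
    return {"F": f_edge, "E": e_edge, "C": c_edge}


obs = InstrObservation(False, True, False, 7, True)
print(last_arrive_edges(obs))
```

For an EE edge the source node is the producer recorded in `last_operand_producer`; the sketch returns only the edge type to stay close to the slide.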

Page 47:

Last-arrive edges

The last-arrive rule

CP consists only of “last-arrive” edges

[Figure: dependence graph with F, E, and C node rows.]

Page 48:

Prune the graph

Only need to put last-arrive edges in graph
No other edges could be on CP

[Figure: the pruned graph with F, E, and C node rows; the newest instruction is marked.]

Page 49:

…and we’ve found the critical path!

Backward propagate along last-arrive edges

[Figure: the pruned graph, traced backward from the newest instruction.]

Found CP by only observing last-arrive edges, but still requires constructing entire graph
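Backward propagation over a pruned graph is a simple pointer chase, because every node keeps exactly one incoming (last-arrive) edge. The sketch below assumes nodes are `(instruction_index, stage)` pairs and that the pruned graph is stored as a dictionary from each node to its last-arrive predecessor; both the representation and the three-instruction example are hypothetical.

```python
def critical_path(last_arrive, newest):
    """Walk last-arrive edges backward from `newest` until a node has none."""
    path = [newest]
    while path[-1] in last_arrive:
        path.append(last_arrive[path[-1]])
    path.reverse()
    return path


# Hypothetical pruned graph for three instructions: node -> last-arrive predecessor.
last_arrive = {
    (0, "E"): (0, "F"),
    (0, "C"): (0, "E"),
    (1, "F"): (0, "F"),
    (1, "E"): (0, "E"),  # data dependence on instruction 0
    (1, "C"): (1, "E"),
    (2, "F"): (1, "F"),
    (2, "E"): (2, "F"),
    (2, "C"): (1, "C"),  # in-order commit
}
cp = critical_path(last_arrive, (2, "C"))
print(cp)
```

The walk itself is cheap; the storage for `last_arrive` across every in-flight instruction is the part the slide flags as the remaining cost.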

Page 50:

Step 2. Reducing storage reqs

CP is a “long” chain of last-arrive edges.
⇒ The longer a given chain of last-arrive edges, the more likely it is part of the CP

Algorithm: find sufficiently long last-arrive chains

1. Plant token into a node n
2. Propagate forward, only along last-arrive edges
3. Check for token after several hundred cycles
4. If token alive, n is assumed critical
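The four steps can be simulated in software to see how a token lives or dies. This is a sketch under assumptions: nodes are `(instruction_index, stage)` pairs, the last-arrive edges are given as a dictionary, and the "several hundred cycles" check is approximated by an instruction-distance `horizon`; none of these choices come from the slides.

```python
def token_survives(last_arrive, planted, horizon):
    """Plant a token at `planted`, pass it forward along last-arrive edges,
    and report whether it reached an instruction `horizon` or more later."""
    # Visiting nodes by (instruction, F < E < C) sees every predecessor first,
    # since edges only go forward in program order or from F to E to C.
    nodes = set(last_arrive) | set(last_arrive.values())
    order = sorted(nodes, key=lambda n: (n[0], "FEC".index(n[1])))
    holders = {planted}
    for node in order:
        if last_arrive.get(node) in holders:
            holders.add(node)
    return any(instr - planted[0] >= horizon for instr, _ in holders)


# Hypothetical last-arrive edges for three instructions: node -> predecessor.
last_arrive = {
    (0, "E"): (0, "F"),
    (0, "C"): (0, "E"),
    (1, "F"): (0, "F"),
    (1, "E"): (0, "E"),
    (1, "C"): (1, "E"),
    (2, "F"): (1, "F"),
    (2, "E"): (2, "F"),
    (2, "C"): (1, "C"),
}
alive = token_survives(last_arrive, (0, "E"), horizon=2)
dead = token_survives(last_arrive, (0, "C"), horizon=2)
print(alive, dead)
```

A token planted at a node with no outgoing last-arrive edge never spreads, so it fails the check; one planted on a long chain is still present at the horizon and the node is assumed critical.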

Page 51:

Online Criticality Detection

Forward propagate token

[Figure: last-arrive graph with F, E, and C node rows up to the newest instruction; a token is planted at one node.]

Page 52:

Online Criticality Detection

Forward propagate token

[Figure: the same graph; the planted tokens “die”.]

Page 53:

Online Criticality Detection

Forward propagate token

[Figure: the same graph with a token planted at another node.]

Token survives!

Page 54:

Putting it all together

[Block diagram with these labeled parts:]
• OOO Core
• Token-Passing Analyzer
• CP prediction table
• Training Path: last-arrive edges (producer → retired instr)
• Prediction Path: PC → E-critical?

Page 55:

Results

• Performance (Speed)
  • Scheduling in clustered machines
    • 10% speedup
  • Selective value prediction
  • Deferred scheduling (Crowe, et al.)
    • 11% speedup
  • Heterogeneous cache (Rakvic, et al.)
    • 17% speedup
• Energy
  • Non-uniform machine: fast and slow pipelines
    • ~25% less energy
  • Instruction queue resizing (Sasanka, et al.)
  • Multiple frequency scaling (Semeraro, et al.)
    • 19% less energy with 3% less performance
  • Selective pre-execution (Petric, et al.)

Page 56:

Exploit in Hardware

• Criticality Analyzer
  • Online, fast-feedback
  • Limited to critical/not critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs entire graph

Page 57:

Profiling goal

Goal:
• Construct graph

Constraint:
• Can only sample sparsely

[Figure: a long sequence of many dynamic instructions.]

Page 58:

Profiling goal

Goal:
• Construct graph

Constraint:
• Can only sample sparsely

Genome sequencing

[Figure: a DNA strand.]

Page 62:

“Shotgun” genome sequencing

[Figure: a DNA strand and many short sampled fragments of it.]

Find overlaps among samples

Page 63:

Mapping “shotgun” to our situation

[Figure: a long sequence of many dynamic instructions, each marked with one of the events in the legend.]

Legend: Icache miss, Dcache miss, Branch misp., No event

Page 65:

Profiler hardware requirements

[Figure: sampled fragments of the instruction sequence.]

Match!
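The overlap-finding idea borrowed from shotgun sequencing can be illustrated with a toy stitcher. Everything here is an assumption made for the example: fragments are strings of per-instruction event codes ("I" Icache miss, "D" Dcache miss, "B" branch mispredict, "." no event), they arrive in order, and matching is exact. The transcript does not specify how the actual profiler matches samples.

```python
def find_overlap(left, right, min_overlap):
    """Largest k >= min_overlap such that the last k events of `left`
    equal the first k events of `right`; 0 if there is no such k."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if k > 0 and left[-k:] == right[:k]:
            return k
    return 0


def stitch(fragments, min_overlap):
    """Greedily merge ordered fragments wherever the end of the assembled
    sequence overlaps the start of the next fragment."""
    assembled = list(fragments[0])
    for frag in fragments[1:]:
        k = find_overlap(assembled, list(frag), min_overlap)
        if k == 0:
            raise ValueError("no sufficiently long overlap")
        assembled.extend(frag[k:])
    return assembled


# Hypothetical sampled fragments of one long dynamic instruction stream.
fragments = ["..D.B", "D.B..I", ".I.D"]
stream = "".join(stitch(fragments, min_overlap=2))
print(stream)
```

Longer required overlaps make false matches less likely at the price of needing denser sampling, which is the trade-off any such profiler would have to tune.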

Page 66:

Sources of error

Error Source                          Gcc      Parser   Twolf
Modeling execution as a graph         2.1 %    6.0 %    0.1 %
Errors in graph construction          5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments   4.8 %    6.5 %    7.2 %
Total                                 12.2 %   14.0 %   8.9 %

Page 67:

Conclusion: Grand Challenges

• Cost
  • How much speedup possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?

Slide callouts: modeling, token-passing analyzer, parallel interactions, serial interactions, shotgun profiling

Page 68:

Conclusion: Bottleneck Analysis Applications

Run-time Optimization
• Effective speculation: Selective value prediction
• Resource arbitration: Scheduling and steering in clustered processors
• Dynamic reconfiguration: Resize instruction window
• Energy efficiency: Non-uniform machines

Design Decisions
• Overcoming technology constraints: Helped cope with high-latency dcache

Programmer Performance Tuning
• Where have the cycles gone? Measured cost of cache misses/branch mispredicts

Page 69:

Outline

Simple Criticality
• Definition (ISCA ’01)
• Detection (ISCA ’01)
• Application (ISCA ’01-’02)

Advanced Criticality
• Interpretation (MICRO ’03)
  • What types of interactions are possible?
• Hardware Support (MICRO ’03, TACO ’04)
  • Enhancement to performance counters

Page 79: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Profiling goal

Goal: • Construct graph

many dynamic instructions

Constraint:• Can only sample sparsely

Page 80: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Profiling goal

Goal: • Construct graph

Constraint:• Can only sample sparsely

DNA

DNA strand

Genome sequencing

Page 81: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

“Shotgun” genome sequencing

DNA

Page 82: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

“Shotgun” genome sequencing

DNA

Page 83: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

“Shotgun” genome sequencing

. . .. . .

DNA

Page 84: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

“Shotgun” genome sequencing

. . .. . .

. . . . . .

Find overlaps among samples

DNA

Page 85: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Mapping “shotgun” to our situation

many dynamic instructions

Icache miss

Dcache missBranch misp.No event

Page 86: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

. . .. . .

Profiler hardware requirements

Page 87: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

. . .. . .

Profiler hardware requirements

Match!

Page 88: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Sources of error

Error Source Gcc Parser Twolf

Page 89: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Sources of error

Error Source Gcc Parser Twolf

Modeling execution as a graph

2.1 % 6.0% 0.1 %

Page 90: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Sources of error

Error Source Gcc Parser Twolf

Modeling execution as a graph

2.1 % 6.0% 0.1 %

Errors in graph construction

5.3 % 1.5 % 1.6 %

Page 91: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Sources of error

Error Source Gcc Parser Twolf

Modeling execution as a graph

2.1 % 6.0% 0.1 %

Errors in graph construction

5.3 % 1.5 % 1.6 %

Sampling only a few graph fragments

4.8 % 6.5 % 7.2 %

Total 12.2 % 14.0 % 8.9 %

Page 92: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Conclusion: Bottleneck Analysis Applications

Run-time Optimization• Effective speculation

• Resource arbitration

• Dynamic reconfiguration

• Energy efficiency

Design Decisions• Overcoming technology constraints

Programmer Performance Tuning• Where have the cycles gone?

Selective value prediction

Scheduling and steering in clustered processors

Resize instruction window

Non-uniform machines

Helped cope with high-latency dcache

Measured cost of cache misses/branch

mispredicts

Page 93: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Conclusion: Grand Challenges

• Cost

  • How much speedup possible from optimizing an event?

• Slack

  • How much can an event be “slowed down” before increasing execution time?

• Interactions

  • When do multiple events need to be optimized simultaneously?

  • When do we have a choice?

modeling

token-passing analyzer

parallel interactions

serial interactions

shotgun profiling

Page 94: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Backup Slides

Page 95: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Related Work

Page 96: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Criticality Prior Work

Critical-Path Method, PERT charts

• Developed for Navy’s “Polaris” project (1957)

• Used as a project management tool

• Simple critical-path, slack concepts

“Attribution” Heuristics

• Rosenblum et al.: SOSP-1995, and many others

• Marks instruction at head of ROB as critical, etc.

• Empirically, has limited accuracy

• Does not account for interactions between events

Page 97: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Related Work: Microprocessor Criticality

Latency tolerance analysis

• Srinivasan and Lebeck: MICRO-1998

Heuristics-driven criticality predictors

• Tune et al.: HPCA-2001

• Srinivasan et al.: ISCA-2001

“Local” slack detector

• Casmira and Grunwald: Kool Chips Workshop-2000

ProfileMe with pair-wise sampling

• Dean, et al.: MICRO-1997

Page 98: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Unresolved Issues

Page 99: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Alternative I: Addressing Unresolved Issues

Modeling and Measurement

• What resources can we model effectively?

  • difficulty with mutual-exclusion-type resources (ALUs)

• Efficient algorithms

• Release tool for measuring cost/slack

Hardware

• Detailed design for criticality analyzer

• Shotgun profiler simplifications

  • gradual path from counters

Optimization

• explore heuristics for exploiting interactions

Page 100: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Alternative II: Chip-Multiprocessors

Design Decisions

• Should each core support out-of-order execution?

• Should SMT be supported?

• How many processors are useful?

• What is the effect of inter-processor latency?

Programmer Performance Tuning

Parallelizing applications

• What makes a good division into threads?

• How can we find them automatically, or at least help programmers to find them?

Page 101: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Unresolved issues

Modeling and Measurement

• What resources can we model effectively?

  • difficulty with mutual-exclusion-type resources (ALUs)

  • In other words, unanticipated side effects

Example instruction sequence:

1. ld r2, [Mem]
2. add r3 ← r2 + 1
3. ld r4, [Mem]
4. add r6 ← r4 + 1

[Figure: F/E/C dependence graphs for this sequence in two executions. Original Execution: loads 1 and 3 are both cache misses. Altered Execution (to compute cost of inst #3 cache miss): load 1 is a cache miss and load 3 is a cache hit. Labels: no contention; adder contention; contention edge; incorrect critical path due to contention edge (“Should not be here”).]

Page 102: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Unresolved issues

Modeling and Measurement (cont.)

• How should processor policies be modeled?

  • relationship to icost definition

• Efficient algorithms for measuring icosts

  • pairs of events, etc.

• Release tool for measuring cost/slack

Page 103: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Unresolved issues

Hardware

• Detailed design for criticality analyzer

  • help to convince industry-types to build it

• Shotgun profiler simplifications

  • gradual path from counters

Optimization

• Explore icost optimization heuristics

  • icosts are difficult to interpret

Page 104: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Validation

Page 105: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Validation: can we trust our model?

Run two simulations:

• Reduce CP latencies: expect “big” speedup

• Reduce non-CP latencies: expect no speedup

Page 106: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Validation: can we trust our model?

[Figure: bar chart of speedup per cycle reduced (0 to 1) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; series: Reducing CP Latencies, Reducing non-CP Latencies.]

Page 107: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Validation

Two steps:

1. Increase latencies of insts. by their apportioned slack, for three apportioning strategies:

   1) latency + 1
   2) 5 cycles to as many instructions as possible
   3) 12 cycles to as many loads as possible

2. Compare to baseline (no delays inserted)

Page 108: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Validation

[Figure: bar chart of percent of execution time (0% to 120%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: baseline, latency + 1, 12 cycles to loads, five cycles.]

Worst case: Inaccuracy of 0.6%

Page 109: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack Measurements

Page 110: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Three slack variants

Local slack: # cycles latency can be increased without delaying any subsequent instructions

Global slack: # cycles latency can be increased without delaying the last instruction in the program

Apportioned slack: distribute global slack among instructions using an apportioning strategy
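The difference between local and global slack can be made concrete on a small dependence graph. What follows is a minimal sketch and not the measurement tool used in the talk: the graph, the node names, and the `slack` helper are hypothetical, and slack is computed per node from earliest and latest completion times in a weighted DAG.

```python
from collections import defaultdict

def slack(edges, order):
    """Return (local, global) slack per node.

    edges: {(producer, consumer): latency}; order: nodes in topological
    order with a single sink last. Both are illustrative assumptions.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for (u, v), w in edges.items():
        preds[v].append((u, w))
        succs[u].append((v, w))

    # Earliest time each node can complete: longest path from the start.
    earliest = {}
    for n in order:
        earliest[n] = max((earliest[p] + w for p, w in preds[n]), default=0)

    # Latest time each node may complete without delaying the sink.
    end = earliest[order[-1]]
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[s] - w for s, w in succs[n]), default=end)

    global_slack = {n: latest[n] - earliest[n] for n in order}
    # Local slack: delay that does not push back any direct consumer.
    local_slack = {
        n: min((earliest[s] - earliest[n] - w for s, w in succs[n]),
               default=global_slack[n])
        for n in order
    }
    return local_slack, global_slack

# Hypothetical graph: a short chain a -> b -> e -> d beside a long path a -> c -> d.
edges = {("a", "b"): 1, ("b", "e"): 1, ("e", "d"): 1,
         ("a", "c"): 5, ("c", "d"): 1}
local, glob = slack(edges, ["a", "b", "e", "c", "d"])
print("local:", local)
print("global:", glob)
```

In this example node b has no local slack, because any delay pushes back its consumer e, yet it has 3 cycles of global slack, because the chain through e finishes well before the long path through c.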

Page 111: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack measurements

[Figure: percent of dynamic instructions (y-axis, 0 to 100) vs. number of cycles of slack (x-axis, 0 to 100) for perl; curve: local.]

~21% insts have at least 5 cycles of local slack

Page 112: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack measurements

[Figure: percent of dynamic instructions (y-axis, 0 to 100) vs. number of cycles of slack (x-axis, 0 to 100) for perl; curves: local, global.]

~90% insts have at least 5 cycles of global slack

Page 113: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack measurements

[Figure: percent of dynamic instructions (y-axis, 0 to 100) vs. number of cycles of slack (x-axis, 0 to 100) for perl; curves: local, apportioned, global.]

~80% insts have at least 5 cycles of apportioned slack

A large amount of exploitable slack exists

Page 114: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Application-centered Slack Measurements

Page 115: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Load slack

Can we tolerate a long-latency L1 hit?

design: wire-constrained machine, e.g. Grid

non-uniformity: multi-latency L1

apportioning strategy: apportion ALL slack to load instructions

Page 116: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Apportion all slack to loads

[Figure: percent of dynamic loads (y-axis, 0 to 100) vs. number of cycles of slack on load instructions (x-axis, 0 to 100); curves: gcc, perl, gzip.]

Most loads can tolerate an L2 cache hit

Page 117: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Multi-speed ALUs

Can we tolerate ALUs running at half frequency?

design: fast/slow ALUs

non-uniformity: multi-latency execution latency, bypass

apportioning strategy: give slack equal to original latency + 1

Page 118: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Latency+1 apportioning

[Figure: bar chart of percent of dynamic instructions (0% to 100%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average.]

Most instructions can tolerate doubling their latency

Page 119: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack Locality and Prediction

Page 120: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Predicting slack

Two steps to PC-indexed, history-based prediction:

1. Measure slack of a dynamic instruction
2. Store in array indexed by PC of static instruction

Two requirements:

1. Locality of slack
2. Ability to measure slack of a dynamic instruction
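The two steps above can be sketched as a small table. This is only an illustration under stated assumptions: the class name `SlackPredictor`, the table size, the 4-byte instruction alignment, and the "keep the most recent measurement" update rule are all hypothetical and not taken from the talk.

```python
class SlackPredictor:
    """PC-indexed, history-based slack table (illustrative sketch)."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries  # 0 means "assume no slack"

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low two PC bits carry no information.
        return (pc >> 2) % self.entries

    def train(self, pc, measured_slack):
        # Store the slack measured for one dynamic instance.
        self.table[self._index(pc)] = measured_slack

    def predict(self, pc):
        # Later dynamic instances of the same static instruction reuse it.
        return self.table[self._index(pc)]

predictor = SlackPredictor()
predictor.train(0x4000, 5)           # one dynamic instance measured 5 cycles of slack
print(predictor.predict(0x4000))     # trained entry
print(predictor.predict(0x4004))     # untrained entry
```

The predictor is only useful if slack has locality, that is, if the slack of one dynamic instance is a good guess for later instances of the same static instruction.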

Page 121: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Locality of slack

[Figure: bar chart of percent of (weighted) static instructions (0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal.]

Page 122: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Locality of slack

[Figure: bar chart of percent of (weighted) static instructions (0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal, 100%.]

Page 123: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Locality of slack

[Figure: bar chart of percent of (weighted) static instructions (0 to 100) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: ideal, 100%, 95%, 90%.]

PC-indexed, history-based predictor can capture most of the available slack

Page 124: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack Detector

Problem #1: Iterating repeatedly over same dynamic instruction

Solution: Only sample dynamic instruction once

Problem #2: Determining if overall execution time increased

Solution: Check if delay made instruction critical

delay and observe effective for hardware predictor

Page 125: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack Detector

Goal: Determine whether instruction has n cycles of slack

1. Delay the instruction by n cycles
2. Check if critical (via critical-path analyzer)
3. No: instruction has n cycles of slack
4. Yes: instruction does not have n cycles of slack

delay and observe

Page 126: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack Application

Page 127: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Fast/slow cluster microarchitecture

[Figure: block diagram. Fetch + Rename feeds a Steer stage that dispatches to a fast, 3-wide cluster and a slow, 3-wide cluster (each with WIN, Reg, and ALUs); the clusters are connected by a Bypass Bus and share the Data Cache.]

Aggressive non-uniform design:

• Higher execution latencies

• Increased (cross-domain) bypass latency

• Decreased effective issue bandwidth

P F2

save ~37% core power

Page 128: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Picking bins for the slack predictor

Use implicit slack predictor with four bins:

1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority

Two decisions

1. Steer to fast/slow cluster

2. Schedule with high/low priority within a cluster

Page 129: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Slack-based policies

[Figure: bar chart of normalized IPC (0 to 1.1) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, and average; series: 2 fast, high-power clusters; slack-based policy; reg-dep steering.]

10% better performance from hiding non-uniformities

Page 130: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP case study

Page 131: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Multithreaded Execution Case Study

Two questions:

• How should a program be divided into threads?

  • what makes a good cutpoint?

  • how can we find them automatically, or at least help programmers find them?

• What should a multiple-core design look like?

  • should each core support out-of-order execution?

  • should SMT be supported?

  • how many processors are useful?

  • what is the effect of inter-processor latency?

Page 132: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Parallelizing an application

Why parallelize a single-thread application?

• Legacy code, large code bases

• Difficult to parallelize apps

  • Interpreted code, kernels of operating systems

• Like to use better programming languages

  • Scheme, Java instead of C/C++

Page 133: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Parallelizing an application

Simplifying assumption

• Program binary unchanged

Simplified problem statement

• Given a program of length L, find a cutpoint that divides the program into two threads and provides maximum speedup

Must consider:

• data dependences, execution latencies, control dependences, proper load balancing

Page 134: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Parallelizing an application

Naive solution:

• try every possible cutpoint

Our solution:

• efficiently determine the effect of every possible cutpoint

• model execution before and after every cut
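For contrast with the graph-based approach, the naive solution can be sketched as a brute-force search. This is a toy model only, not the talk's simulator: the functions `finish_time` and `best_cutpoint`, the in-order two-thread timing model, and the example latencies and dependences are all assumptions made for illustration.

```python
def finish_time(latency, deps, cut):
    """Completion time when instructions [0, cut) and [cut, L) run as two
    in-order threads. deps maps an instruction to earlier producers."""
    done = [0] * len(latency)
    for i, lat in enumerate(latency):
        # Previous instruction in the same thread (none at a thread start).
        prev = done[i - 1] if i > 0 and i != cut else 0
        # Data dependences may cross the cut between the two threads.
        ready = max((done[p] for p in deps.get(i, [])), default=0)
        done[i] = max(prev, ready) + lat
    return max(done)

def best_cutpoint(latency, deps):
    # Naive search: evaluate every possible cutpoint and keep the fastest.
    return min(range(1, len(latency)),
               key=lambda c: finish_time(latency, deps, c))

# Hypothetical 4-instruction trace: 1 depends on 0, and 3 depends on 2.
latency = [3, 3, 3, 3]
deps = {1: [0], 3: [2]}
cut = best_cutpoint(latency, deps)
print(cut, finish_time(latency, deps, cut))
```

Each call to `finish_time` re-walks the whole trace, so the search costs on the order of L squared steps for a trace of length L, which is the cost the graph-based analysis aims to avoid.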

Page 135: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Solution

[Figure: dependence graph with F, E, and C nodes per instruction and weighted edges, running from “start” at the first instruction to the last instruction.]

Page 136: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Parallelizing an application

Considerations:

• Synchronization overhead

  • add latency to EE edges

  • Synchronization may involve turning EE to EF

• Scheduling of threads

  • additional CF edges

Challenges:

• State behavior (one thread to multiple processors)

  • caches, branch predictor

• Control behavior

  • limits where cutpoints can be made

Page 137: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Parallelizing an application

More general problem:

• Divide a program into N threads

  • NP-complete

Icost can help:

• icost(p1,p2) << 0 implies p1 and p2 redundant

  • action: move p1 and p2 further apart

Page 138: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Preliminary Results

Experimental Setup

• Simulator, based loosely on SimpleScalar

• Alpha SpecInt binaries

Procedure

1. Assume execution trace is known

2. Look at each 1k run of instructions

3. Test every possible cutpoint using 1k graphs

Page 139: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Dynamic Cutpoints

Cost Distribution of Dynamic Cutpoints

[Figure: cumulative pct. of cutpoints (y-axis, 0 to 100) vs. execution time reduction in cycles (x-axis, 0 to 100); curves: bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr.]

Only 20% of cuts yield benefits of > 20 cycles

Page 140: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Usefulness of cost-based policy

Speedups from parallelizing programs for a two-processor system

[Figure: bar chart of speedup % (0 to 30) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr; series: fixed-interval, simple cost-based.]

Page 141: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Static Cutpoints

Cost Distribution of Static Cutpoints

[Figure: cumulative pct. of instructions (y-axis, 0 to 100) vs. avg. per-dynamic-instance cost of static instructions (x-axis, 0 to 180); curves: bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr.]

Up to 60% of cuts yield benefits of > 20 cycles

Page 142: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Future Avenues of Research

• Map cutpoints back to actual code

  • Compare automatically generated cutpoints to human-generated ones

  • See what performance gains are in a simulator, as opposed to just on the graph

• Look at the effect of synchronization operations

  • What additional overhead do they introduce?

• Deal with state, control problems

  • Might need some technique outside of the graph

Page 143: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Multithreaded Execution Case Study

Two possible questions:

• How should a program be divided into threads?

  • what makes a good cutpoint?

  • how can we find them automatically, or at least help programmers find them?

• What should a multiple-core design look like?

  • should each core support out-of-order execution?

  • should SMT be supported?

  • how many processors are useful?

  • what is the effect of inter-processor latency?

Page 144: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study

What we can do:

• Try out many configurations quickly

  • dramatic changes in architecture are often only small changes in graph

• Identifying bottlenecks

  • especially interactions

Page 145: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure

• model execution with different configurations

  • adjust CD edges

• compute breakdowns

  • notice resource/events interacting with CD edges

Page 146: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Out-of-orderness

[Figure: dependence graph with F, E, and C nodes per instruction and weighted edges, running from the first instruction to the last instruction.]

Page 147: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Out-of-orderness

Results summary

• Single-core: Performance taps out at 256 entries

• CMP: Performance gains up through 1024 entries

  • some benchmarks see gains up to 16k entries

Why more beneficial?

• Use breakdowns to find out...

Page 148: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Out-of-orderness

Components of window cost

• cache misses holding up retirement?

• long strands of data dependencies?

• predictable control flow?

Icost breakdowns give quantitative and qualitative answers

Page 149: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Out-of-orderness

cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0

[Figure: stacked bars of window cost (0% to 100%) split into ALU, cache misses, and interaction components, for three cases: Independent, Parallel Interaction, and Serial Interaction. Label: “equal”.]
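The three cases named in the figure can be illustrated numerically. The sketch below assumes the usual interaction-cost definition, icost(a, b) = cost({a, b}) - cost(a) - cost(b), where the cost of a set of events is the execution time saved by idealizing them. The path-based timing model, the helper names, and the latencies are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def exec_time(paths, idealized=frozenset()):
    """Execution time = longest path; idealized events take zero cycles."""
    return max(sum(0 if ev in idealized else lat for ev, lat in path)
               for path in paths)

def cost(paths, events):
    # Cycles saved when the given events are idealized together.
    return exec_time(paths) - exec_time(paths, frozenset(events))

def icost(paths, a, b):
    # Assumed definition: joint cost minus the two individual costs.
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})

# Each path is a list of (event, latency) pairs; all values are made up.
independent = [[("a", 100), ("b", 100)]]                # a then b, nothing else
parallel = [[("a", 100)], [("b", 100)]]                 # a and b overlap
serial = [[("a", 100), ("b", 100)], [("other", 100)]]   # a, b in series beside another path

print(icost(independent, "a", "b"))
print(icost(parallel, "a", "b"))
print(icost(serial, "a", "b"))
```

In the parallel case neither event alone saves any time but both together do, so the interaction cost is positive; in the serial case each event alone already exposes the other path, so optimizing both is redundant and the interaction cost is negative.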

Page 150: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Summary of Preliminary Results

icost(window, ALU operations) << 0

• primarily communication between processors

• window often stalled waiting for data

Implications

• larger window may be overkill

• need a cheap non-blocking solution

  • e.g., continual-flow pipelines

Page 151: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: SMT?

Benefits

• reduced thread start-up latency

• reduced communication costs

How we could help

• distribution of thread lengths

• breakdowns to understand effect of communication

Page 152: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: How many processors?

[Figure: thread timeline; labels: #1, #2, #1, Start #1, #2.]

Page 153: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

CMP design study: Other Questions

What is the effect of inter-processor communication latency?

• understand hidden vs. exposed communication

Allocating processors to programs

• methodology for O/S to better assign programs to processors

Page 154: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Waterfall To Graph Story

Page 155: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Standard Waterfall Diagram

[Figure: waterfall diagram over cycles 1 to 15, marking the fetch (F), execute (E), and commit (C) cycle of each of the following instructions.]

R5 = 0
R3 = 0
R1 = #array + R3
R6 = ld[R1]
R3 = R3 + 1
R5 = R6 + R5
cmp R6, 0
bf L1
R5 = R5 + 100
R0 = R5
Ret R0

Page 156: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Annotated with Dependence Edges

[Figure: the same waterfall diagram (cycles 1 to 15, same 11-instruction sequence) with dependence edges drawn between the F, E, and C events.]

Page 157: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Annotated with Dependence Edges

[Figure: the same waterfall diagram with dependence edges labeled by type: Fetch BW, Data Dep, ROB, Branch Misp.]

Page 158: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Edge Weights Added

[Figure: the same waterfall diagram with each dependence edge annotated with a latency weight in cycles.]

Page 159: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Convert to Graph

[Figure: the waterfall diagram redrawn as a graph with F, E, and C nodes for each of the same 11 instructions (R5 = 0 through Ret R0) and weighted edges between them.]

Page 160: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Find Critical Path

[Figure: the same graph of F, E, and C nodes for the 11 instructions, with weighted edges and the critical path marked.]

Page 161: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Add Non-last-arriving Edges

[Figure: the same graph of F, E, and C nodes for the 11 instructions, with the non-last-arriving edges and their weights added.]

Page 162: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Graph Alterations

[Figure: the same graph of F, E, and C nodes for the 11 instructions, altered so that the branch misprediction is made correct.]

Page 163: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Token-passing analyzer

Page 164: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Step 1. Observing

Observation: R1 ← R2 + R3

If dependence into R2 is on critical path, then value of R2 arrived last.

critical ⇒ arrives last

arrives last ⇏ critical

[Figure: an E node with incoming dependences from R2 and R3; one dependence is marked “Dependence resolved early”.]

Page 165: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Determining last-arrive edges

Observe events within the machine

last_arrive[F] =
    EF if branch misp.
    CF if ROB stall
    FF otherwise

last_arrive[E] =
    FE if data ready on fetch
    EE observe arrival order of operands

last_arrive[C] =
    EC if commit pointer is delayed
    CC otherwise
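The rules above translate directly into a small classification function. This is a sketch only: the function name and the boolean field names in `inst` are hypothetical stand-ins for events observed in the machine, and the EE case is simplified to a label (the hardware would also record which operand arrived last).

```python
def last_arrive_edges(inst):
    """Pick the last-arriving edge into the F, E, and C nodes of one
    instruction from hypothetical boolean pipeline observations."""
    if inst["fetched_after_mispredict"]:
        f_edge = "EF"   # fetch waited on a mispredicted branch resolving
    elif inst["rob_stall"]:
        f_edge = "CF"   # fetch waited for a ROB entry to be freed by a commit
    else:
        f_edge = "FF"   # fetch simply followed the previous fetch

    # Operands already available at fetch: the instruction's own fetch arrives last.
    e_edge = "FE" if inst["data_ready_on_fetch"] else "EE"

    # Commit pointer delayed by this instruction's execution, else in-order commit.
    c_edge = "EC" if inst["commit_pointer_delayed"] else "CC"
    return {"F": f_edge, "E": e_edge, "C": c_edge}

example = last_arrive_edges({
    "fetched_after_mispredict": False,
    "rob_stall": True,
    "data_ready_on_fetch": False,
    "commit_pointer_delayed": False,
})
print(example)
```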

Page 166: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Last-arrive edges: a CPU stethoscope

[Figure: a CPU emitting a stream of observed last-arrive edges: E→C, E→E, F→E, C→F, F→F, E→F, C→C.]

Page 167: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Last-arrive edges

[Figure: dependence graph with rows of F, E, and C nodes and weighted edges; the last-arrive edges are highlighted.]

Page 168: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Remove latencies

[Figure: the graph of F, E, and C nodes with edge latencies removed.]

Do not need explicit weights

Page 169: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Last-arrive edges

The last-arrive rule

CP consists only of “last-arrive” edges


Page 170: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Prune the graph

Only need to put last-arrive edges in graph

No other edges could be on CP

[Figure: pruned graph of F, E, and C nodes; the most recent instruction is marked “newest”.]

Page 171: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

…and we’ve found the critical path!

Backward propagate along last-arrive edges

[Figure: graph of F, E, and C nodes with the critical path traced backward from the node marked “newest”.]

Found CP by only observing last-arrive edges, but still requires constructing entire graph
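Backward propagation along last-arrive edges can be sketched as a pointer walk. The representation below is an assumption made for illustration: each node is an `(instruction id, type)` pair, and `last_arrive` maps a node to the producer of its last-arriving edge; the tiny two-instruction graph is hypothetical.

```python
def critical_path(last_arrive, newest):
    """Walk last-arrive edges backward from the newest node to the start."""
    path = [newest]
    while path[-1] in last_arrive:
        path.append(last_arrive[path[-1]])
    path.reverse()
    return path

# Hypothetical last-arrive edges: consumer node -> producer node.
last_arrive = {
    (0, "E"): (0, "F"),   # FE edge
    (0, "C"): (0, "E"),   # EC edge
    (1, "F"): (0, "F"),   # FF edge
    (1, "E"): (0, "E"),   # EE edge
    (1, "C"): (1, "E"),   # EC edge
}
cp = critical_path(last_arrive, (1, "C"))
print(cp)
```

The walk itself is cheap, but it needs the whole `last_arrive` map to be stored, which is the cost noted on the slide.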

Page 172: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Step 2. Efficient analysis

CP is a “long” chain of last-arrive edges.

The longer a given chain of last-arrive edges, the more likely it is part of the CP.

Algorithm: find sufficiently long last-arrive chains

1. Plant token into a node n

2. Propagate forward, only along last-arrive edges

3. Check for token after several hundred cycles

4. If token alive, n is assumed critical
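The four steps above can be sketched in software as follows. This is an illustrative model, not the hardware design: nodes are hypothetical `(instruction id, type)` pairs, `last_arrive` maps each node to the producer of its last-arriving edge, and the check "after several hundred cycles" is approximated by a horizon counted in instructions.

```python
KIND_ORDER = {"F": 0, "E": 1, "C": 2}

def token_survives(last_arrive, planted, horizon):
    """Plant a token at `planted`, pass it forward along last-arrive edges,
    and report whether it reaches an instruction `horizon` or more later."""
    holders = {planted}
    # Visit consumers in program order so producers are seen before consumers.
    for node in sorted(last_arrive, key=lambda n: (n[0], KIND_ORDER[n[1]])):
        if last_arrive[node] in holders:
            holders.add(node)
    return any(inst >= planted[0] + horizon for inst, _ in holders)

# Hypothetical trace of 4 instructions: a chain of EE edges carries the token,
# while commit nodes are fed by EC edges.
last_arrive = {
    (0, "E"): (0, "F"), (0, "C"): (0, "E"),
    (1, "F"): (0, "F"), (1, "E"): (0, "E"), (1, "C"): (1, "E"),
    (2, "F"): (1, "F"), (2, "E"): (1, "E"), (2, "C"): (2, "E"),
    (3, "F"): (2, "F"), (3, "E"): (2, "E"), (3, "C"): (3, "E"),
}
print(token_survives(last_arrive, (0, "E"), 3))   # token rides the EE chain
print(token_survives(last_arrive, (1, "C"), 2))   # no last-arrive edge leaves this node
```

A token planted on the execute node of instruction 0 is still alive three instructions later, so that node would be trained as critical; a token planted on the commit node of instruction 1 dies immediately, because no later node names it as its last-arriving producer.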

Page 173: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Token-passing example

1. plant token

2. propagate token

3. is token alive?

4. yes, train critical

[Figure: a token propagating forward through the graph; labels: “Critical”, “ROB Size”.]

Found CP without constructing entire graph

Page 174: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Implementation: a small SRAM array

[Figure: Token Queue (SRAM array). Read: last-arrive producer node (inst id, type). Write: committed (inst id, type).]

Size of SRAM: 3 bits × ROB size < 200 bytes

Simply replicate for additional tokens

Page 175: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Putting it all together

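[Figure: block diagram. Training path: the OOO core sends last-arrive edges (producer → retired instr) to the token-passing analyzer, which updates the CP prediction table. Prediction path: the table is indexed by PC and returns “E-critical?” to the OOO core.]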

Page 176: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Scheduling and Steering

Page 177: Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)

Case Study #1: Clustered architectures

[Figure: clustered architecture showing steering, issue window, and scheduling]

1. Current state of the art (Base)
2. Base + CP Scheduling
3. Base + CP Scheduling + CP Steering


Current State of the Art

[Figure: normalized IPC (y-axis, 0.60 to 1.10) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; bars: unclustered, 2 cluster, 4 cluster]

Constant issue width, clock frequency

Avg. clustering penalty for 4 clusters: 19%


CP Optimizations: Base + CP Scheduling

[Figure: normalized IPC (y-axis, 0.60 to 1.10) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; bars: unclustered, 2 cluster, 4 cluster]


CP Optimizations: Base + CP Scheduling + CP Steering

[Figure: normalized IPC (y-axis, 0.60 to 1.10) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; bars: unclustered, 2 cluster, 4 cluster]

Avg. clustering penalty reduced from 19% to 6%


Token-passing Vs. Heuristics


Local Vs. Global Analysis

[Figure: speedup (y-axis, -5.0% to 25.0%) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; bars: oldest-uncommitted, oldest-unissued, token-passing]

Previous CP predictors: local resource-sensitive predictions (HPCA 01, ISCA 01)

CP exploitation seems to require global analysis


Icost case study


Icost Case Study: Deep pipelines

Deep pipelines cause long-latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …

But we can often mitigate them indirectly

Assume a 4-cycle DL1 access; how to mitigate?
• Increase cache ports? Increase window size?
• Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!

[Figure, shown over several animation steps: dependence graph for instructions i1 to i6 with F, E, and C nodes and weighted edges; labeled edges: DL1 access, window edge]

Icost Breakdown (6 wide, 64-entry window)

              gcc      gzip     vortex
DL1           18.3 %   30.5 %   25.8 %
DL1+window    -4.2     -15.3    -24.5
DL1+bw        10.0     6.0      15.5
DL1+bmisp     -7.0     -3.4     -0.3
DL1+dmiss     -1.4     -0.4     -1.4
DL1+alu       -1.6     -8.2     -4.7
DL1+imiss     0.1      0.0      0.4
...           ...      ...      ...
Total         100.0    100.0    100.0

Vortex Breakdowns, enlarging the window

              64       128      256
DL1           25.8     8.9      3.9
DL1+window    -24.5    -7.7     -2.6
DL1+bw        15.5     16.7     13.2
DL1+bmisp     -0.3     -0.6     -0.8
DL1+dmiss     -1.4     -2.1     -2.8
DL1+alu       -4.7     -2.5     -0.4
DL1+imiss     0.4      0.5      0.3
...           ...      ...      ...
Total         100.0    80.8     75.0


Shotgun Profiling

Profiling goal

Goal:
• Construct graph

Constraint:
• Can only sample sparsely

[Figure: graph spanning many dynamic instructions, shown alongside a DNA strand; label: Genome sequencing]

“Shotgun” genome sequencing

[Figure, shown over several animation steps: DNA strand and sampled fragments]

Find overlaps among samples


Mapping “shotgun” to our situation

[Figure: many dynamic instructions, each marked with an event type: Icache miss, Dcache miss, Branch misp., No event]

Profiler hardware requirements

[Figure: sampled fragments compared against each other; label: Match!]


Offline Profiler Algorithm

[Figure: long sample and detailed samples]


Design issues

Identify microexecution context
• Choosing signature bits
• Determining PCs (for better detailed sample matching)

[Figure: long sample beginning at a Start PC, followed by PCs 12, 16, 20, 24, 56, 60, . . .; at a branch, encode the taken/not-taken bit in the signature]
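The idea of recovering PCs from a start PC plus taken/not-taken bits can be sketched as follows. This is a minimal illustration, not the profiler's actual algorithm: it assumes fixed 4-byte instructions and a hypothetical `branch_targets` map (as might be read from the binary), and it ignores indirect branches.

```python
INSTR_BYTES = 4  # assumed fixed instruction width


def reconstruct_pcs(start_pc, taken_bits, branch_targets):
    """Rebuild the PC of every instruction in a long sample.

    `taken_bits` holds one taken/not-taken bit per instruction (from the signature);
    `branch_targets` maps a branch PC to its taken target (hypothetical input).
    """
    pcs = []
    pc = start_pc
    for taken in taken_bits:
        pcs.append(pc)
        if taken and pc in branch_targets:
            pc = branch_targets[pc]  # follow the taken branch
        else:
            pc += INSTR_BYTES        # fall through to the next instruction
    return pcs


# Illustrative values: the branch at PC 24 is taken to PC 56.
pcs = reconstruct_pcs(12, [0, 0, 0, 1, 0, 0], {24: 56})
print(pcs)
```

With the PCs known, each position in the long sample can be matched only against detailed samples taken at the same static instruction.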

Sources of error

Error Source                           Gcc      Parser   Twolf
Building graph fragments               5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Total                                  12.2 %   14.0 %   8.9 %


Icost vs. Sensitivity Study


Compare Icost and Sensitivity Study

Corollary to DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.

[Figure: dependence graph for instructions i1 to i6 with F, E, and C nodes and weighted edges; labeled edge: DL1 access]


Compare Icost and Sensitivity Study

[Figure: speedup (y-axis, 0 to 25) versus ROB size (64, 128, 192, 256), one curve per DL1 latency (1, 2, 3, 4, 5, 10)]


Compare Icost and Sensitivity Study

Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves

Interaction Cost Advantages
• Easy (automatic) interpretation
  • Sign and magnitude have well-defined meanings
• Concise communication
  • DL1 and ROB interact serially


Outline

• Definition (ISCA ’01)

• what does it mean for an event to be critical?

• Detection (ISCA ’01)

• how can we determine what events are critical?

• Interpretation (MICRO ’04, TACO ’04)

• what does it mean for two events to interact?

• Application (ISCA ’01-’02, TACO ’04)

• how can we exploit criticality in hardware?


Our solution: measure interactions

Two parallel cache misses (each 100 cycles)

[Figure: miss #1 (100) and miss #2 (100) in parallel]

Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100

Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction

icost = aggregate cost – sum of individual costs
      = 100 – 0 – 0 = 100
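The arithmetic above can be reproduced with a small sketch. It assumes that the cost of a set of events is the execution time saved when those events are idealized to zero latency; the helper names (`cost`, `icost`, `parallel_misses`) and the one-line execution-time model are illustrative assumptions, not code from the talk.

```python
def cost(exec_time, latencies, events):
    """Cycles saved by idealizing (zeroing the latency of) the given events."""
    idealized = {name: 0 if name in events else lat
                 for name, lat in latencies.items()}
    return exec_time(latencies) - exec_time(idealized)


def icost(exec_time, latencies, a, b):
    """Interaction cost: aggregate cost minus the sum of individual costs."""
    return (cost(exec_time, latencies, {a, b})
            - cost(exec_time, latencies, {a})
            - cost(exec_time, latencies, {b}))


def parallel_misses(lat):
    # Two independent misses overlap, so the longer one sets the execution time.
    return max(lat["miss1"], lat["miss2"])


latencies = {"miss1": 100, "miss2": 100}
print(cost(parallel_misses, latencies, {"miss1"}))             # individual cost
print(cost(parallel_misses, latencies, {"miss1", "miss2"}))    # aggregate cost
print(icost(parallel_misses, latencies, "miss1", "miss2"))     # interaction cost
```

Removing either miss alone saves nothing because the other still takes 100 cycles, so the entire saving shows up only in the aggregate, giving a positive icost.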


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost: parallel interaction

[Figure: miss #1 and miss #2 in parallel]

2. Zero icost: ?


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost: parallel interaction

[Figure: miss #1 and miss #2 in parallel]

2. Zero icost: independent

[Figure: miss #1 . . . miss #2]

3. Negative icost: ?


Negative icost

Two serial cache misses (data dependent)

[Figure: miss #1 (100) followed by miss #2 (100), alongside ALU latency (110 cycles)]

Cost(miss #1) = ?


Negative icost

Two serial cache misses (data dependent)

[Figure: miss #1 (100) followed by miss #2 (100), alongside ALU latency (110 cycles)]

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90

icost = aggregate cost – sum of individual costs
      = 90 – 90 – 90 = -90

Negative icost: serial interaction
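The serial case can be checked the same way. The sketch below assumes a simple model in which the two dependent misses run back to back while an independent 110-cycle ALU chain runs alongside them; the `exec_time` function and variable names are illustrative.

```python
def exec_time(miss1, miss2, alu):
    # Two data-dependent misses in series, overlapped with an independent ALU chain.
    return max(miss1 + miss2, alu)


base = exec_time(100, 100, 110)            # baseline execution time
cost_1 = base - exec_time(0, 100, 110)     # cost of miss #1 alone
cost_2 = base - exec_time(100, 0, 110)     # cost of miss #2 alone
cost_both = base - exec_time(0, 0, 110)    # aggregate cost of both misses
icost = cost_both - cost_1 - cost_2        # negative: serial interaction

# Once miss #1 is removed, also removing miss #2 gains nothing more,
# because the ALU chain has become the critical path.
extra_gain = exec_time(0, 100, 110) - exec_time(0, 0, 110)

print(cost_1, cost_2, cost_both, icost, extra_gain)
```

Each miss alone appears to be worth 90 cycles, yet removing both is still worth only 90, which is what the negative icost captures.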


Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost: parallel interaction

[Figure: miss #1 and miss #2 in parallel]

2. Zero icost: independent

[Figure: miss #1 . . . miss #2]

3. Negative icost: serial interaction

[Figure: miss #1 followed by miss #2, alongside ALU latency; further labels: Branch mispredict, Fetch BW, Load-Replay Trap, LSQ stall]


Why care about serial interactions?

[Figure: miss #1 (100) followed by miss #2 (100), alongside ALU latency (110 cycles)]

Reason #1: We are over-optimizing!
Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)

Reason #2: We have a choice of what to optimize
Prefetching miss #2 has the same effect as prefetching miss #1



Criticality Analyzer (ISCA ’01)

Goal

• Detect criticality of dynamic instructions

Procedure

1. Observe last-arriving edges
   • uses simple rules

2. Propagate a token forward along last-arriving edges
   • at worst, a read-modify-write sequence to a small array

3. If token dies, non-critical; otherwise, critical


Slack Analyzer (ISCA ’02)

Goal

• Detect likely slack of static instructions

Procedure

1. Delay the instruction by n cycles
2. Check if critical (via critical-path analyzer)
   • No: instruction has n cycles of slack
   • Yes: instruction does not have n cycles of slack
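The delay-and-check procedure can be illustrated on a toy dependence graph. In this sketch an exact longest-path computation stands in for the critical-path analyzer, which is an assumption made for simplicity (the hardware uses token passing instead); the instruction names, latencies, and helper functions are hypothetical.

```python
def finish_times(latency, deps):
    """Earliest finish time per instruction.

    `latency` must list instructions in program (topological) order;
    `deps` maps an instruction to the producers it waits for.
    """
    done = {}
    for name, lat in latency.items():
        start = max((done[p] for p in deps.get(name, [])), default=0)
        done[name] = start + lat
    return done


def has_slack(latency, deps, name, n):
    """Delay `name` by n cycles and check whether total execution time is unchanged."""
    base = max(finish_times(latency, deps).values())
    delayed = dict(latency)
    delayed[name] += n
    return max(finish_times(delayed, deps).values()) == base


# Hypothetical fragment: the load feeds the add; the store needs both add and mul.
latency = {"load": 100, "add": 1, "mul": 3, "store": 1}
deps = {"add": ["load"], "store": ["add", "mul"]}

print(has_slack(latency, deps, "mul", 98))  # mul can absorb a 98-cycle delay
print(has_slack(latency, deps, "mul", 99))  # one more cycle makes it critical
print(has_slack(latency, deps, "add", 1))   # add is already on the critical path
```

Repeating the check for increasing n gives an estimate of how many cycles of slack an instruction has.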


Shotgun Profiling (TACO ’04)

Goal

• Create representative graph fragments

Procedure

• Enhance ProfileMe counters with context

• Use context to piece together counter samples
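A rough sketch of how context might be used to piece samples together: each detailed sample carries the signature bits of its neighbors, and a position in a long sample is filled by a detailed sample with the same PC and the same surrounding bits. The data layout, the `radius` parameter, and the function names are assumptions for illustration and do not describe the actual counter format.

```python
def context_at(signature, i, radius):
    """Signature bits of instruction i and its neighbors within `radius`."""
    lo = max(0, i - radius)
    return tuple(signature[lo:i + radius + 1])


def build_fragment(pcs, signature, detailed_samples, radius=1):
    """Fill each position of a long sample with a matching detailed sample, if any."""
    index = {(d["pc"], d["context"]): d for d in detailed_samples}
    fragment = []
    for i, pc in enumerate(pcs):
        key = (pc, context_at(signature, i, radius))
        fragment.append(index.get(key))  # None when no detailed sample matches
    return fragment


# Hypothetical long sample: three instructions, the middle one marked by a signature bit.
pcs = [12, 16, 20]
signature = [0, 1, 0]
detailed_samples = [
    {"pc": 12, "context": (0, 1), "exec_latency": 1},
    {"pc": 16, "context": (0, 1, 0), "exec_latency": 100},
]

fragment = build_fragment(pcs, signature, detailed_samples)
print(fragment)
```

Positions left as `None` are those for which no detailed sample with a matching context was collected; they contribute to the sampling error.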