Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors Behnam Robatmili and Sibi Govindan University of Texas at Austin Doug Burger Microsoft Research Stephen W. Keckler Architecture Research Group, NVIDIA & University of Texas at Austin


Page 1: Title slide

Page 2: Motivation

Do we still care about single-thread execution?

Running each single thread faster and more power-efficiently by using multiple cores:

1. increases parallel-system efficiency
2. lessens the need for heterogeneity and its software complexity

Page 3: Summary

Distributed uniprocessors: multiple cores share resources to run a single thread across them

Scalable complexity, but cross-core delay overheads

Which overheads limit performance scalability? Registers, memory, fetch, branches?

We measure critical cross-core delays using profile-based critical path analysis

We propose low-overhead distributed mechanisms to mitigate these bottlenecks

Page 4: Distributed Uniprocessors

• Partition the single-thread instruction stream across cores
• Distributed resources (RF, BP and L1) act like a large processor
• Inter-core instruction, data and control communication
• Goal: reduce these overheads

[Figure: fused cores, each with its own RF, BP and L1, linked by inter-core data and control communication; complexity grows linearly with core count.]

Page 5: Example Distributed Uniprocessors

                                 CoreFusion                             TFlex
  ISA                            x86                                    EDGE
  Instruction partitioning       Dynamic: centralized register          Static: compiler-generated
                                 management unit (RMU)                  predicated dataflow blocks
  Fetch and control dependences  Dynamic: centralized fetch             Dynamic: next-block prediction
                                 management unit (FMU)                  (no intra-block control flow)
  Cross-core instruction comm.   Dynamic: centralized RMU               Dynamic: distributed register
                                                                        RW queues
  Scalability                    4 two-wide cores                       8 two-wide cores

This study uses TFlex as the underlying distributed uniprocessor

Older designs: Multiscalar and TLS use a noncontiguous instruction window

Recent designs: CoreFusion, TFlex, WiDGET and Forwardflow

Page 6: TFlex Distributed Uniprocessor

[Figure: array of 32 TFlex cores (C) surrounded by L2 banks, partitioned among eight logical processors T0-T7.]

Maps one predicated dataflow block to each core
Blocks communicate through registers (via register home cores)
Example: B2 on C2 communicates to B3 on C3 through R1 on C1
Intra-block communication is all dataflow
32 physical cores; 8 logical processors (threads)

[Figure: four fused cores C0-C3, each with an L1 bank and a register bank (R0-R3), running blocks B0-B3; intra-block IQ-local communication, inter-block cross-core communication, and control dependences are shown.]

Page 7: Profile-based Critical Path Bottleneck Analysis

Using critical path analysis to quantify scalable resources and bottlenecks (SPEC INT)

Fetch bottleneck caused by mispredicted blocks
Register communication overhead
The network is one of the scalable resources

[Chart: critical-path breakdown into real work, network, and other resources.]
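The analysis walks a dependence graph of microarchitectural events and charges each cycle on the critical path to the resource that caused it. A toy sketch of that accounting (event names, edge labels and the data layout are illustrative assumptions, not the paper's actual tool):

```python
from collections import defaultdict

def critical_path_breakdown(events, edges):
    """Attribute critical-path cycles to microarchitectural resources.

    events: event ids in topological (program) order.
    edges:  (src, dst, latency, resource) dependence tuples.
    Returns (critical-path length, cycles charged per resource).
    """
    dist = {e: 0 for e in events}      # longest path ending at each event
    best_in = {}                       # predecessor edge on that path
    out = defaultdict(list)
    for src, dst, lat, res in edges:
        out[src].append((dst, lat, res))
    for e in events:                   # relax edges in topological order
        for dst, lat, res in out[e]:
            if dist[e] + lat > dist[dst]:
                dist[dst] = dist[e] + lat
                best_in[dst] = (e, lat, res)
    end = max(dist, key=dist.get)
    total = dist[end]
    breakdown = defaultdict(int)       # walk back, summing per resource
    node = end
    while node in best_in:
        node, lat, res = best_in[node]
        breakdown[res] += lat
    return total, dict(breakdown)
```

For example, a four-event graph with fetch, network, execute and commit edges yields the total path length plus how many of its cycles each resource contributed, which is exactly the kind of breakdown the chart above plots.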

Page 8: Distributed Criticality Analyzer

[Figure: coordinator-core components (criticality predictor, block reissue engine, and a block criticality status table with pred_input/pred_output bits, i_counter/o_counter, and an available-blocks bit pattern) feeding the executing core's pipeline: Fetch, Decode/Merge, Issue, Execute, RegWrite/Bypass, Commit. The coordinator also selects the core for running a fetch-critical block on reissue.]

• A statically selected coordinator core is assigned to each region of the code executing on a core
  – Each coordinator core holds and maintains criticality data for the regions assigned to it
  – Sends criticality data to the executing core when the region is fetched
  – Enables register bypassing, dynamic merging, block reissue, etc.
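The coordinator assignment and its per-region store can be sketched as below. The PC-based mapping function and the table layout are assumptions for illustration; the slides say only that the coordinator is statically selected:

```python
def coordinator_core(block_pc, num_cores):
    """Statically pick a coordinator core for a code region.

    Illustrative mapping: low-order bits of the (assumed 64-byte
    aligned) block PC choose the core.
    """
    return (block_pc >> 6) % num_cores

class CriticalityTable:
    """Per-coordinator store of criticality data for its regions."""
    def __init__(self):
        self.entries = {}                  # block PC -> criticality data

    def on_fetch(self, block_pc):
        # Sent to the executing core when the region is fetched;
        # enables register bypassing, dynamic merging, block reissue.
        return self.entries.get(block_pc)

    def train(self, block_pc, data):
        self.entries[block_pc] = data
```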

Page 9: Register Bypassing

[Figure: cores C0-C3 running blocks B0-B3 with register banks R0-R3; intra-block IQ-local and inter-block cross-core communication paths 1 and 2; coordination signals from C0.]

Sample execution: Block B2 communicates to B3 through register paths 1 and 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ and B3₁ (only path 2 is predicted)
Critical register values on the critical path are bypassed directly to the consumer

Page 10: Optimization Mechanisms

• Output criticality: register bypassing (explained on the previous slide; saves delay)
• Input criticality: dynamic merging, a decode-time dependence-height reduction for critical input chains (saves delay)
• Fetch criticality: block reissue, which reissues critical instructions after pipeline flushes (saves energy and delay by reducing fetches by about 40%)

Page 11: Aggregate Performance

16-core individual and aggregate results

[Chart: speedup (0.95-1.40) per benchmark for SPEC FP (wupwise, swim, mgrid, mesa, art, equake, ammp, apsi) and SPEC INT (gzip, vpr, mcf, crafty, parser, perlbmk, bzip2, twolf), plus fp/int/overall averages, for each optimization mechanism: bypass, merge, breissue, and aggregate.]

Page 12: Final Critical Path Analysis

[Chart: SPEC INT critical-path breakdown for 1-core base, 8-core base/optimized, and 16-core base/optimized configurations; the optimized runs show an improved distribution, with less critical-path time on the network.]

Page 13: Performance Scalability Results

[Charts: speedup over a single dual-issue core vs. number of cores (1, 2, 4, 8, 16) for SPEC INT and SPEC FP, adding one mechanism at a time: baseline, bypass, bypass_merge, bypass_merge_breissue, each against the Pollack's-rule curve.]

16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores
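Pollack's rule says single-thread performance grows roughly with the square root of the resources spent on the processor, so the reference curve in these plots is simply the square root of the core count:

```python
import math

def pollack_speedup(num_cores):
    """Reference curve: performance ~ sqrt(aggregate resources)."""
    return math.sqrt(num_cores)

# Ideal Pollack curve for the core counts used on the slide.
curve = {n: pollack_speedup(n) for n in (1, 2, 4, 8, 16)}
```

Tracking this curve up to 8 cores means the fused cores deliver about as much single-thread speedup as a monolithic core of the same aggregate size would be expected to.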

Page 14: Energy-Delay-Squared Product (ED²)

8-core INT: 50% increase in ED²

The most energy-efficient configuration changes from 4 to 8 cores

65 nm, 1.0 V, 1 GHz
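ED² weights delay more heavily than energy, which is why a configuration that runs faster at modestly higher energy can still come out ahead on this metric. A minimal sketch (the comparison numbers are hypothetical, not the paper's measurements):

```python
def ed2(energy, delay):
    """Energy-delay-squared product: ED^2 = E * D^2 (lower is better)."""
    return energy * delay ** 2

# Hypothetical: 20% faster at 10% more energy still improves ED^2,
# because the squared delay term dominates.
base = ed2(1.0, 1.0)
fast = ed2(1.1, 1.0 / 1.2)
```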

Page 15: Conclusions and Future Work

• Goal: a power/performance-scalable distributed uniprocessor
• This work addressed several key performance scalability limitations
• Next steps (toward 4x speedup on SPEC INT):

  Overhead                                  How to address                                  Status
  Low-accuracy next-block prediction        OGEHL-based integrated branch and predicate     submitted
                                            predictor (IPP)
  Branches converted to predicates          OGEHL-based integrated branch and predicate     submitted
                                            predictor (IPP)
  Dataflow fanout delay and power overhead  Low-power compiler-exposed operand broadcasts   submitted
                                            (EOBs)
  I-cache utilization                       Variable block sizes                            MSR E2

Page 16: Questions?

Page 17: Backup Slides

• Setup and Benchmarks
• CPA Example
• Single-Core IPCs
• Communication Criticality Example
• Fetch Criticality Example
• Full Performance Results
• Criticality Predictor
• Motivation

Page 18: Backup Slides

Page 19: Summary

Do we still care about single-thread execution?

Running each single thread effectively across multiple cores significantly increases parallel-system efficiency and lessens the need for heterogeneity and its software complexity

Distributed uniprocessors: multiple cores can share their resources to run a thread across them

Scalable complexity, but cross-core delay overheads

What are the overheads that limit performance scalability? Registers, memory, fetch, branches?

We measure critical cross-core delays using static critical path analysis and find ways to hide them

Major detected bottlenecks: cross-core register communication and fetches after flushes

We propose low-overhead distributed mechanisms to mitigate these bottlenecks

Page 20: Motivation

• Need for scaling single-thread performance/power in multicore
  – Amdahl's law
  – Optimized power/performance for each thread
• Distributed uniprocessors
  – Running single-thread code across distributed cores
  – Sharing resources, but also partitioning overhead
• Focus of this work
  – Static critical path analysis to quantify bottlenecks
  – Dynamic hardware to reduce critical cross-core latencies

Page 21: Distributed Uniprocessors

• Partition single-thread instruction stream across cores

• Distributed resources (RF, BP and L1) act like a large processor

[Figure: fused cores, each with its own RF, BP and L1.]

Page 22: Exploiting Communication Criticality

[Figure: four fused cores running blocks B0-B3 with register banks R0-R3; intra-block IQ-local and inter-block cross-core communication, including a dataflow fanout tree.]

Sample execution: Block B0 communicates to B1 through B2
Predicting critical instructions in blocks B0 and B1; forwarding critical register values
Replacing the fanout tree for a critical input with broadcast messages

Page 23: Dynamic Merging Results

cfactor: number of predicted late inputs per block
full merge: running the algorithm on all register inputs
16-core runs

[Chart: speedup over no merging (1.00-1.20) for merge cfactor 1, 2, 3, and full merge.]

A cfactor of 1 achieves 65% of the maximum benefit

Page 24: Block Reissue Results

Block hit rates (affected by dependence prediction), 16-core runs:

  IQ size    1x    2x    4x    8x
  Hit rate   46%   57%   65%   71%

[Chart: speedup over no block reissue (0.96-1.20) per benchmark for 1x, 2x, 4x and 8x instruction-queue sizes.]

Page 25: Critical Path Bottleneck Analysis

Using critical path analysis to quantify scalable resources and bottlenecks (SPEC INT)

Fetch bottleneck caused by mispredicted blocks
Register communication overhead
The network is one of the scalable resources

Page 26: Performance Scalability Results

[Charts: speedup over a single dual-issue core vs. number of cores (1, 2, 4, 8, 16) for SPEC INT and SPEC FP: baseline, bypass, bypass_merge, bypass_merge_breissue.]

16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores

Page 27: Block Reissue

• Each core maintains a table of available blocks and the status of their cores
• Done by extending the allocate/commit protocols
• Policies
  – Block lookup: previously executed copies of the predicted block are spotted
  – Block replacement: refetch if the predicted block is not spotted in any core
• Major power saving on fetch/decode
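The two policies above can be sketched as a small lookup table. The structure and method names are illustrative assumptions, not the actual hardware protocol:

```python
class BlockReissueTable:
    """Sketch of a per-core table of available (already decoded) blocks."""
    def __init__(self):
        self.available = {}                # block PC -> core holding a copy

    def on_commit(self, block_pc, core):
        # Commit protocol extension: the block stays decoded in its core.
        self.available[block_pc] = core

    def on_evict(self, block_pc):
        # Allocation protocol extension: a new block overwrote this copy.
        self.available.pop(block_pc, None)

    def next_block(self, predicted_pc):
        """Block lookup: reissue a spotted copy, else refetch."""
        core = self.available.get(predicted_pc)
        if core is not None:
            return ("reissue", core)       # skips fetch and decode
        return ("refetch", None)
```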

Page 28: TFlex Cores

[Figure: TFlex core array with L2 banks; 1-cycle latency per hop.]

• Each core has (shared when fused)
  – a 1-ported cache bank (LSQ) and a 1-ported register bank (RWQ)
  – a 128-entry RAM-based IQ and a branch prediction table
• When fused
  – registers, memory locations and BP tables are striped across cores

[Figure: single-core block diagram: instruction queue, register file, L1 cache, RWQ, LSQ, BPred. Courtesy of Katie Coons for the figure.]

Page 29: TFlex Cores

[Figure: TFlex core array with L2 banks; 1-cycle latency per hop.]

• Each core has the minimum resources for one block
  – a 1-ported cache bank and a 1-ported register bank (128 registers)
  – a 128-entry RAM-based IQ and a branch prediction table
  – the RWQ and LSQ hold the transient architectural state during execution and commit it at commit time
  – the LSQ supports memory dependence prediction

[Figure: single-core block diagram: instruction queue, register file, L1 cache, RWQ, LSQ, BPred. Courtesy of Katie Coons for the figure.]

Page 30: Critical Output Bypassing

• Bypass late outputs to their destination instructions directly
  – Similar to memory bypassing and cloaking [Sohi '99], but no speculation needed
  – Uses predicted late outputs
  – Restricted to subsequent blocks

[Chart: percentage of inter-core register data transfers per benchmark (0-100%), with fp/int/overall averages.]

Page 31: Simulation Setup

  Parameter         Setup
  iCache            Partitioned, 8KB (1-cycle hit)
  Branch predictor  Local/gshare tournament predictor (8K+256 bits, 3-cycle latency)
  Single core       Out-of-order, RAM-structured 128-entry issue window; dual-issue (up to two INT and one FP) or single-issue
  L1 cache          Partitioned, 8KB (2-cycle hit, 2-way set-associative, 1 read port and 1 write port); 44-entry LSQ banks
  L2 and memory     S-NUCA L2 cache; L2-hit latency varies from 5 to 27 cycles; average main-memory latency is 150 cycles

  Benchmark type    Names
  8 SPEC FP         wupwise, swim, mgrid, mesa, art, equake, ammp, apsi
  8 SPEC INT        gzip, vpr, mcf, crafty, parser, perlbmk, bzip2, twolf

Page 32: Predicting Critical Instructions

• State-of-the-art predictor [Fields '01]
  – High communication and power overheads
  – Large storage overhead
  – Complex token-passing hardware
• Too complicated to be ported to a dynamic CMP
• Need a simple, low-overhead, yet efficient predictor

Page 33: Proposed Mechanisms

• Cross-core register communication → register forwarding
• Dataflow software fanout trees → dynamic instruction merging
• Expensive refill after pipeline flushes → block reissue
• Fixed block sizes, poor next-block prediction accuracy, and predicates not being predicted: left to future work (see Page 15)

Page 34: Critical Path Analysis

• Processes the program dependence graph [Bodik '01]
  – Nodes: microarchitectural events
  – Edges: data and microarchitectural dependences
  – Measures the contribution of each microarchitectural resource
• More effective than simulation- or profile-based techniques
• Built on top of [Nagarajan '06]

[Figure: simulator feeding an event interface that drives the critical path analysis tool.]

Page 35: Block Reissue Hit Rates

[Chart: block reissue hit rate (0-100%) per benchmark for 1x, 2x, 4x and 8x instruction-queue sizes, with fp/int/overall averages.]

Page 36: IPC of a Single Dual-Issue TFlex Core

• SPEC INT: IPC = 0.8
• SPEC FP: IPC = 0.9

Page 37: Speculation Aware

[Chart: SPEC INT, cfactor = 1.]

Page 38: Critical Path Analysis

• Critical path: the longest dependence path during program execution
  – Determines execution time
• Critical path analysis [Bodik '01]
  – Measures the contribution of each microarchitectural resource to critical cycles
• Built on top of the TRIPS CPA [Nagarajan '06]

Page 39: Exploiting Fetch Criticality

[Figure: cores C0-C3 with coordination signals; a CFG of blocks B0 and B1; fetched, refetched, and reissued blocks along the cross-core block control order.]

Predicted fetched blocks: B0, B1, B0, B0
Actual block order: B0, B0, B0, B0
Without block reissue, all three blocks after the misprediction are flushed
With block reissue, the coordinator core (C0) detects the B0 instances on C2-C3 and reissues them
50% reduction in fetch and decode operations

Page 40: Full Performance Comparison

Page 41: Full Energy Comparison

Page 42: Communication Criticality Predictor

• Block-atomic execution: late inputs and outputs are critical
  – The last outputs/inputs departing/arriving before block commit
• 70% and 50% of late inputs/outputs are critical for SPEC INT and FP
• Extends the next-block predictor protocol
  – MJRTY algorithm [Moore '82] to predict/train
  – Increment/decrement a confidence counter upon correct/incorrect prediction of the current majority
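The MJRTY scheme above keeps one candidate and a confidence counter per predicted item: agreement raises confidence, disagreement lowers it, and a zero counter lets a new candidate take over. A minimal sketch of that counter discipline (the class shape is illustrative; the hardware tracks this per block entry):

```python
class MajorityPredictor:
    """MJRTY-style predictor [Moore '82]: one candidate value plus a
    confidence counter, updated as described on the slide."""
    def __init__(self):
        self.candidate = None
        self.count = 0

    def predict(self):
        return self.candidate

    def train(self, observed):
        if self.count == 0:
            self.candidate, self.count = observed, 1   # adopt newcomer
        elif observed == self.candidate:
            self.count += 1                            # correct: gain confidence
        else:
            self.count -= 1                            # incorrect: lose confidence
```

If a value is the true majority of the training stream, it is guaranteed to end up as the candidate, which is what makes the single counter sufficient.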

Page 43: Exploiting Communication Criticality

• Selective register forwarding
  – Critical register outputs are forwarded directly to subsequent cores
  – Other outputs use the original indirect register forwarding through the RWQs
• Selective instruction merging
  – Specializes the decode of instructions dependent on a critical register input
  – Eliminates dataflow fanout moves in address computation networks
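The forwarding decision reduces to a simple predicate on the predicted-critical set: critical outputs take the direct path, everything else takes the default RWQ path on the register's home core. A hedged sketch (function and label names are illustrative, not the actual hardware interface):

```python
def route_register_output(reg, predicted_critical, consumer_core):
    """Selective register forwarding: direct path for predicted-critical
    outputs, default register write queue (RWQ) path otherwise."""
    if reg in predicted_critical:
        return ("forward_direct", consumer_core)   # skips the home core
    return ("via_rwq", "home_core")                # original indirect path
```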

Page 44: Exploiting Fetch Criticality

• Blocks after mispredictions are critical
• Many flushed blocks may be re-fetched right after a misprediction
• Blocks are predicated, so old blocks can be reissued if their cores are free
  – Each owner core keeps track of its blocks
  – Extended allocate/commit protocols
• Major power saving on fetch/decode

Page 45: Exploiting Communication Criticality

[Figure: cores C0-C3 running blocks B0-B3 with register banks R0-R3; intra-block IQ-local and inter-block cross-core communication paths 1 and 2, with coordination signals.]

Sample execution: Block B2 communicates to B3 through register paths 1 and 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ and B3₁ (only path 2 is predicted)
Critical register values on the critical path are fast-forwarded (register bypassing)

Page 46: Summary

Do we still care about single-thread execution?

Running each single thread effectively across multiple cores significantly increases parallel-system efficiency and lessens the need for heterogeneity and its software complexity

Distributed uniprocessors: multiple cores can share their resources to run a thread across them

Scalable complexity, but cross-core delay overheads

What are the overheads that limit performance scalability? Registers, memory, fetch, branches?

We measure critical cross-core delays using static critical path analysis and find ways to hide them

We propose low-overhead distributed mechanisms to mitigate these bottlenecks