Download - University of Illinois at Urbana-Champaign ...iacoma.cs.uiuc.edu/iacoma-papers/PRES/present_asplos14.pdf · Slide 14 Number of Reordered Accesses 0 0.025 0.05 0.075 0.1 0.125 0.15

RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors

Nima Honarmand and Josep Torrellas

University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu/


RnR: Record and Deterministic Replay

• Recreate execution of a (parallel) program

− Record: capture non-deterministic events in a log

− Replay: use the log to recreate the exact same execution

• Use cases: Debugging, Security, Fault Tolerance, …

Sources of non-determinism

• Program inputs

− Capture using operating system support

• Memory access interleaving

− Memory Race Recording (MRR)

− Important in multiprocessor systems

− Capture using hardware support

My Focus: Hardware-based memory race recording in relaxed-consistency multiprocessors


RelaxReplay Contributions

1) A hardware-based scheme for memory race recording in multiprocessors with relaxed memory models

2) Design independent of memory model details

− Only relies on conventional cache coherence

3) Compact log format

− Log size comparable to previous, less general, schemes

4) Log format enables efficient replay


Memory Race Recording (MRR) in Hardware

Observation: Shared-memory communications manifest as cache coherence transactions

− Can do MRR by monitoring transactions

Chunk-Based Recording

1) Dynamically, break processor’s execution into chunks of instructions

− Chunk: the work done between consecutive communications with other processors

2) Order chunks across processors such that

− Chunk containing source of a dependence is ordered before chunk containing destination

3) Replay chunks in recorded order to reproduce dependences

[Figure: a processor's timeline divided into chunks, with each chunk boundary marked by a communication (Comm.) with another processor]


Chunk-Based Recording Example

• # of Chunks << # of dependences

• Chunk content logged efficiently as <# of insts>

→ Compact log

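[Figure: chunk-based recording example. Core 1 executes LD A, LD B, LD B; Core 2 executes ST B, ST A. Two invalidations (inval) and one read request (Read req.) pass between the cores, and chunk boundaries are labeled TS=1, TS=2 and TS=3. Log entries shown on the slide: (TID=1, CNT=1, TS=3), (TID=2, CNT=2, TS=2), (TID=1, CNT=2, TS=1). The figure is labeled "Thread Interleaving Order".]

• Each processor keeps a Read Set and a Write Set

• Read and Write Sets are cleared after chunk creation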
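The example on this slide can be sketched as a toy recorder. This is only an illustration under stated assumptions, not the hardware design: the class and function names are hypothetical, timestamps come from a single global counter, every access is assumed to send a coherence request to the other core, and the Read and Write Sets track exact addresses rather than hardware signatures.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    tid: int
    read_set: set = field(default_factory=set)
    write_set: set = field(default_factory=set)
    count: int = 0  # instructions in the current chunk


class ChunkRecorder:
    """Toy model (assumption): global timestamp counter, exact address sets."""

    def __init__(self, num_cores):
        self.cores = {tid: Core(tid) for tid in range(1, num_cores + 1)}
        self.next_ts = 1
        self.log = []  # entries are (TS, TID, CNT)

    def _close_chunk(self, core):
        # Log the chunk as <# of insts>, then clear the Read and Write Sets.
        if core.count:
            self.log.append((self.next_ts, core.tid, core.count))
            self.next_ts += 1
        core.read_set.clear()
        core.write_set.clear()
        core.count = 0

    def access(self, tid, op, addr):
        # The coherence request reaches the other cores first; a core whose
        # sets conflict with it ends its chunk before the access performs.
        for other in self.cores.values():
            if other.tid == tid:
                continue
            if op == "ST":
                conflict = addr in other.read_set or addr in other.write_set
            else:  # "LD" conflicts only with earlier writes
                conflict = addr in other.write_set
            if conflict:
                self._close_chunk(other)
        core = self.cores[tid]
        (core.write_set if op == "ST" else core.read_set).add(addr)
        core.count += 1

    def finish(self):
        # Close whatever chunks are still open at the end of the run.
        for core in self.cores.values():
            self._close_chunk(core)
        return self.log


def record_slide_example():
    rec = ChunkRecorder(num_cores=2)
    for tid, op, addr in [(1, "LD", "A"), (1, "LD", "B"),
                          (2, "ST", "B"), (2, "ST", "A"),
                          (1, "LD", "B")]:
        rec.access(tid, op, addr)
    return rec.finish()
```

Under these assumptions, `record_slide_example()` produces three chunks for five memory accesses: Core 1 with CNT=2 at TS=1, Core 2 with CNT=2 at TS=2, and Core 1 with CNT=1 at TS=3.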


Problem: Access Re-ordering in Relaxed Models

• Chunk-based recording assumes in-program-order execution of memory accesses (Sequential Consistency)

• Real processors re-order memory accesses

− Recording “Number of instructions” is not enough to capture the work done between consecutive communications

• Previous work supports Total Store Ordering (TSO)

• Problem remains open for more aggressive relaxed models (ARM, IBM, …)

− How to efficiently record heavily re-ordered accesses?


Notion of Interval in RelaxReplay

Interval: Period of time between consecutive communications with other processors

− Replaces notion of chunk

Instructions in an interval are not in program order

How to compactly log content of an interval?

[Figure: timelines of Core 1, Core 2 and Core 3 divided into intervals, with boundaries labeled TS=1 through TS=6]


Key Idea: Putting Instructions Back in Program Order

• Record an alternative order of execution with equivalent behavior and the same set of inter-thread dependences

• Vast majority of instructions in program order → Compact log

• Some instructions cannot be put in program order → augment the log with additional information for correct replay

[Figure: two copies of a timeline showing inst1 through inst6 placed in Interval 1 and Interval 2]


Design: Decouple Instruction Execution and Logging

• Perform (P) Event: When instruction performs in memory

• Counting (C) Event: When instruction is added to the log

− At the head of Shadow ROB

− Inst should be performed

→ Happens in program order

[Figure: the ROB holds instructions that are not retired; the Shadow ROB holds instructions that are retired but not counted. Entries are marked Performed or Not performed, and the Shadow ROB head is indicated.]

• Deferred logging: instructions go through a logging step, in program order

− Detects instructions that can be put back in program order

• For each inst, try to move P to C

• Log an interval in terms of its Counting (C) events


Moving P Events to C Events

• Case 1 (very common): No conflicting coherence request between P and C

→ safe to move P to C

• We put P’s back in program order → in-order inst

• Log a sequence of such insts as <# of insts>

[Figure: timeline showing the P and C events of inst1, inst2 and inst3]


Moving P Events to C Events

• Case 2 (very rare): Conflicting request between P and C → can’t move P to C

− Reordered inst

− Special log entry

• Reordered loads

− Log the value returned by load

− During replay, read the value from the log (instead of memory)

• Reordered stores

− Log the value, address and ID of P interval

[Figure: a conflicting request arrives between an instruction's P event in Interval 10 and its C event in Interval 10+n, so P cannot be moved to C]


Detecting Conflicting Requests

• Naïve solution: Associatively search in-flight instructions

− High hardware overhead

• Basic Solution (BASE): Reordered if P and C in different intervals, otherwise in-order

• Optimized Solution (OPT): Snoop Table

− Table of counters

− On snooping a request, index its address into the table and increment the counter

→ Reordered if different snoop counts at P and C

[Figure: Snoop Table indexed by line address]
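The difference between BASE and OPT can be sketched in a few lines. This is a minimal illustration, not the hardware design: the table size, the 64-byte line size, the modulo index function and all names are assumptions, and the counters here are unbounded Python integers rather than fixed-width hardware counters.

```python
class SnoopTable:
    """Table of counters indexed by a hash of the line address (sizes assumed)."""

    def __init__(self, num_entries=64, line_bytes=64):
        self.num_entries = num_entries
        self.line_bytes = line_bytes
        self.counters = [0] * num_entries

    def _index(self, addr):
        # Drop the offset within the line, then fold into the table.
        return (addr // self.line_bytes) % self.num_entries

    def snoop(self, addr):
        # Called for every snooped coherence request.
        self.counters[self._index(addr)] += 1

    def count(self, addr):
        # Sampled once at the P event and once at the C event.
        return self.counters[self._index(addr)]


def is_reordered_base(interval_at_p, interval_at_c):
    # BASE: reordered whenever P and C fall in different intervals.
    return interval_at_p != interval_at_c


def is_reordered_opt(count_at_p, count_at_c):
    # OPT: reordered only if the instruction's table entry changed between P and C.
    return count_at_p != count_at_c


def demo():
    table = SnoopTable()
    addr = 0x1000
    count_at_p = table.count(addr)  # P event, say in interval 10
    table.snoop(0x2040)             # snoop to an unrelated line
    count_at_c = table.count(addr)  # C event, say in interval 15
    return is_reordered_base(10, 15), is_reordered_opt(count_at_p, count_at_c)
```

In this sketch `demo()` returns `(True, False)`: BASE treats the access as reordered because the interval changed, while OPT keeps it in-order because no snoop touched its table entry. Two lines that map to the same entry make OPT conservative (a false "reordered"), never unsafe.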


Evaluation Setup

• Processor: 8 cores, 4-way out-of-order, 2 GHz

• Memory subsystem: private L1, shared L2, Ring interconnect

• Memory model: Release Consistency

• Benchmarks: 10 SPLASH-2 programs

• Record:

− Max interval sizes: 4K insts and INF

− 4 configs: 4K-BASE, 4K-OPT, INF-BASE and INF-OPT


Number of Reordered Accesses

[Figure: bar chart of reordered accesses as % of memory instructions (y-axis 0 to 0.2) for 4K-BASE, 4K-OPT, INF-BASE and INF-OPT, split into Reordered Loads and Reordered Stores. Bar labels as extracted, in order: 1.6, 0.06, 0.03, 0.0005, 0.16, 0.003, 0.03, 0.0005.]

• Less than 0.04% of mem. insts re-ordered with OPT

− Regardless of Max Interval Size

→ Compact log


Log Generation Rate

• Small log size: 48 MB/sec (4K-OPT) and 25 MB/sec (INF-OPT) on 8 processors

− Comparable to (1-4x) previous techniques

− Very small compared to existing memory BW

[Figure: bar chart of log size in bits per 1K instructions (y-axis 0 to 50). Bar labels as extracted, in order: 360, 22, 42, 12.]


Also in the Paper…

• Can combine RelaxReplay with any existing chunk-based recording scheme

− In fact, I presented RelaxReplay on top of QuickRec [ISCA’13]

• Details of the replay algorithm

• More performance numbers

− Replay performance

− Scalability analysis (different numbers of processors)


Conclusions

• RelaxReplay: Memory race recording for relaxed memory models

• Design independent of memory model

− Supports Intel, ARM, IBM, Tile, etc.

• Can be combined with existing chunk-based recorders

• Compact log format that enables efficient replay

• In practice, log size comparable to the ideal case of no re-ordering


THANK YOU!


Replaying RelaxReplay Logs

Before replaying an interval, wait for its predecessors

while (interval has entries) {
    ent ← next entry of interval
    if (ent is Block) {
        execute next ent.size instructions
    } else if (ent is ReorderedLoad) {
        read the load value from the log
    } else if (ent is ReorderedStore) {
        read value and address from the log
        execute the store
    }
}

Signal the successors of interval
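The loop above can be made concrete with a small sketch. It is a toy model under explicit assumptions, not the replayer from the paper: the instruction encoding (`("LD", dst, addr)` and `("ST", src, addr)` tuples), the register and memory dictionaries, and all names are hypothetical, and waiting for predecessors and signaling successors is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Block:
    size: int  # number of in-order instructions


@dataclass
class ReorderedLoad:
    value: int  # value the load returned during recording


@dataclass
class ReorderedStore:
    addr: int
    value: int


def execute(inst, memory, regs):
    """Execute one instruction of the toy ISA against memory."""
    op = inst[0]
    if op == "LD":
        _, dst, addr = inst
        regs[dst] = memory.get(addr, 0)
    elif op == "ST":
        _, src, addr = inst
        memory[addr] = regs.get(src, 0)
    # Any other opcode is treated as a no-op in this toy model.


def replay_interval(entries, program, pc, memory, regs):
    """Replay one interval's log entries; returns the updated program counter."""
    for ent in entries:
        if isinstance(ent, Block):
            for inst in program[pc:pc + ent.size]:
                execute(inst, memory, regs)
            pc += ent.size
        elif isinstance(ent, ReorderedLoad):
            # Take the value from the log instead of reading memory.
            _, dst, _addr = program[pc]
            regs[dst] = ent.value
            pc += 1
        elif isinstance(ent, ReorderedStore):
            # Take value and address from the log, then perform the store.
            memory[ent.addr] = ent.value
            pc += 1
        else:
            raise TypeError(f"unknown log entry: {ent!r}")
    return pc
```

For example, replaying `[Block(2), ReorderedLoad(42)]` over a three-instruction program runs the first two instructions normally and gives the third (a load) the logged value 42, regardless of what memory holds at replay time.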


Replay Overhead

Sequential replay using OS to enforce interval ordering and emulate re-ordered instructions

[Figure: stacked bar chart of normalized execution time (y-axis 0 to 30), split into Application Cycles and Overhead Cycles. Bar labels as extracted: 26.2, 8.5, 8.6, 6.7.]

• Efficient replay: 46% (4K-OPT) and 18% (INF-OPT) overhead cycles


Example

[Figure: instructions i1, i2, LD, i4, i5, ST, i7, i8 with their P and C events on a timeline spanning Interval = 10 and Interval = 15; the LD and the ST are marked “ReorderedLoad ?” and “ReorderedStore ?”.]

Base Log for Interval 15:

Block(Size=2)

ReorderedLoad(Val)

Block(Size=2)

ReorderedStore(Addr,Val,Interval 10)

Block(Size=2)

Opt Log for Interval 15:

Block(Size=8)
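The two logs above differ only in how the LD and the ST are classified. A short sketch of how such a log could be assembled from per-instruction classifications follows; the tuple encoding, the function name and the concrete values used in the usage note are hypothetical, chosen only to mirror the entry types named on the slide.

```python
def build_interval_log(counted):
    """Turn per-instruction classifications (in counting order) into log entries.

    Each item is ("inorder",), ("reordered_load", value) or
    ("reordered_store", addr, value, p_interval).
    """
    log = []
    run = 0  # length of the current run of in-order instructions
    for item in counted:
        kind = item[0]
        if kind == "inorder":
            run += 1
            continue
        if run:
            # A reordered instruction ends the current block.
            log.append(("Block", run))
            run = 0
        if kind == "reordered_load":
            log.append(("ReorderedLoad", item[1]))
        elif kind == "reordered_store":
            log.append(("ReorderedStore",) + tuple(item[1:]))
        else:
            raise ValueError(f"unknown classification: {kind!r}")
    if run:
        log.append(("Block", run))
    return log
```

With the LD and ST classified as reordered (as under BASE), eight instructions yield Block(2), ReorderedLoad, Block(2), ReorderedStore, Block(2); with all eight classified as in-order (as under OPT), the result is a single Block of size 8.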
