RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors
Nima Honarmand and Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu/
1
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 2
RnR: Record and Deterministic Replay
• Recreate execution of a
(parallel) program
− Record: capture non-
deterministic events in a
log
− Replay: use the log to
recreate the exact same
execution
• Use cases: Debugging,
Security, Fault Tolerance, …
Sources of non-determinism
• Program inputs
− Capture using operating system support
• Memory access interleaving
− Memory Race Recording (MRR)
− Important in multiprocessor systems
− Capture using hardware support
My Focus: Hardware-based memory race
recording in relaxed-consistency multiprocessors
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 3
RelaxReplay Contributions
1) A hardware-based scheme for memory race recording in
multi-processors with relaxed memory models
2) Design independent of memory model details
− Only relies on conventional cache coherence
3) Compact log format
− Log size comparable to previous, less general, schemes
4) Log format enables efficient replay
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 4
Memory Race Recording (MRR) in Hardware
Observation: Shared-memory communications manifest as cache
coherence transactions
− Can do MRR by monitoring transactions
Chunk-Based Recording
1) Dynamically, break processor’s execution into chunks of
instructions
− Chunk: the work done between consecutive
communications with other processors
2) Order chunks across processors such that
− Chunk containing source of a dependence is ordered before
chunk containing destination
3) Replay chunks in recorded order to reproduce dependences
Chunk
Chunk
Comm.
Comm.
Comm.
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 5
Chunk-Based Recording Example
• # of Chunks << # of dependences
• Chunk content logged efficiently as <# of insts>
→ Compact log
Core 1
LD A
LD B
LD B
Core 2
ST B
ST A
TS=1
TS=2
TS=3
TID=1
CNT=1
TS=3
TID=2
CNT=2
TS=2
TID=1
CNT=2
TS=1
• Each processor keeps a Read Set and a Write Set
• Read and Write Sets are cleared after chunk creation
inval
inval
Read req.
Thread
Interleaving
Order
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 6
Problem: Access Re-ordering in Relaxed Models
• Chunk-based recording assumes in-program-order execution of
memory accesses (Sequential Consistency)
• Real processors re-order memory accesses
− Recording “Number of instructions” not enough to capture
work done between consecutive communications
• Previous work support Total Store Ordering (TSO)
• Problem open for more aggressive relaxed model (ARM, IBM, …)
− How to efficiently record heavily re-ordered accesses?
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 7
Notion of Interval in RelaxReplay
Instructions in an interval are not in program order
How to compactly log content of an interval?
Interval: Period of time between consecutive communications
with other processors
− Replaces notion of chunk
TS=2
TS=4
TS=6
TS=5
TS=1
TS=3
Core 1 Core 2 Core 3
Intervals
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 8
inst1
inst2
inst3
inst4
inst5
inst6
Interval 1 Interval 2
Key Idea: Putting Instructions Back in Program Order
Time
• Record an alternative order of execution with equivalent behavior and
same set of inter-thread dependences
• Vast majority of instructions in program order → Compact log
• Some instructions cannot be put in program order → augment the log
with additional information for correct play
inst1
inst2
inst3
inst4
inst5
inst6
Interval 1 Interval 2
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 9
Design: Decouple Instruction Execution and Logging
• Perform (P) Event: When
instruction performs in memory
• Counting (C) Event: When
instruction is added to the log
− At the head of Shadow ROB
− Inst should be performed
→ Happens in program order
. . .
. . .
Not Retired Retired but
Not Counted
ROB
Shadow ROB
head
Performed Not performed
• Deferred logging: instructions go through a logging step, in
program order
− Detects instructions that can be put back in program order
• For each inst, try to move P to C
• Log an interval in terms of its Counting (C) events
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 10
Moving P Events to C Events
• Case 1 (very common):
No conflicting coherence request between P and C
→ safe to move P to C
• We put P’s back in program order → in-order inst
• Log a sequence of such insts as <# of insts>
P C
Time
P P C
P
P C
P
inst1
inst2
inst3
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 11
Moving P Events to C Events
• Case 2 (very rare):
Conflicting request between P
and C → can’t move P to C
− Reordered inst
− Special log entry
• Reordered loads
− Log the value returned by load
− During replay, read the value from the log (instead of memory)
• Reordered store
− Log the value, address and ID of P interval
Conflicting Request
P C . . .
Interval 10 Interval 10+n
X
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 12
Detecting Conflicting Requests
• Naïve solution: Associatively search in-flight instructions
High hardware overhead
• Basic Solution (BASE): Reordered if P and C in different
intervals, otherwise in-order
• Optimized Solution (OPT): Snoop Table
− Table of counters
− When snoop a request, index its
address into the table and increment
the counter
→ Reordered if different snoop counts
at P and C
Line Address
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 13
Evaluation Setup
• Processor: 8 cores, 4-way out-of-order, 2 GHz
• Memory subsystem: private L1, shared L2, Ring interconnect
• Memory model: Release Consistency
• Benchmarks: 10 SPLASH-2 programs
• Record:
− Max interval sizes: 4K insts and INF
− 4 configs: 4K-BASE, 4K-OPT, INF-BASE and INF-OPT
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 14
Number of Reordered Accesses
0
0.025
0.05
0.075
0.1
0.125
0.15
0.175
0.2
4K-BASE 4K-OPT INF-BASE INF-OPT
% o
f M
em
ory
Insts
Reordered Loads Reordered Stores1.6
0.06
0.03
0.0005
0.16
0.003
0.03
0.0005
• Less than 0.04% of mem. insts re-ordered with OPT
− Regardless of Max Interval Size
→ Compact log
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 15
Log Generation Rate
• Small log size: 48 MB/sec (4K-OPT) and 25 MB/sec (INF-OPT)
on 8 processors
− Comparable to (1-4x) previous techniques
− Very small compared to existing memory BW
0
10
20
30
40
50
4K-BASE 4K-OPT INF-BASE INF-OPT
Bits p
er
1K
in
str
uctions
360
22
42
12
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 16
Also in the Paper…
• Can combine RelaxReplay with any existing chunk-based
recording scheme
− In fact, I presented RelaxReplay on top of
QuickRec_[ISCA’13]
• Details of the replay algorithm
• More performance numbers
− Replay performance
− Scalability analysis (different numbers of processors)
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 17
Conclusions
• RelaxReplay: Memory race recording for relaxed memory
models
• Design independent of memory model
− Supports Intel, ARM, IBM, Tile, etc.
• Can be combined with existing chunk-based recorders
• Compact log format that enables efficient replay
• In practice, log size comparable to the ideal case of no re-
ordering
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 18
THANK YOU!
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 19
Replaying RelaxReplay Logs
Before replaying an interval wait for its predecessors
While (interval has entries) {
ent ← next entry of interval
if (ent is Block) {
execute next ent.size instructions
} else if (ent is ReorderedLoad) {
Read the load value from the log
} else if (ent is ReorderedStore) {
Read value and address from the log
Execute the store
}
}
Signal the successors of interval
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 20
Replay Overhead
Sequential replay using OS to enforce interval ordering
and emulate re-ordered instructions
0
5
10
15
20
25
30
4K-BASE 4K-OPT INF-BASE INF-OPT
No
rma
lized
E
xecu
tio
n T
ime
Application Cycles Overhead Cycles
• Efficient replay: 46% (4K-OPT) and 18% (INF-OPT) overhead
cycles
26.2
8.5 8.6 6.7
ILLINOIS Nima Honarmand “RelaxReplay: Record and Replay for Relaxed…” Slide 21
Example
Interval = 10
i1
i2
LD
i4
i5
ST
i7
i8
... C
C
C
C
C
C
C
C
P
P
P
P
P
P
P
P
Interval = 15
ReorderedLoad ?
ReorderedStore ?
Base Log for Interval 15:
Block(Size=2)
ReorderedLoad(Val)
Block(Size=2)
ReorderedStore(Addr,Val,Interval 10)
Block(Size=2)
Opt Log for Interval 15:
Block(Size=8)