dongyoon lee † , mahmoud said, satish narayanasamy † , and zijiang james yang

- 1 -

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, and Zijiang James Yang*

University of Michigan, Ann Arbor †

Western Michigan University *

Offline Symbolic Analysis toInfer Total Store Order

- 2 -

Deterministic Replay

What is deterministic replay?• Record and reproduce non-deterministic events• Program input (interrupt, I/O, DMA, etc.)• Shared-memory dependencies

Deterministic replay uses1) Debugging • Reproducing concurrency bugs• Time-travel debugging

2) Heavyweight dynamic analysis3) Forensics

- 3 -

Traditional Deterministic Replay Systems

Write

ReadRead

Log shared memory dependencies

Checkpoint Memory and Register State

Log non-deterministic program input Interrupts, I/O values, DMA, etc.

Thread 1 Thread 2 Thread 3

- 4 -

Recording Shared-Memory Dependencies

Problem• Need to monitor every memory operation

Software-based replay systems• PinSEL (Intel), iDNA (Microsoft)

• ODR (Berkeley), PRES (UCSD)

• Respec/DoublePlay (Michigan)

Hardware-based replay systems• RTR/ReRun (Wisconsin)

• Strata (UCSD)

• DeLorean (UIUC)

• LReplay (CAS)

• Timetraveler (Purdue)

→ 10-100x→ Replay is not guaranteed→ 2x throughput overhead

→ Complex because precise recorder

→ Only few systems support relaxed consistency model

: Total Store Order

- 5 -

Hardware Complexity in Previous Solutions

Support for precisely logging shared-memory dependencies• Monitor and log cache coherence messages

Support for Total Store Order (TSO) model• Detect sequential consistency (SC) violation • Log SC-violating loads

Complexity• Require invasive changes to coherence sub-system• Complex to design and verify • 9 design bugs in coherence mechanism of AMD64

[Narayanasamy et al. ICCD’06]

RTR [Xu et al. ASPLOS’06]

- 6 -

Our Approach

Complexity-effective solution [MICRO’09]

• Do NOT record shared-memory dependencies at all• Infer shared-memory dependencies offline• Using Satisfiability Modulo Theory (SMT) solver

Contribution• Find a causal order compliant to Total Store Order• Bound search space under TSO

- 7 -

System Overview

Write

ReadRead

Log shared memory dependency

Checkpoint Memory and Registers

Log non-deterministic program inputInterrupts, I/O values, DMA, etc.

BugNet [ISCA’05]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs thread interleaving offline

Checkpoint Registers

Thread 1 Thread 2 Thread 3

- 8 -

Roadmap

• Motivation• Background: Load-based logging architecture [ISCA’05]

Write Read

Read

BugNet [ISCA’05]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline


T1 T2 T3

• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion

- 9 -

Insight• Recording initial register state and values of loads is sufficient for

deterministic replay• Implicitly captures the program input from I/O, DMA, interrupts, etc.• Input and output of other instructions are reproduced during replay

Optimization• Record a load only if it is the first access to a memory location

Processor support• Recording data fetched on a cache miss captures first loads• Any first access to a location would result in a cache miss• May unnecessarily record data due to store misses, but that is OK

Load-based Logging Architecture [Narayanasamy et al, ISCA’05][Lee et al. MICRO’09]

- 10 -

Recording Cache Miss Data (First Loads)

ExecutionTime

Log file

First Load

Checkpoint• Register Values• Program Counter

Load A = 0

Load A = 0

<cnt1, 0>

Load B = 5 <cnt2, 5>

Store C = 1

On a store miss • Record old value – data before store update • New value – data after store update – can be reproduced deterministically

Cache Miss

Checkpoint

Record cache misses• <Memory count , Data>• Implicitly capture first loads

<cnt3, 0>

Deterministic Replay• Input and output (including address) of all instructions are replayed

- 11 -

Cache Miss Logging for Multithreaded Programs

Insight• Load-based recorder (initial register state + loads) for each thread is

sufficient for replaying that thread=> Recording cache miss data is sufficient for multithreaded programs=> No additional hardware support required for recording dependencies

Reason • Load dependent on a remote write will cause a cache miss to ensure

coherence=> Implicitly records load values dependent on remote writes

Effect• Can replay each thread in isolation (independent of other threads)

regardless of underlying consistency model

- 12 -

Replaying Each Thread Independently

Proc 1 Proc 2

Load A=0

Load A=0

Load A=

Store A=1

Invalidation

Cache Coherence• Invalidates cache block to gain exclusive permission

Log cache miss data• Implicitly records loads dependent on remote writes• No change to coherence mechanism

(1st, 0)

(3rd, 1)

Proc 1 LOG

(1st, 0)

Proc 2 LOG

Cache Miss

Cache BlockInvalidated

1Replay each threadindependent of others

- 13 -

Shared Memory Dependency

Thread 1 Thread 2

Store

Load

Store

Load

Store

Store

Store

SMT Solver for finding shared memory dependencies

Billions of instructions• Offline analysis may not scale

Final State : A, B, C

Strata for bounding search space

A

B

C

A

A

B

C?Old value x New value

Address

Store AOld value

Address

New value

- 14 -

Roadmap

• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis

Write Read

Read

BugNet[ISCA’09]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline


T1 T2 T3

• Bounding Search Space• Evaluation• Conclusion

- 15 -

Offline Symbolic Analysis

Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….

Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

- 16 -



x1

x2

x 3

x 4

x 5

y

y1

2 y3



x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

`

- 17 -

Encoding Coherence Constraints

Proc 1 Proc 2

x1

x2

x3

x4

x5

xFinal

Coherence Constraints( M→old == M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND

Old value x New value Address

- 18 -

Multiple Memory Locations

x1

x2

x3

x 4

x 5

xFinal

Coherence Constraints( M→old == M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y2 < Y3 AND

:

y

y1

2 y3

yFinal

Proc 1 Proc 2

Old value x New value Address

- 19 -



x1

x2

x 3

x 4

x 5

y

y1

2 y3



x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

`

- 20 -

Encoding Sequential Consistency Constraints

Coherence Constraints( M→old== M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y1 < Y2 < Y3 AND

:

Memory Model Constraints(under Sequential Consistency)

P1 : Y1 < X1 < X2 < Y2 AND P2 : X3 < X4 < X5 < Y3 AND

x1

x2

x3

x 4

x 5

xFinal

y

y1

2 y3

yFinal

Proc 1 Proc 2

- 21 -

Alice = 1 Bob = 1

r1 = r2 = 0, r3 = r4 = 1

P1 P2

r1 = Bob r2 = Alice

Alice = Bob = 0

Alice = Bob = Charlie = 0P1 P2

Alice = 1 Bob = 1

r1 = r2 = 0, r3 = 1r1 = Bob r2 = Alice

Total Store Order Relaxations

Store→LoadAlice = 1 Bob = 1

r1 = Bob r2 = Alicer1 = r2 = 0

P1 P2Alice = Bob = 0

Load →Load

LoadSB-Hit→Load

r3 = CharlieCharlie = 1

r3 = Alice r4= Bob

Stores Store Buffer Hit

Log memory countsof LoadSB-Hit

SC

TSO

SC

TSO

SC

TSO

1 1

00

- 22 -

Encoding Total Store Order Constraints

x1

x2

x3

x 4

x 5

xFinal

y1

y2 y3yFinal

P1 : Y1 < X1 < X2 < Y2 P2 : X3 < X4 < X5 < Y3

AND X1 < Y2 AND X3 < Y3

Stores Store Buffer Hit

Store

Load

LoadSB-Hit

Load

Derived TSO schedule

x1

x2

x3

x 4

xFinal

1y

y Final

x 5

Encoded TSO constraints SMT Solver

y2 y3

- 23 -

Roadmap

• Motivation• BugNet• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion

- 24 -

Bounding Search Space using Strata

Proc 1 Proc 2

N msgs

N msgs

Final State

hint 1 hint 2

hint 3 hint 4

Strata Region 3

Strata Region 2

Strata Region 1

Final State

Initial State

Final State

Initial State

Final State

Strata• After every N broadcasted

coherence messages, each processor records “Strata Hints”

Strata regions are ordered

SMT Solver• Analyzes one Strata region at a time• Starts from the last Strata region • Initial state of a Strata region = Final state of preceding region

Coherencemessages

Stratum

Stratum

- 25 -

… remains in store-buffer

… executed, not committed yet

Strata Hint = Mem. Count + I-Store CountStrata hint• Every processor records• memory count• number of in-flight stores in the store buffer (called I-Store)

• Reconstruct Stratum offline based on the memory count • Move the pending stores and dependent loads to the next Strata region

{1,1} {1,1}Log{1,1}

Log{1,1}

< Recording Strata Hints >

Alice = 1 Bob = 1

r1 = Bob r2 = Alice

r1 = r2 = 0

P1 P2

r1 = Bob r2 = Alicer1 = r2 = 0

Alice = 1 Bob = 1P1 P2

I-Store Cnt.

Mem Cnt. Mem Cnt.

I-Store Cnt.Alice = Bob = 00

0

0

0

1

1 1

1

< Reconstructing Strata Offline>

- 26 -

Filtering Local, Read-only & Cache-hit Accesses

Load A

Store BLoad B

Store A

Load C

Load CLoad C Load C

Load C

Strata Region

Thread 1 Thread 2Local accesses• No shared-memory dependencies

Read-only accesses• Any order is valid

Intermediate cache-hit accesses• No interleaved accesses on successive

cache hits

Effectiveness• < 1% of memory accesses remain

Store DStore DLoad DStore D

… Miss… Hit… Hit… Hit

Store D

Load D

- 27 -

Roadmap

• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis• Evaluation• Conclusion

- 28 -

EvaluationSimulation Framework• Simics: simulate multi-processor execution (2, 4, 8,16 cores) • Modified FeS2 : simulate Total Store Order• Fast-forward up to known synchronization points• Simulated for 500 million instructions

Benchmarks• SPLASH2: barnes, fmm, ocean• PARSEC2.0: blackscholes, bodytrack, x264• SPEComp: wupwise, swim• Servers: apache, mysql

Offline Symbolic Analysis• Yices SMT solver [Dutertre and Moura CAV’06]

- 29 -

Program Input and LoadSB-Hit Log Size

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge0

100

200

300

400

500

600

Log

Size

(MB/

sec)

192MB/sec

On average, 192 MB/sec (8 threads, TSO)• Dominated by cache miss logging• 2.45% of load instructions read their values from the store buffer

Program Input (Data & Instr. Cache Misses) LoadSB-Hit

- 30 -

Strata Hint Log Size

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge1

10

100

1000

10000

Stra

ta L

og S

ize

(KB/

sec)

On average, 1.2MB/sec (8 threads, TSO, b-bound 10)• 15% increase over SC to log the number of pending stores (I-Store)

No hardware support for precise shared-memory dependency logging

1.2MB/sec

SC TSO (with I-Stores)

- 31 -

Offline Analysis Overhead

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge10

100

1000

10000

Offl

ine

anal

ysis

ove

rhea

d(s

ecs/

one

seco

nd o

f pro

g. e

xec.

)

On average, 260 seconds/sec (8 threads, TSO, b-bound 10)• Overall, 30% more efficient than SC

One time cost before replay

260secs

SC TSO

- 32 -

Performance Overhead etc.

Performance Overhead• Performance is mainly dominated by cache miss logging• Additional logging overhead for LoadSBH and I-Store is small

• On average, <1% slowdown in IPC

Paper includes more results on• Comparisons between different Strata bounding schemes• Effectiveness of cache hit filtering• Scalability results on different number of processors• …

- 33 -

Conclusion

Reproducing concurrent executions is a huge challenge

Our proposal: a complexity-effective hardware solution for TSO• Record cache miss data and store buffer hits• Record Stratum (memory count + store buffer count) to bound search

space• Determine shared memory dependencies using offline symbolic analysis

Result (8 threads, TSO)• Performance overhead: less than 1%• Total log size: 193 MB/sec (Program input + Strata hints)• Offline analysis: 260 seconds/sec (30% more efficient than SC)

- 34 -

Thank you

- 35 -

dongyoon lee † , mahmoud said*, satish narayanasamy † , and zijiang james yang*

Documents

dongyoon lee † , mahmoud said, satish narayanasamy † , and zijiang james yang