dongyoon lee † , mahmoud said*, satish narayanasamy † , and zijiang james yang*

35
- Dongyoon Lee , Mahmoud Said*, Satish Narayanasamy , and Zijiang James Yang* University of Michigan, Ann Arbor Western Michigan University * Offline Symbolic Analysis to Infer Total Store Order

Upload: reuben

Post on 22-Feb-2016

45 views

Category:

Documents


0 download

DESCRIPTION

Offline Symbolic Analysis to Infer Total Store Order. Dongyoon Lee † , Mahmoud Said*, Satish Narayanasamy † , and Zijiang James Yang* University of Michigan, Ann Arbor † Western Michigan University *. Deterministic Replay. What is deterministic replay? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 1 -

Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, and Zijiang James Yang*

University of Michigan, Ann Arbor †

Western Michigan University *

Offline Symbolic Analysis toInfer Total Store Order

Page 2: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 2 -

Deterministic Replay

What is deterministic replay?• Record and reproduce non-deterministic events• Program input (interrupt, I/O, DMA, etc.)• Shared-memory dependencies

Deterministic replay uses1) Debugging • Reproducing concurrency bugs• Time-travel debugging

2) Heavyweight dynamic analysis3) Forensics

Page 3: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 3 -

Traditional Deterministic Replay Systems

Write

ReadRead

Log shared memory dependencies

Checkpoint Memory and Register State

Log non-deterministic program input Interrupts, I/O values, DMA, etc.

Thread 1 Thread 2 Thread 3

Page 4: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 4 -

Recording Shared-Memory Dependencies

Problem• Need to monitor every memory operation

Software-based replay systems• PinSEL (Intel), iDNA (Microsoft)

• ODR (Berkeley), PRES (UCSD)

• Respec/DoublePlay (Michigan)

Hardware-based replay systems• RTR/ReRun (Wisconsin)

• Strata (UCSD)

• DeLorean (UIUC)

• LReplay (CAS)

• Timetraveler (Purdue)

→ 10-100x→ Replay is not guaranteed→ 2x throughput overhead

→ Complex because precise recorder

→ Only few systems support relaxed consistency model

: Total Store Order

Page 5: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 5 -

Hardware Complexity in Previous Solutions

Support for precisely logging shared-memory dependencies• Monitor and log cache coherence messages

Support for Total Store Order (TSO) model• Detect sequential consistency (SC) violation • Log SC-violating loads

Complexity• Require invasive changes to coherence sub-system• Complex to design and verify • 9 design bugs in coherence mechanism of AMD64

[Narayanasamy et al. ICCD’06]

RTR [Xu et al. ASPLOS’06]

Page 6: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 6 -

Our Approach

Complexity-effective solution [MICRO’09]

• Do NOT record shared-memory dependencies at all• Infer shared-memory dependencies offline• Using Satisfiability Modulo Theory (SMT) solver

Contribution• Find a causal order compliant to Total Store Order• Bound search space under TSO

Page 7: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 7 -

System Overview

Write

ReadRead

Log shared memory dependency

Checkpoint Memory and Registers

Log non-deterministic program inputInterrupts, I/O values, DMA, etc.

BugNet [ISCA’05]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs thread interleaving offline

Checkpoint Registers

Thread 1 Thread 2 Thread 3

Page 8: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 8 -

Roadmap

• Motivation• Background: Load-based logging architecture [ISCA’05]

Write Read

Read

BugNet [ISCA’05]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline

Checkpoint Registers

T1 T2 T3

• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion

Page 9: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 9 -

Insight• Recording initial register state and values of loads is sufficient for

deterministic replay• Implicitly captures the program input from I/O, DMA, interrupts, etc.• Input and output of other instructions are reproduced during replay

Optimization• Record a load only if it is the first access to a memory location

Processor support• Recording data fetched on a cache miss captures first loads• Any first access to a location would result in a cache miss• May unnecessarily record data due to store misses, but that is OK

Load-based Logging Architecture [Narayanasamy et al, ISCA’05][Lee et al. MICRO’09]

Page 10: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 10 -

Recording Cache Miss Data (First Loads)

ExecutionTime

Log file

First Load

Checkpoint• Register Values• Program Counter

Load A = 0

Load A = 0

<cnt1, 0>

Load B = 5 <cnt2, 5>

Store C = 1

On a store miss • Record old value – data before store update • New value – data after store update – can be reproduced deterministically

Cache Miss

Checkpoint

Record cache misses• <Memory count , Data>• Implicitly capture first loads

<cnt3, 0>

Deterministic Replay• Input and output (including address) of all instructions are replayed

Page 11: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 11 -

Cache Miss Logging for Multithreaded Programs

Insight• Load-based recorder (initial register state + loads) for each thread is

sufficient for replaying that thread=> Recording cache miss data is sufficient for multithreaded programs=> No additional hardware support required for recording dependencies

Reason • Load dependent on a remote write will cause a cache miss to ensure

coherence=> Implicitly records load values dependent on remote writes

Effect• Can replay each thread in isolation (independent of other threads)

regardless of underlying consistency model

Page 12: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 12 -

Replaying Each Thread Independently

Proc 1 Proc 2

Load A=0

Load A=0

Load A=

Store A=1

Invalidation

Cache Coherence• Invalidates cache block to gain exclusive permission

Log cache miss data• Implicitly records loads dependent on remote writes• No change to coherence mechanism

(1st, 0)

(3rd, 1)

Proc 1 LOG

(1st, 0)

Proc 2 LOG

Cache Miss

Cache BlockInvalidated

1Replay each threadindependent of others

Page 13: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 13 -

Shared Memory Dependency

Thread 1 Thread 2

Store

Load

Store

Load

Store

Store

Store

SMT Solver for finding shared memory dependencies

Billions of instructions• Offline analysis may not scale

Final State : A, B, C

Strata for bounding search space

A

B

C

A

A

B

C?Old value x New value

Address

Store AOld value

Address

New value

Page 14: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 14 -

Roadmap

• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis

Write Read

Read

BugNet[ISCA’09]Load-based Hardware Recorder

Satisfiability-Modulo-Theory (SMT) solver reconstructs interleaving offline

Checkpoint Registers

T1 T2 T3

• Bounding Search Space• Evaluation• Conclusion

Page 15: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 15 -

Offline Symbolic Analysis

Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….

Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

Page 16: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 16 -

Offline Symbolic Analysis

Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….

Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

`

Page 17: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 17 -

Encoding Coherence Constraints

Proc 1 Proc 2

x1

x2

x3

x4

x5

xFinal

Coherence Constraints( M→old == M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND

Old value x New value Address

Page 18: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 18 -

Multiple Memory Locations

x1

x2

x3

x 4

x 5

xFinal

Coherence Constraints( M→old == M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y2 < Y3 AND

:

y

y1

2 y3

yFinal

Proc 1 Proc 2

Old value x New value Address

Page 19: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 19 -

Offline Symbolic Analysis

Step 1) Encode Ordering Constraints• Coherence Constraints• Memory Model Constraints (SC, TSO)

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Coherence Constraints(X1 < X2) AND(X2 < X4 OR X5 < X2) AND…Memory Model Constraints(Y1 < X1 < X2 < Y2) AND(X3 < X4 < X5 < Y3) AND….

Step 2) Find a valid casual order• Satisfiability Module Theory (SMT) Solver• Yices [Dutertre and Moura CAV’06]

x1

x2

x 3

x 4

x 5

y

y1

2 y3

Final State

`

Page 20: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 20 -

Encoding Sequential Consistency Constraints

Coherence Constraints( M→old== M→prev→new)

X1: X1 < X3 AND X2: (X3 < X2 < X4 OR X5 < X2) AND :Y1: Y1 < Y2 ANDY2: Y1 < Y2 < Y3 AND

:

Memory Model Constraints(under Sequential Consistency)

P1 : Y1 < X1 < X2 < Y2 AND P2 : X3 < X4 < X5 < Y3 AND

x1

x2

x3

x 4

x 5

xFinal

y

y1

2 y3

yFinal

Proc 1 Proc 2

Page 21: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 21 -

Alice = 1 Bob = 1

r1 = r2 = 0, r3 = r4 = 1

P1 P2

r1 = Bob r2 = Alice

Alice = Bob = 0

Alice = Bob = Charlie = 0P1 P2

Alice = 1 Bob = 1

r1 = r2 = 0, r3 = 1r1 = Bob r2 = Alice

Total Store Order Relaxations

Store→LoadAlice = 1 Bob = 1

r1 = Bob r2 = Alicer1 = r2 = 0

P1 P2Alice = Bob = 0

Load →Load

LoadSB-Hit→Load

r3 = CharlieCharlie = 1

r3 = Alice r4= Bob

Stores Store Buffer Hit

Log memory countsof LoadSB-Hit

SC

TSO

SC

TSO

SC

TSO

1 1

00

Page 22: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 22 -

Encoding Total Store Order Constraints

x1

x2

x3

x 4

x 5

xFinal

y1

y2 y3yFinal

P1 : Y1 < X1 < X2 < Y2 P2 : X3 < X4 < X5 < Y3

AND X1 < Y2 AND X3 < Y3

Stores Store Buffer Hit

Store

Load

LoadSB-Hit

Load

Derived TSO schedule

x1

x2

x3

x 4

xFinal

1y

y Final

x 5

Encoded TSO constraints SMT Solver

y2 y3

Page 23: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 23 -

Roadmap

• Motivation• BugNet• Offline Symbolic Analysis• Bounding Search Space• Evaluation• Conclusion

Page 24: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 24 -

Bounding Search Space using Strata

Proc 1 Proc 2

N msgs

N msgs

Final State

hint 1 hint 2

hint 3 hint 4

Strata Region 3

Strata Region 2

Strata Region 1

Final State

Initial State

Final State

Initial State

Final State

Strata• After every N broadcasted

coherence messages, each processor records “Strata Hints”

Strata regions are ordered

SMT Solver• Analyzes one Strata region at a time• Starts from the last Strata region • Initial state of a Strata region = Final state of preceding region

Coherencemessages

Stratum

Stratum

Page 25: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 25 -

… remains in store-buffer

… executed, not committed yet

Strata Hint = Mem. Count + I-Store CountStrata hint• Every processor records• memory count• number of in-flight stores in the store buffer (called I-Store)

• Reconstruct Stratum offline based on the memory count • Move the pending stores and dependent loads to the next Strata region

{1,1} {1,1}Log{1,1}

Log{1,1}

< Recording Strata Hints >

Alice = 1 Bob = 1

r1 = Bob r2 = Alice

r1 = r2 = 0

P1 P2

r1 = Bob r2 = Alicer1 = r2 = 0

Alice = 1 Bob = 1P1 P2

I-Store Cnt.

Mem Cnt. Mem Cnt.

I-Store Cnt.Alice = Bob = 00

0

0

0

1

1 1

1

< Reconstructing Strata Offline>

Page 26: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 26 -

Filtering Local, Read-only & Cache-hit Accesses

Load A

Store BLoad B

Store A

Load C

Load CLoad C Load C

Load C

Strata Region

Thread 1 Thread 2Local accesses• No shared-memory dependencies

Read-only accesses• Any order is valid

Intermediate cache-hit accesses• No interleaved accesses on successive

cache hits

Effectiveness• < 1% of memory accesses remain

Store DStore DLoad DStore D

… Miss… Hit… Hit… Hit

Store D

Load D

Page 27: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 27 -

Roadmap

• Motivation• Load-based Logging Architecture• Offline Symbolic Analysis• Evaluation• Conclusion

Page 28: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 28 -

EvaluationSimulation Framework• Simics: simulate multi-processor execution (2, 4, 8,16 cores) • Modified FeS2 : simulate Total Store Order• Fast-forward up to known synchronization points• Simulated for 500 million instructions

Benchmarks• SPLASH2: barnes, fmm, ocean• PARSEC2.0: blackscholes, bodytrack, x264• SPEComp: wupwise, swim• Servers: apache, mysql

Offline Symbolic Analysis• Yices SMT solver [Dutertre and Moura CAV’06]

Page 29: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 29 -

Program Input and LoadSB-Hit Log Size

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge0

100

200

300

400

500

600

Log

Size

(MB/

sec)

192MB/sec

On average, 192 MB/sec (8 threads, TSO)• Dominated by cache miss logging• 2.45% of load instructions read their values from the store buffer

Program Input (Data & Instr. Cache Misses) LoadSB-Hit

Page 30: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 30 -

Strata Hint Log Size

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge1

10

100

1000

10000

Stra

ta L

og S

ize

(KB/

sec)

On average, 1.2MB/sec (8 threads, TSO, b-bound 10)• 15% increase over SC to log the number of pending stores (I-Store)

No hardware support for precise shared-memory dependency logging

1.2MB/sec

SC TSO (with I-Stores)

Page 31: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 31 -

Offline Analysis Overhead

barne

sfm

moc

ean

black

scho

les

body

track

x264

wupwise

swim

apac

hemys

ql

avera

ge10

100

1000

10000

Offl

ine

anal

ysis

ove

rhea

d(s

ecs/

one

seco

nd o

f pro

g. e

xec.

)

On average, 260 seconds/sec (8 threads, TSO, b-bound 10)• Overall, 30% more efficient than SC

One time cost before replay

260secs

SC TSO

Page 32: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 32 -

Performance Overhead etc.

Performance Overhead• Performance is mainly dominated by cache miss logging• Additional logging overhead for LoadSBH and I-Store is small

• On average, <1% slowdown in IPC

Paper includes more results on• Comparisons between different Strata bounding schemes• Effectiveness of cache hit filtering• Scalability results on different number of processors• …

Page 33: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 33 -

Conclusion

Reproducing concurrent executions is a huge challenge

Our proposal: a complexity-effective hardware solution for TSO• Record cache miss data and store buffer hits• Record Stratum (memory count + store buffer count) to bound search

space• Determine shared memory dependencies using offline symbolic analysis

Result (8 threads, TSO)• Performance overhead: less than 1%• Total log size: 193 MB/sec (Program input + Strata hints)• Offline analysis: 260 seconds/sec (30% more efficient than SC)

Page 34: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 34 -

Thank you

Page 35: Dongyoon  Lee † ,  Mahmoud  Said*,  Satish Narayanasamy † ,  and  Zijiang  James Yang*

- 35 -