Effective and Inexpensive (Memory) Race Recording Min Xu Thesis Defense 05/04/2006 Electrical and Computer Engineering Department, UW-Madison Advisors: Mark Hill, Rastislav Bodik Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood


Page 1:

Effective and Inexpensive (Memory) Race Recording

Min Xu

Thesis Defense

05/04/2006

Electrical and Computer Engineering Department, UW-Madison

Advisors: Mark Hill, Rastislav Bodik

Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

Page 2:

Overview

Increasingly useful to replay multithreaded code
• Race recording: key to dealing with nondeterminism

A Case Study
• Long recording: 1 byte/kilo-instr
• Always-on recording: less than 2% overhead
• Low cost: 24 KB RAM/core
• Supports both SC & TSO (x86-like)

[Diagram: Race Recorder. Effective: Long Recording, More Applicable. Inexpensive: Low Overhead, Low Cost.]

Page 3:

Thesis Contributions

Effective
• RTR Algorithm -> Small Log Size
• Order-Value Hybrid -> SC & TSO Applicability

Inexpensive
• Coherence Piggyback -> Low Runtime Overhead
• Set/LRU Approximation -> Low Cost Hardware

Page 4:

Outline

Motivation & Problem (5 slides)

An Effective and Inexpensive Race Recorder (21 slides)
• RTR Algorithm
• Set/LRU Approximation
• Coherence Piggyback
• Order-Value Hybrid

Evaluation Method & Results (6 slides)

Conclusion & My Other Research (3 slides)

Page 5:

Motivation & Problem

Page 6:

Multithreaded Debugging

% gcc hash.c
% a.out
Segmentation fault
%

% gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d;

% gdb a.out
gdb> run
Program exited normally.
gdb>

% gcc para-hash.c
% a.out
Segmentation fault
%

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

Page 7:

Race Recording

[Figure: Thread I executes X = 1; X++; print(X). Thread J executes X = X*5.
• Original: J's X = X*5 falls between I's instructions so that print(X) shows X=6.
• Replay: without a log, X = X*5 can fall at a different point and print(X) shows X=10.
• Recording: the position of X = X*5 is saved in a Log, and print(X) shows X=6 again.]
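The two outcomes on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the step names (`I1`, `I2`, `I3`, `J1`) and the `run` helper are hypothetical, and the "log" here is simply the recorded interleaving, not the compact dependence log the thesis develops.

```python
def run(schedule):
    """Execute the slide's two threads in the given interleaving."""
    x = 0
    printed = None
    ops = {
        "I1": lambda v: 1,       # Thread I: X = 1
        "I2": lambda v: v + 1,   # Thread I: X++
        "J1": lambda v: v * 5,   # Thread J: X = X*5
    }
    for step in schedule:
        if step == "I3":         # Thread I: print(X)
            printed = x
        else:
            x = ops[step](x)
    return printed

original = ["I1", "J1", "I2", "I3"]         # J's multiply lands before X++
unlogged_replay = ["I1", "I2", "J1", "I3"]  # a different, equally legal interleaving
log = list(original)                        # "recording" = remembering the order

print(run(original), run(unlogged_replay), run(log))
```

Replaying from `log` gives the same printed value as the original run, while the unlogged replay does not.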

Page 8:

Recording for Multithreaded Replay

Race Recording (Focus)
• Not an issue for a single thread
• Create the same general & data races

Checkpointing
• Provide a snapshot of the program state
• Many proposals (e.g., SafetyNet), not the focus

Input Recording
• Provide repeatable inputs
• Some proposals (e.g., part of FDR), not the focus

Page 9:

A Good Race Recorder

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

• Long recording: small log
• Low runtime overhead
• Low cost
• Applicability

Page 10:

Desired & Existing Race Recorders

[Table: columns are Recording Length, Applicability, Overhead, Cost.]

Desired Recorder
• Recording Length: Small Log Size
• Applicability: MP, Racey Code, SC, TSO
• Overhead: Negligible Slowdown
• Cost: Little Hardware

Recorders compared: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder

Page 11:

Order-Value Hybrid

Set/LRU Approximation

RTR Algorithm

Coherence Piggyback

Small Log Size

Page 12:

Problem Formulation

Reproduce exact same conflicts: no more, no less

[Figure: Recording and Replay timelines for Thread I and Thread J, connected by a Log. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Conflicts are drawn in red, dependences in black.]

Page 13:

Log All Conflicts

Detect conflicts -> Write log
• Assign IC (logical timestamps)
• But too many conflicts

[Figure: Replay timelines for Thread I and Thread J, each with instruction counts 1 to 6. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D.]

Dependence Log (16 bytes per dependence)
Log J: 2->3, 1->4, 3->5, 4->6
Log I: 2->3
Log Size: 5*16 = 80 bytes (10 integers)

Page 14:

Netzer’s Transitive Reduction

[Figure: Replay timelines for Thread I and Thread J, each with instruction counts 1 to 6. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. One dependence is marked "TR reduced".]

TR Reduced Log
Log J: 2->3, 3->5, 4->6
Log I: 2->3
Log Size: 64 bytes (8 integers)
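A minimal sketch of the reduction for the two-thread example: going from the full Log J (2->3, 1->4, 3->5, 4->6) to the reduced one (2->3, 3->5, 4->6). It assumes dependences from one thread to the other arrive ordered by destination IC, and drops any dependence already implied by an earlier logged one plus program order. The function name `reduce_pairwise` is hypothetical, and this handles only a single source/destination pair, not Netzer's general vector-clock formulation.

```python
def reduce_pairwise(deps):
    """Drop dependences implied by earlier ones between the same two threads.

    deps: list of (src_ic, dst_ic) pairs, sorted by dst_ic.
    A pair (s, d) is implied if some earlier logged (s2, d2) has s2 >= s
    (and, by the sort order, d2 <= d): s precedes s2 in program order,
    s2 precedes d2 by the logged dependence, and d2 precedes d.
    """
    kept = []
    max_src = 0  # largest source IC logged so far
    for src, dst in deps:
        if src > max_src:
            kept.append((src, dst))
            max_src = src
    return kept

log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]   # I -> J dependences from the previous slide
log_i = [(2, 3)]                           # J -> I dependence

reduced = reduce_pairwise(log_j) + reduce_pairwise(log_i)
print(reduced, len(reduced) * 16, "bytes")
```

With 16 bytes per dependence, the four surviving dependences account for the 64 bytes quoted on the slide.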

Page 15:

The Intuition of the New RTR Algorithm

[Figure: dependences after reduction, from I to J and from J to I, grouped into vectors.]

Regulate Replay (RTR)

Page 16:

Stricter Dependences to Aid Vectorization

[Figure: Replay timelines for Thread I and Thread J, each with instruction counts 1 to 6. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Dependences are marked "stricter" and "Reduced".]

New Reduced Log
Log J: 2->3, 4->5
Log I: 2->3
Log Size: 48 bytes (6 integers)

Page 17:

Compress Vectorized Dependencies

[Figure: Replay timelines for Thread I and Thread J, each with instruction counts 1 to 6. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Groups of dependences are marked "Vector Deps."]

Vectorized Log
Log J: x=3,5, ∆=1
Log I: x=3, ∆=1
Log Size: 40 bytes (5 integers)

Reduce log size to KB/core/second
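The log sizes quoted across these slides (80, 64, 48, and 40 bytes) can be checked with a short sketch. Two assumptions are mine, not stated on the slides: integers are 8 bytes (inferred from "80 bytes (10 integers)"), and a vector entry "x=3,5, ∆=1" stands for the dependences (x-∆)->x for each listed x. The helper names are hypothetical.

```python
INT_BYTES = 8  # assumed from "80 bytes (10 integers)"

def expand(xs, delta):
    """Expand a vectorized entry into explicit (src_ic, dst_ic) dependences."""
    return [(x - delta, x) for x in xs]

def plain_log_bytes(deps):
    """Each explicit dependence stores two integers."""
    return len(deps) * 2 * INT_BYTES

def vector_log_bytes(vectors):
    """Each vector stores its x values plus one shared delta."""
    return sum(len(xs) + 1 for xs, _delta in vectors) * INT_BYTES

all_conflicts = [(2, 3), (1, 4), (3, 5), (4, 6), (2, 3)]   # Log J + Log I
tr_reduced    = [(2, 3), (3, 5), (4, 6), (2, 3)]
stricter      = [(2, 3), (4, 5), (2, 3)]
vectors       = [([3, 5], 1), ([3], 1)]                    # Log J, Log I

sizes = [plain_log_bytes(all_conflicts), plain_log_bytes(tr_reduced),
         plain_log_bytes(stricter), vector_log_bytes(vectors)]
print(sizes)
```

Expanding the two vectors gives back exactly the stricter dependences 2->3 and 4->5 for Log J and 2->3 for Log I.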

Page 18:

Order-Value Hybrid

Set/LRU Approximation

RTR Algorithm

Coherence Piggyback

Low Runtime Overhead

Page 19:

Detect Conflicts

[Figure: Recording timelines. Thread I: 1 ld A, 2 st B, 3 st C. Thread J: 1 add, 2 st C, 3 ld B, 4 st A. Each address keeps a writer field and a readers set, e.g. A.readers and A.writer.]

ld A in Thread I:
A.readers.add(I, 1)

st B in Thread I:
B.writer = (I, 2)

st C in Thread J:
C.writer = (J, 2)

st C in Thread I:
if (C.writer != I) log(WAW)
foreach C.readers
  if (reader != I) log(WAR)
C.readers.clear( )
C.writer = (I, 3)

ld B in Thread J:
if (B.writer != J) log(RAW)
B.readers.add(J, 3)

Expensive in software
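The per-address bookkeeping on this slide can be written out as a small software detector. This is a sketch of the slide's pseudocode, not the thesis implementation: the class name `ConflictDetector` and the global order in which the example accesses are replayed are assumptions chosen to match the dependences logged on the earlier slides.

```python
from collections import defaultdict

class ConflictDetector:
    """Track the last writer and the readers of every address."""

    def __init__(self):
        self.writer = {}                   # addr -> (thread, ic)
        self.readers = defaultdict(dict)   # addr -> {thread: ic}
        self.log = []                      # (kind, source, destination)

    def load(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("RAW", w, (thread, ic)))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("WAW", w, (thread, ic)))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", (reader, reader_ic), (thread, ic)))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

d = ConflictDetector()
d.load("I", 1, "A")    # I:1 ld A
d.store("I", 2, "B")   # I:2 st B
d.store("J", 2, "C")   # J:2 st C
d.store("I", 3, "C")   # I:3 st C -> WAW from J:2
d.load("J", 3, "B")    # J:3 ld B -> RAW from I:2
d.store("J", 4, "A")   # J:4 st A -> WAR from I:1
print(d.log)
```

Every load and store pays for a lookup and possibly a set walk, which is the cost the slide calls expensive in software.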

Page 20:

Use Cache and Cache Coherence

Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

The cached timestamps play the role of A.readers / A.writer and B.readers / B.writer.

[Figure: ld B misses and sends a Get/S Request; the Data Response carries the Timestamp. RAW detected & logged.]

Detect conflict in hardware with little runtime cost
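A toy model of the piggyback idea, assuming the cache contents shown on this slide. The `Cache` class and `load` function are hypothetical names, the two-cache "protocol" is heavily simplified (no directory, no data), and Proc J's instruction count at the time of ld B is an invented value for illustration.

```python
class Cache:
    def __init__(self, name, lines, ic=0):
        self.name = name
        self.lines = lines   # tag -> [state, timestamp]
        self.ic = ic         # instruction count of the thread on this core
        self.log = []

def load(requester, other, tag):
    """Model a load: on a miss, the Get/S response carries the owner's timestamp."""
    requester.ic += 1
    state, _ = requester.lines.get(tag, ["I", None])
    if state == "I":                                  # miss: Get/S request
        other_state, other_ts = other.lines.get(tag, ["I", None])
        if other_state == "M":                        # other cache wrote the line
            other.lines[tag][0] = "S"                 # owner downgrades to Shared
            requester.log.append(
                ("RAW", (other.name, other_ts), (requester.name, requester.ic)))
    requester.lines[tag] = ["S", requester.ic]

proc_i = Cache("I", {"A": ["S", 1], "B": ["M", 4]})
proc_j = Cache("J", {"A": ["S", 3], "B": ["I", 2]}, ic=4)  # ic=4 is hypothetical

load(proc_j, proc_i, "B")   # Proc J executes ld B
print(proc_j.log)
```

The conflict is found using a message the coherence protocol sends anyway; the only addition is the timestamp field on the response.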

Page 21:

Cache Evictions and Writebacks

Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

Additional cache line shown: C M … 3

Directory of A: Shared(I,J) Owner()

[Figure: st A. Message labels: Get/S, Inv, Ack, Timestamp?. WAR detected & logged.]

OK with nonsilent eviction & directory eviction

Page 22:

Implement TR and RTR in Hardware

Ideal TR requires vector timestamps
• Too expensive
• New idea: Pairwise-TR (use scalar timestamp)
• Enable pairwise transitive reduction

Optimal RTR algorithm is likely expensive
• Implement a greedy RTR algorithm
• One-pass, online algorithm
• Keep a sliding window of vectorizable dependencies

Page 23:

Hardware Implementation

Cache
• Eviction/writeback: Solved, more details later
• Directory protocols: Solved
• Snooping protocols: Partly solved
• Two-level coherence: Not yet solved

Processor
• Out-of-order/Prefetching: Solved
• Unordered message: Solved
• Counter overflow: Solved
• Thread Migration: Not yet solved

Page 24:

Order-Value Hybrid

Set/LRU Approximation

RTR Algorithm

Coherence Piggyback

Low Cost Hardware

Page 25:

Timestamp Approximation

One Set of I’s $
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2
C    M      …     3

Directory of A: Shared(I)

[Figure: Recording timelines for Thread I and Thread J, each with instruction counts 1 to 3. Instructions shown across the two threads: ld A, st B, st C, add, st C, ld B, st A, ld D.]

Use current IC of thread I

Correct, but more evictions -> more logged conflicts

Page 26:

Hardware Cost

Log Size

Page 27:

Set/LRU Approximation

One Set of I’s $
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2
C    M      …     3

[Figure: Recording timelines for Thread I and Thread J, each with instruction counts 1 to 3. Instructions shown across the two threads: ld A, st B, st C, add, st C, ld B, st A, ld D.]

Use current IC of thread I

LRU guarantees B’s TS > A’s TS

• Set/LRU better preserves reducibility
• Small $ -> more misses but still small log
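One way to read the Set/LRU idea is sketched below: when a line's timestamp has been evicted, approximate it with the timestamp of the least recently used line still resident in the same set, which under LRU cannot be older than anything evicted from that set. This reading, the class name `SetLRUTimestampMemory`, and the integer tags are assumptions for illustration; the slides do not spell out the exact rule.

```python
from collections import OrderedDict

class SetLRUTimestampMemory:
    """Set-associative timestamp store with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set(self, tag):
        return self.sets[tag % self.num_sets]

    def touch(self, tag, ic):
        """Record an access to line `tag` at instruction count `ic`."""
        s = self._set(tag)
        s.pop(tag, None)
        s[tag] = ic                  # most recently used entry sits at the end
        if len(s) > self.ways:
            s.popitem(last=False)    # evict the least recently used entry

    def lookup(self, tag):
        """Return the exact timestamp, or an upper bound if it was evicted."""
        s = self._set(tag)
        if tag in s:
            return s[tag]
        if len(s) < self.ways:
            return None              # nothing evicted yet: line never touched
        lru_tag = next(iter(s))
        return s[lru_tag]            # LRU resident is newer than any evicted line

A, B, C = 0xA, 0xB, 0xC
tsm = SetLRUTimestampMemory(num_sets=1, ways=2)
tsm.touch(A, 1)   # ld A at IC 1
tsm.touch(B, 2)   # st B at IC 2
tsm.touch(C, 3)   # st C at IC 3 evicts A
print(tsm.lookup(A), tsm.lookup(B), tsm.lookup(C))
```

In this example the approximation for A is 2 (B's timestamp) rather than 3 (the current IC), which stays closer to A's true timestamp of 1.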

Page 28:

Hardware Cost of Timestamps

Coupled timestamp memory: overhead grows with cache size
• Not flexible
• 64B line + 64b (24b) timestamp -> 12.5% (4.7%) overhead
• 192 KB for a 4MB L2

Need to modify cache

Coupled Timestamp Memory
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2

Page 29:

Decoupled Timestamp Memory

Decoupling -> small timestamp memory (Set/LRU)
• e.g., 32-set, 64-way -> 99% transitive reduction
• Timestamp memory: 24 KB

No need to modify cache

Coupled Timestamp Memory
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     2

Decoupled: Cache
Tag  State  Data
A    S      …
B    M      …

Decoupled: Timestamp Memory
Tag  Timestamp
A    1
B    2

From 192 KB to 24 KB: 8x reduction
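The coupled-timestamp figures can be reproduced with simple arithmetic from the numbers on the slides (64-byte lines, 4 MB L2, 64-bit or 24-bit timestamps). The function names below are hypothetical; the 24 KB decoupled size is taken from the slide as given, not derived.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(ts_bits):
    """Timestamp bits as a fraction of the data bits in one cache line."""
    return ts_bits / (LINE_BYTES * 8)

def coupled_bytes(ts_bits):
    """Total timestamp storage when every L2 line carries a timestamp."""
    return (L2_BYTES // LINE_BYTES) * ts_bits // 8

DECOUPLED_BYTES = 24 * 1024   # from the slide

print(f"{coupled_overhead(64):.1%}", f"{coupled_overhead(24):.1%}")
print(coupled_bytes(24) // 1024, "KB coupled vs",
      DECOUPLED_BYTES // 1024, "KB decoupled")
```

A 4 MB L2 has 65,536 lines, so 24-bit timestamps cost 65,536 × 3 bytes = 192 KB, eight times the 24 KB decoupled memory.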

Page 30:

Order-Value Hybrid

Set/LRU Approximation

RTR Algorithm

Coherence Piggyback

SC & TSO Applicability

Page 31:

Recording with Total Store Order (TSO)

• Majority of existing MP are non-SC
• TSO is well defined, x86-like

[Figure: Initially A=B=0. Thread I: 1 st A,1; 2 ld B. Thread J: 1 st B,1; 2 ld A.
• SC: the interleavings shown give loaded values A=1 B=0, A=0 B=1, or A=1 B=1.
• TSO: the ordering ld A, ld B, st A,1, st B,1 is also possible, giving A=0 B=0.]
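The difference between the SC and TSO outcomes for this two-thread program can be checked by brute force. The sketch below models each store as two events, entering the write buffer ("st") and draining to memory ("drain"), and enumerates every legal ordering. The event model and the `outcomes` function are illustrative assumptions, not a full TSO definition.

```python
from itertools import permutations

# Each thread performs one store and then one load: (store address, load address).
THREADS = {"I": ("A", "B"), "J": ("B", "A")}

def outcomes(tso):
    """Return the set of (loaded A, loaded B) values over all legal orderings."""
    events = [(t, e) for t in THREADS for e in ("st", "drain", "ld")]
    results = set()
    for order in permutations(events):
        pos = {ev: i for i, ev in enumerate(order)}
        legal = True
        for t in THREADS:
            # Program order: the store is issued before it drains and before the load.
            if pos[(t, "st")] > pos[(t, "drain")] or pos[(t, "st")] > pos[(t, "ld")]:
                legal = False
            # SC only: the store must reach memory before the later load executes.
            if not tso and pos[(t, "drain")] > pos[(t, "ld")]:
                legal = False
        if not legal:
            continue
        mem = {"A": 0, "B": 0}
        buf = {t: {} for t in THREADS}
        loaded = {}
        for t, e in order:
            st_addr, ld_addr = THREADS[t]
            if e == "st":
                buf[t][st_addr] = 1                      # store enters write buffer
            elif e == "drain":
                mem[st_addr] = buf[t].pop(st_addr)       # store becomes visible
            else:
                loaded[ld_addr] = buf[t].get(ld_addr, mem[ld_addr])
        results.add((loaded["A"], loaded["B"]))
    return results

sc, tso = outcomes(tso=False), outcomes(tso=True)
print(sorted(sc), sorted(tso - sc))
```

The only outcome TSO adds is both loads returning 0, which is the case an order-only recorder cannot reproduce under SC replay.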

Page 32:

TSO Execution

[Figure: Initially A=B=0. Thread I: 1 st A,1; 2 ld B. Thread J: 1 st B,1; 2 ld A. Each thread has a write buffer (WrBuf) in front of the Memory System. st A,1 waits in I's WrBuf and st B,1 waits in J's WrBuf while the memory system still holds A=0 B=0, so the effective order is ld A, ld B, st A,1, st B,1 and the loads return A=0 B=0. The memory system ends with A=1 B=1.]

Page 33:

Order-Value-Hybrid Recording

[Figure: Recording and Replay of the same program. Initially A=B=0. Thread I: 1 st A,1; 2 ld B. Thread J: 1 st B,1; 2 ld A. Each thread has a write buffer (WrBuf) in front of the Memory System, which goes from A=0 B=0 to A=1 B=1.
• Recording labels: Start Monitor A, Start Monitor B, A Changed!, Stop Monitor B.
• WAR omitted, value logged.
• Replay: value used, A=0.]

Page 34:

Hybrid Recording with TR and RTR

Hybrid recording
• All loads get correct values
• Hardware similar to OoO SC [Gharachorloo et al. ’91]

Hybrid + TR & RTR
• TR will not use the omitted WAR in reduction
• RTR vectorizes dependencies more conservatively

Page 35:

Evaluation Method & Results

Page 36:

Put-it-together: Determinizer/CMP

[Block diagram: Core 1 to Core 4, each with a TSM, around a Shared L2 Cache (L1 Dir). Per-core detail: L1_I$, L1_D$, TSM, IC, L1 Coherence Controller, Log, TR Reg, RTR Reg.]

Page 37:

Simulation Method

Commercial server hardware
• GEMS: http://www.cs.wisc.edu/gems
• Full-system (OS + application) executions
• 4-core CMP (sequentially consistent)
• 1-way in-order issue, 2 GHz
• 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory

Commercial server software
• Apache: static web serving
• SpecJBB: middleware
• OLTP: TPC-C like
• Zeus: static web serving

Page 38:

Log Size: 1 byte/kilo-instr

Well within the capability of current machines
• Long recording (days to months) needs improvement

[Charts: log size for Apache, JBB, OLTP, Zeus, and AVG, in byte/core/kilo-instr (axis 0.0 to 2.0) and in KB/core/s (axis 0 to 200).]

Page 39:

Runtime Overhead

[Charts: Execution Time and Interconnection Msg. B/W (axes 0 to 100) for Apache, JBB, OLTP, and Zeus, comparing Baseline against With race recorder.]

Our recorder can be “always-on”

Page 40:

Benefits of RTR and Set/LRU (Log Size)

[Chart "Improvement by RTR": log size (axis 0 to 100) for Apache, JBB, OLTP, Zeus, and AVG, comparing Pairwise-TR against Our RTR.]

[Chart "Effectiveness of Set/LRU": log size (axis 0 to 100) for Apache, JBB, OLTP, Zeus, and AVG, comparing Perfect TSM against 24KB Set/LRU TSM.]

Page 41:

Why Do RTR and Set/LRU Work Well?

RTR
• Processors execute instructions at similar speed
• Therefore, we can find “vectorizable” dependencies

Set/LRU
• Temporal locality makes the LRU timestamps old
• We only need to know if a timestamp is “old-enough”

Page 42:

Sensitivity and Scalability

A design space of the timestamp memory (TSM)
• Size: smaller TSM -> larger log
• Read/write timestamp: should be used when TSM is large
• Partial timestamp: 24 bits are enough
• Associativity: higher is better for RTR

Scalability of the recorder
• Studied with modest processor counts (2p – 16p)
• Commercial workloads, not scientific workloads
• Log size increases slowly with number of cores

Page 43:

Conclusion & My Other Research

Page 44:

Race Recording

Race recording -> key to combating nondeterminism

My thesis -> an effective & inexpensive recorder
• RTR algorithm -> small log size
• Coherence piggyback -> negligible slowdown
• Timestamp approximation -> low hardware cost
• Order-value hybrid -> support SC & TSO

Future work
• Improve race recording algorithm
• Improve race recorder implementation
• Study race replay

Page 45:

Serializability Violation Detector [PLDI’05]

Like a race detector

No a priori annotation requirement
• “critical sections” are inferred

Intended to detect bugs that “actually” happen
• Check for a 2-Phase-Locking condition

[Figure: A “Critical Section” over Shared Variables: Read in1, Read in2, Write out1, Write out2, Write local, Read local.]

Page 46:

Publications

FDR (ISCA’03)
• Adopted by UCSD BugNet (ISCA’05)

SVD (PLDI’05)
• Cited by Vaziri et al. (POPL’06)
• Influenced new data race definition

RTR, Set/LRU & Hybrid
• Submitted for publication

Page 47:

Thank you!

% gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d;

% gcc para-hash.c
% a.out
Segmentation fault
Race recorded in “log”
%

Page 48:

Acknowledgements

Joint work with my advisors
• Mark Hill, Ras Bodik

Ph.D. Committee
• David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller

Multifacet Group
• Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen

Affiliates & Companies
• Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

Page 49:

Deterministic Replay is Useful

Deterministic replay is logically recreating a program execution

Present applications
• Cyclic Debugging ([Pancake & Netzer ’93])
• Fault Tolerance (ExtraVirt [Lucchetti et al. ’05])
• Intrusion Analysis (ReVirt [Dunlap et al. ’02])

Future applications
• Data Recovery
• Replay-based Synchronization

Page 50:

Multicore and Multithreading

Multicore is common
• AMD X2
• IBM Power 5/6, Cell
• Intel Pentium D, Core Duo
• Sun SPARC T1

Multithreading is common
• Server: high throughput
• Scientific: high performance
• Desktop/embedded: low response time

Page 51:

Race Recording: Key to Determinism

Races: general race & data race [Netzer & Miller]
• Both cause nondeterminism
• Race recording can help, but

Existing race recorders are inadequate
• Some generate large logs
• Some have high runtime overhead
• Some have high hardware cost (space overhead)
• Support only sequential consistency

Need a better race recorder

Page 52:

Recording/Replay & Debugging

[Figure: Online Recorder: processors P1 to P4 run while Checkpoint A, Checkpoint B, and Checkpoint C are taken and Store log A, Store log B, and Store log C follow them, until a Crash and Dump “Core”. Deterministic Replayer: after the Crash, Read Checkpoint B, then replay from log B and log C.]

Page 53:

Deterministic Replay & Fault Tolerance

Fault Recovery
• Replay after a failure

Fault Detection
• Replay then compare

(Courtesy of VMware)

Page 54:

Future: Record/Replay & Undo/Redo

VM as a software platform
• Ease software development
• Fine granularity in Undo and Redo

[Figure: Windows XP screenshot.]

Page 55:

Future: Replay-based Synchronization

Three steps
• Coarse-grain sync. -> fine-grain sync. -> hardware sync.

Results: higher performance

Works only if static control flow & fixed data addr
• DSP kernels

[Figure: Recording: one thread runs ld A, st B, Unlock(); the other runs lock(), st A, ld B. A Log feeds Replay, which runs ld A, st B and st A, ld B without the lock operations.]

Page 56:

Race Recording Related Work

Total-order recorders (low replay parallelism)
• Bacon ’91 (Hardware): bus transactions; large log; low overhead
• RecPlay ’00, JaRec ’04: Lamport clocks; small log; low overhead (sync only)
• R&C ’90, Déjà Vu ’98: scheduling; small log; low overhead (non-MP)

Partial-order recorders (high replay parallelism)
• Bacon ’91 (Hardware): bus transaction groups; large log; low overhead
• Instant Replay ’87: variable version; large log; high overhead
• Netzer ’93: vector clocks; small log; high overhead

Page 57:

Correctness of Order-Value-Hybrid

Removing WAR dependencies
• Say thread I reads and thread J writes
• Removing the WAR affects I’s read, not J’s write
• But, for every dependence removed, thread I reads the correct value from the value log
• Therefore, all reads get the correct value

Page 58:

TR and TSO

TR affects dependencies reduced by a WAR
• The WAR itself may later be removed during replay
• Solution: do not use a WAR in TR if the WAR can be removed
• Respond with a special flag when a loaded cache line is stolen

[Figure: Recording timelines for Thread I and Thread J, each with instruction counts 1 to 3. Instructions shown across the two threads: st A, st C, st B, st C, ld B, ld A. One dependence is marked "Must not be reduced".]

Page 59:

RTR and TSO

The sliding window may expose the ordered loads
• Shrink the sliding window to avoid it

[Figure: Recording timelines for Thread I and Thread J, each with instruction counts 1 to 4. Instructions shown across the two threads: st A, add, add, sub, st B, ld A, ld C, ld B. Labels: "ordered in write buffer", "ordered", "old win for j:3", "new win for j:3", "Not allowed by new window".]

Page 60:

Deadlock Avoidance of RTR

[Figure: Recording timelines for Thread I and Thread J, each with instruction counts 1 to 6. Instructions shown across the two threads: ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D. Labels: i:4, j:1, j:2, i:3, i:4, and "Replay Cycle".]

Avoid deadlock by adhering to an SC total order

Page 61:

Recording Race-free Executions

No data races

Only need to record synchronization races

Deterministic replay up until the first data race

Page 62:

Replay Parallelism

Replay performance depends on
(1) Number of synchronizations
(2) Extra wait incurred by the synchronizations

Page 63:

Directory Protocols

Add sticky states in the directory
• Retain states after writebacks
• Need extra acknowledgements

Or, add extra timestamp memory in the directory
• Helps to avoid extra acknowledgements

A tradeoff
• Sticky states can be cheaper
• But extra timestamp memory can be faster

Page 64:

Snooping Protocols

Key problem is combined/implicit response
• Not a problem for AMD Hammer

Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

[Figure: st A. Message labels: Get/X, Pull Shared, + Current IC. WAR detected & logged.]

Page 65:

Nonsilent Evictions

Proc I cache
Tag  State  Data  Timestamp
A    S      …     1
B    M      …     4

Proc J cache
Tag  State  Data  Timestamp
A    S      …     3
B    I      …     2

Additional cache line shown: C M … 3

Directory of A: Shared(J) Owner() StickyS(I,J)

[Figure: st A. Labels: Get/S, Ack, Timestamp, Timestamp Memory, Eviction.]

Directory eviction: more false conflicts, like snooping

Page 66:

Out-of-Order & Hardware Prefetching

Speculative execution
• No IC assigned yet

Hardware prefetching
• No IC assigned

Key idea: receive observation
• Can associate a ld/st with current commit instruction

Page 67:

Unordered Messages in Interconnect

Messages arrive out-of-order

Can affect reduction

But better to add a sequence number
• Reconstruct the message order
• Enable IC compression by sending deltas
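A small sketch of how a sequence number could serve both purposes on this slide. The message format (a sequence number plus an IC delta relative to the previous message from the same sender) and the `Reorderer` class are assumptions made for illustration; the slides do not give the actual encoding.

```python
class Reorderer:
    """Rebuild send order from sequence numbers and ICs from deltas."""

    def __init__(self):
        self.next_seq = 0    # next sequence number we are waiting for
        self.pending = {}    # seq -> ic_delta for messages that arrived early
        self.ic = 0          # IC carried by the last in-order message

    def receive(self, seq, ic_delta):
        """Accept one message; return the ICs that are now known in order."""
        self.pending[seq] = ic_delta
        released = []
        while self.next_seq in self.pending:
            self.ic += self.pending.pop(self.next_seq)
            released.append(self.ic)
            self.next_seq += 1
        return released

# Sender's ICs were 100, 103, 110, sent as deltas 100, 3, 7.
r = Reorderer()
first = r.receive(1, 3)      # arrives early, must wait for seq 0
second = r.receive(0, 100)   # releases seq 0 and seq 1
third = r.receive(2, 7)
print(first, second, third)
```

Deltas are only meaningful once the order is known, which is why the two bullets on the slide go together.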

Page 68:

Integer Overflow

IC and timestamps may overflow

IC: make it 64-bit, will not overflow for a long time

Timestamps: use approximation techniques
• MSB of IC + LSB of Timestamps
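One plausible reading of "MSB of IC + LSB of Timestamps" is that only the low bits of each timestamp are stored, and the full value is rebuilt by borrowing the high bits of the thread's current IC. The sketch below assumes the true timestamp is no larger than the current IC and lies within one wrap of it; the function name `reconstruct` and that assumption are mine, not the slide's.

```python
def reconstruct(partial_ts, current_ic, width=24):
    """Rebuild a full timestamp from its low `width` bits and the current IC."""
    span = 1 << width
    candidate = (current_ic & ~(span - 1)) | partial_ts   # MSBs of IC + LSBs of TS
    if candidate > current_ic:
        candidate -= span   # the low bits wrapped since the timestamp was taken
    return candidate

# 8-bit partial timestamps keep the example numbers small.
no_wrap = reconstruct(0x03, current_ic=0x205, width=8)
wrapped = reconstruct(0xF0, current_ic=0x205, width=8)
print(hex(no_wrap), hex(wrapped))
```

A timestamp older than one wrap would be reconstructed as newer than it really is, which is conservative in the same way as the other timestamp approximations: correct, but possibly logging more.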

Page 69:

Varying TSM Size

[Charts, one each for Apache, OLTP, SPECjbb, and Zeus: Log Bandwidth (MB/core/second, axis 0 to 3) versus Size of the Timestamp Memory (2 KB to 2048 KB). Series per workload: 1TS-RTR, 1TS-TR, 2TS-RTR, 2TS-TR. Configuration: 64 ways, Full Timestamps, Set/LRU.]

Page 70:

Varying Associativity

[Charts, one each for Zeus, SPECjbb, OLTP, and Apache: Log Bandwidth (MB/core/second, log axis 0.01 to 10) versus Associativity of the Timestamp Memory (2 to 1024). Series per workload: CurrentIC-RTR, CurrentIC-TR, SetLRU-TR, SetLRU-RTR. Configuration: 64KB, Full R/W Timestamps.]

Page 71:

Varying Partial Timestamp Width

[Charts, one each for Zeus, SPECjbb, OLTP, and Apache: Log Bandwidth (MB/core/second, log axis 0.01 to 10) versus Partial Timestamp Width (10 to 30). Series per workload: TR, RTR. Configuration: 64 sets, 64 ways, Set/LRU.]

Page 72:

Log Size Scaling

[Chart: Log Size (MB/core/s, axis 0.0 to 1.0) versus Number of Cores (2, 4, 8, 16). Series: Apache, SPECjbb, OLTP, Zeus.]

Page 73:

In Retrospect …

What are you most proud of?
• RTR improves TR after 13 years

What would you do differently if doing it again?
• “replaying me is deterministic” (just kidding)
• I wish I had focused on race recording earlier

What should the industry do?
• Implement the recorder as a VMM extension