AADEBUG 2000 - MUNCHEN
Non-intrusive on-the-fly data race detection using execution replay
Michiel Ronsse - Koen De Bosschere
Ghent University - Belgium
Contents
• Introduction
• Non-determinism & data races
• RecPlay
  • Method
  • Implementation
• Example
• Experimental Evaluation
• Conclusions
Introduction
• Developing parallel programs for shared-memory multiprocessors is considered difficult:
  • a number of threads run simultaneously
  • co-operation & synchronisation happen through shared memory:
    • too much synchronisation: deadlock
    • too little synchronisation: race condition
• Cyclic debugging is impossible because most parallel programs are non-deterministic: program execution is not repeatable
Causes of non-determinism
• Sequential programs: input (keyboard, disk, network), signals, interrupts, certain system calls (gettimeofday(), …)
• Parallel programs: race conditions, i.e. two threads accessing the same shared variable (memory location) in an unsynchronised way, with at least one thread modifying the variable
Example code
#include <pthread.h>
#include <stdio.h>

unsigned global = 5;

/* Both threads update global without any synchronisation: a data race. */
void *thread1(void *arg) { global = global + 6; return NULL; }
void *thread2(void *arg) { global = global + 7; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("global=%u\n", global);
    return 0;
}
Possible executions
[Figure: three possible interleavings of the loads L(value) and stores S(value) on global]
• global=12: L(5), L(5), S(11), S(12)
• global=18: L(5), S(11), L(11), S(18)
• global=11: L(5), L(5), S(12), S(11)
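The three outcomes can be reproduced deterministically with a small simulation. This is only an illustrative sketch: the function name run_schedule and the schedule encoding (0 for the thread adding 6, 1 for the thread adding 7) are assumptions made for this example. Each thread first loads global into a private register and later stores register plus increment.

```c
#include <assert.h>
#include <stddef.h>

/* Replays one interleaving of the two racing threads from the example.
 * schedule[k] names the thread (0 or 1) that performs its next operation.
 * The first operation of a thread is its load, the next one is its store. */
unsigned run_schedule(const int *schedule, size_t steps) {
    unsigned global = 5;
    const unsigned add[2] = {6, 7};   /* thread 0 adds 6, thread 1 adds 7 */
    unsigned reg[2] = {0, 0};         /* private copy made by the load */
    int loaded[2] = {0, 0};

    for (size_t k = 0; k < steps; k++) {
        int t = schedule[k];
        assert(t == 0 || t == 1);
        if (!loaded[t]) {             /* L(value): read the shared variable */
            reg[t] = global;
            loaded[t] = 1;
        } else {                      /* S(value): write back the stale sum */
            global = reg[t] + add[t];
        }
    }
    return global;
}
```

The schedule {0, 0, 1, 1} runs the two updates one after the other, while {0, 1, 0, 1} and {0, 1, 1, 0} let both loads see the initial value, so one update is lost.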
Race conditions
• Two types:
  • synchronisation races:
    • do not allow us to use cyclic debugging
    • are not a bug: desired non-determinism
  • data races:
    • do not allow us to use cyclic debugging
    • are a bug: undesired non-determinism
• The distinction is a matter of abstraction
• Automatic detection of data races is possible:
  • collect all memory references
  • check parallel references
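As an illustration of the distinction, the sketch below protects the update of the earlier example with a mutex. The names add_locked and run_synchronised are hypothetical. Which thread obtains the lock first is still non-deterministic (a synchronisation race), but the updates can no longer overlap, so the data race is gone and the result is always 18.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static unsigned global_value;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Adds *arg to the shared variable while holding the mutex. */
static void *add_locked(void *arg) {
    unsigned inc = *(const unsigned *)arg;
    pthread_mutex_lock(&lock);
    global_value = global_value + inc;   /* load and store are now atomic */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs both updates concurrently and returns the final value. */
unsigned run_synchronised(void) {
    pthread_t t1, t2;
    unsigned six = 6, seven = 7;

    global_value = 5;
    int rc1 = pthread_create(&t1, NULL, add_locked, &six);
    int rc2 = pthread_create(&t2, NULL, add_locked, &seven);
    assert(rc1 == 0 && rc2 == 0);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return global_value;
}
```

The order of the two lock operations is exactly the kind of information a replay mechanism has to record in order to repeat an execution.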
Detecting data races
• Static methods:
  • check the source code for all possible executions with all possible input
  • NP-complete: not feasible
• Dynamic methods:
  • work during an actual execution => only detect the data races that occur during this execution
• Removal requires cyclic debugging
Dynamic data race detection
• A segment is the piece of code between two consecutive synchronisation operations
• We collect two sets for every segment i of every thread: L(i) and S(i), containing the addresses of all load and store operations
• For all parallel segments i and j,
  $\bigl((L(i) \cup S(i)) \cap S(j)\bigr) \cup \bigl((L(j) \cup S(j)) \cap S(i)\bigr)$
  gives the list of conflicting addresses.
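The set expression can be sketched with bit sets. This is a toy illustration, assuming an address space of only 64 addresses so that each set fits in one 64-bit word; the type segment_t and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Load set L(i) and store set S(i) of one segment, one bit per address. */
typedef struct {
    uint64_t loads;
    uint64_t stores;
} segment_t;

static uint64_t bit(unsigned addr) {
    assert(addr < 64);               /* toy address space: 0..63 */
    return UINT64_C(1) << addr;
}

void record_load(segment_t *s, unsigned addr)  { s->loads  |= bit(addr); }
void record_store(segment_t *s, unsigned addr) { s->stores |= bit(addr); }

/* ((L(i) | S(i)) & S(j)) | ((L(j) | S(j)) & S(i)):
 * every address touched by both segments where at least one access is a store. */
uint64_t conflicts(const segment_t *i, const segment_t *j) {
    return ((i->loads | i->stores) & j->stores) |
           ((j->loads | j->stores) & i->stores);
}
```

Two loads of the same address never conflict; a set bit in the result marks an address with a load/store or store/store pair. The comparison is only meaningful for segments that are parallel.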
Existing race detection methods
• Huge overhead, causing probe effect and Heisenbugs
• Only detect the existence of a data race (and the variable), not the instructions involved
• A data race is a bug, so we need cyclic debugging!
RecPlay
• Synchronisation races: execution replay
• Data races: detect
• Also enables cyclic debugging
• Allows you to detect/remove the first data race
• Three phases:
  • record the order of the synchronisation operations
  • replay the synchronisation operations and check for data races
  • normal replay, without checking for data races
Overview
[Figure: flow chart with the steps Choose input, Record, Replay+detect, Replay+ident., Replay+debug, Choose new input, The end. Legend: automatic steps versus steps that require user intervention]
Instrumentation
• JiTI (Just in Time Instrumentation) was developed especially for RecPlay, but it is a generic instrumentation tool
• Instruments memory and synchronisation operations
• Deals correctly with data in code, code in data, self-modifying code
• Clones processes: the original process is used for the data and the instrumented clone is used for the code
• No need for recompilation, relinking or instrumentation of files
Execution replay
• ROLT (Reconstruction of Lamport Timestamps) is used for tracing/replaying the synchronisation operations
• Attaches a scalar Lamport timestamp to each synchronisation operation
• Delaying each synchronisation operation until the operations with a smaller timestamp have been executed suffices for a correct replay
• We only need to log a small subset of all operations
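A scalar Lamport clock for synchronisation operations can be sketched as follows. This shows the standard Lamport update rule, not the ROLT implementation itself; keeping one clock per thread and one per synchronisation object, and the names lamport_t and sync_op, are assumptions made for the illustration.

```c
#include <assert.h>

/* A scalar logical clock, kept per thread and per synchronisation object. */
typedef struct {
    unsigned long clock;
} lamport_t;

/* Timestamps one synchronisation operation of `thread` on `object`.
 * The new timestamp is larger than everything either side has seen,
 * so operations on the same object get increasing timestamps. */
unsigned long sync_op(lamport_t *thread, lamport_t *object) {
    assert(thread != 0 && object != 0);
    unsigned long seen = thread->clock > object->clock ? thread->clock
                                                       : object->clock;
    unsigned long ts = seen + 1;
    thread->clock = ts;
    object->clock = ts;
    return ts;
}
```

During replay, an operation with timestamp ts would be held back until the operations with smaller timestamps have run, which reproduces the recorded order on every object.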
Collecting memory operations
• We need two lists of addresses per segment i: L(i) and S(i)
• A multilevel bitmap is used:
  • low memory consumption
  • comparing two bitmaps is easy
• We lose information: two accesses to the same variable are counted once. This is, however, no problem for data race detection
Memory bitmap
[Figure: multilevel bitmap indexed by three address fields of 9, 9 and 14 bits]
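A three-level bitmap with a 9/9/14 split of a 32-bit address could look like the sketch below. The layout is an assumption for illustration (one bit per address in the leaves, levels allocated on first use); the type and function names are hypothetical and freeing the structure is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define L1_BITS 9
#define L2_BITS 9
#define L3_BITS 14
#define L1_SIZE (1u << L1_BITS)
#define L2_SIZE (1u << L2_BITS)
#define LEAF_BYTES ((1u << L3_BITS) / 8)

typedef struct { uint8_t bits[LEAF_BYTES]; } leaf_t;   /* 2^14 bits */
typedef struct { leaf_t *leaf[L2_SIZE]; } mid_t;       /* second level */
typedef struct { mid_t *mid[L1_SIZE]; } bitmap_t;      /* root level */

static unsigned idx1(uint32_t a) { return a >> (L2_BITS + L3_BITS); }
static unsigned idx2(uint32_t a) { return (a >> L3_BITS) & (L2_SIZE - 1); }
static unsigned idx3(uint32_t a) { return a & ((1u << L3_BITS) - 1); }

/* Marks addr as accessed; returns 0 on success, -1 if allocation fails. */
int bitmap_set(bitmap_t *bm, uint32_t addr) {
    assert(bm != NULL);
    mid_t **m = &bm->mid[idx1(addr)];
    if (*m == NULL && (*m = calloc(1, sizeof **m)) == NULL) return -1;
    leaf_t **l = &(*m)->leaf[idx2(addr)];
    if (*l == NULL && (*l = calloc(1, sizeof **l)) == NULL) return -1;
    unsigned b = idx3(addr);
    (*l)->bits[b / 8] |= (uint8_t)(1u << (b % 8));
    return 0;
}

/* Returns 1 if addr was marked, 0 otherwise; untouched regions cost nothing. */
int bitmap_test(const bitmap_t *bm, uint32_t addr) {
    assert(bm != NULL);
    const mid_t *m = bm->mid[idx1(addr)];
    if (m == NULL) return 0;
    const leaf_t *l = m->leaf[idx2(addr)];
    if (l == NULL) return 0;
    unsigned b = idx3(addr);
    return (l->bits[b / 8] >> (b % 8)) & 1;
}
```

Setting the same address twice leaves the bitmap unchanged, which is the loss of information mentioned on the previous slide.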
Detecting parallel segments
• A vector clock is attached to each segment
• All segment information (two bitmaps + vector timestamp) is kept on a list L
• Each new segment is compared against the segments on list L
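The parallelism test on vector clocks can be sketched as follows: two segments are parallel when neither vector clock is component-wise smaller than or equal to the other. The fixed thread count NTHREADS and the names vclock_t, vc_leq and parallel are assumptions for this example.

```c
#include <assert.h>
#include <stdbool.h>

#define NTHREADS 3   /* hypothetical fixed number of threads */

typedef struct {
    unsigned v[NTHREADS];
} vclock_t;

/* True if a <= b in every component, i.e. a is ordered before (or equals) b. */
bool vc_leq(const vclock_t *a, const vclock_t *b) {
    assert(a != 0 && b != 0);
    for (unsigned k = 0; k < NTHREADS; k++) {
        if (a->v[k] > b->v[k]) return false;
    }
    return true;
}

/* Two segments are parallel if their clocks are incomparable. */
bool parallel(const vclock_t *a, const vclock_t *b) {
    return !vc_leq(a, b) && !vc_leq(b, a);
}
```

Only pairs for which parallel() holds need their load and store bitmaps compared.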
Detecting obsolete segments
• Obsolete segments should be removed from list L
• We use a snooped matrix clock to detect these segments
[Figure: detecting obsolete segments. Legend: segment on list L, obsolete segment, segment in execution, point of execution, the future]
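One possible obsolescence test is sketched below. It is an assumption-laden illustration, not the snooped matrix clock of RecPlay: row u of the matrix is taken to be the latest vector clock known for thread u, and a finished segment is considered obsolete once every other thread has moved strictly past the owner's timestamp of that segment, so that no current or future segment can be parallel with it. The strict comparison is deliberately conservative.

```c
#include <assert.h>
#include <stdbool.h>

#define NTHREADS 3   /* hypothetical fixed number of threads */

typedef struct {
    unsigned v[NTHREADS];
} vclock_t;

/* seg:    vector clock of a finished segment executed by thread `owner`
 * matrix: matrix[u] is the latest vector clock known for thread u
 * Returns true if every other thread already knows a later timestamp of
 * `owner`, i.e. the segment is ordered before everything still to come. */
bool is_obsolete(const vclock_t *seg, unsigned owner,
                 const vclock_t matrix[NTHREADS]) {
    assert(seg != 0 && owner < NTHREADS);
    for (unsigned u = 0; u < NTHREADS; u++) {
        if (u == owner) continue;
        if (matrix[u].v[owner] <= seg->v[owner]) return false;
    }
    return true;
}
```

A segment that passes this test can be dropped from list L together with its two bitmaps.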
Identification phase
• If a data race is detected, we know:
  • the address involved
  • the type of operations involved (load or store)
  • the threads involved
  • the segments containing the racing instructions
• We need another replayed execution to find the racing instructions themselves (+ call stack, …)
• This replay executes at full speed until the racing segments start executing
An Example
[Animated figure, built up step by step: threads executing the assignments B=2, A=1, C=4, C=A+B and A=3, together with the semaphore operations P(S1)/V(S1), P(S2)/V(S2) and P(S3)/V(S3)]
Experimental Evaluation
• RecPlay has been implemented for Solaris running on SPARC multiprocessors
• Tested on a SUN SparcServer 1000 with 4 processors
• SPLASH-2 was used as a benchmark: a number of multithreaded numeric applications, such as fast Fourier transform, a raytracer, ...
• Several data races were found, including in SPLASH-2
Basic performance of RecPlay
program       normal    record    record     replay+detect   replay+detect
              runtime   runtime   slowdown   runtime         slowdown
cholesky        8.67      8.88     1.024        721.4           83.2
fft             8.76      8.83     1.008         82.8            8.3
LU              6.36      6.40     1.006        144.5           22.7
radix           6.03      6.20     1.028        182.8           30.3
ocean           4.96      5.06     1.020        107.7           21.7
raytrace        9.89     10.19     1.030        675.9           68.3
water-Nsq.      9.46      9.71     1.026        321.5           34.0
water-spat.     8.12      8.33     1.026        258.8           31.9
radiosity      21.13     21.50     1.018     datarace found
average                            1.021                        30.6
Segments with memory accesses
program       created   max. stored      compared
cholesky        13983   1915 (13.7%)       968154
fft               181     37 (20.5%)         2347
LU               1285     42 (3.3%)         18891
radix             303     36 (11.9%)         4601
ocean           14150     47 (0.3%)        272037
raytrace        97598     62 (0.1%)        337743
water-Nsq.        637     48 (7.5%)          7717
water-spat.       639     45 (7.0%)          7962
radiosity      438763   6834 (2.0%)     188323337
Efficiency of the ROLT mechanism
program       number of   trace    bandwidth   bandwidth
              sync op.    size     bytes/s     bits/op
cholesky         13857     1132      127.5       0.65
fft                177       65        7.4       2.94
LU                1275      134       20.9       0.84
radix              273      108       17.4       3.16
ocean            22987     6458     1276.3       2.25
raytrace        150960    41416     4064.4       2.19
water-Nsq.         631      336       34.6       4.26
water-spat.        625      332       39.9       4.25
radiosity       524667    24578     1143.2       0.37
average                              748.0       2.30
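The bits/op column is consistent with a trace size expressed in bytes, divided over the number of synchronisation operations. As a worked check on two rows (the formula is inferred from the table values, not stated on the slide):

$$\text{bits/op} = \frac{8 \times \text{trace size}}{\text{number of sync op.}}, \qquad \text{fft: } \frac{8 \times 65}{177} \approx 2.94, \qquad \text{radiosity: } \frac{8 \times 24578}{524667} \approx 0.37$$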
Conclusions
• RecPlay is a practical and efficient tool for detecting and removing data races
• RecPlay also makes cyclic debugging possible
• Three types of clocks (scalar, vector and matrix) are used to enable a fast and memory-efficient implementation
• Data races have been found