TRANSCRIPT
Dynamic Data-Race Detection in
Lock-Based Multi-Threaded Programs
Prepared by Eli Pozniansky under Supervision of Prof. Assaf Schuster
2
Table of Contents
What is a Data-Race?
Why Are Data-Races Undesired?
How Can Data-Races Be Prevented?
Can Data-Races Be Easily Detected?
Feasible and Apparent Data-Races
Complexity of Data-Race Detection
Program Execution Model
Complexity of Computing Ordering Relations
Proof of NP/Co-NP Hardness
3
Table of Contents – Cont.

So How Can Data-Races Be Detected?
Lamport's Happens-Before Approximation
Approaches to Detection of Apparent Data-Races:
  Static Methods
  Dynamic Methods:
    Post-Mortem Methods
    On-The-Fly Methods
4
Table of Contents – Cont.

Closer Look at Dynamic Methods: DJIT
  Local Time Frames
  Vector Time Frames
  Predicate for Data-Race Detection
  Which Accesses to Check?
  Which Time Frames to Check?
  Access History
  First Data-Race
  Results
5
Table of Contents – Cont.

Lockset
  Locking Discipline
  The Basic Algorithm
  Improving Locking Discipline
    Initialization
    Read-Sharing
  Refinement for Read-Write Locks
  False Alarms
  Results
Summary
References
6
What is a Data-Race?
A data-race is an anomaly in which two or more threads access a shared variable concurrently, and at least one of the accesses is a write.
Example (variable X is global and shared):
Thread 1        Thread 2
X = 1           T = Y
Z = 2           T = X
7
Why Are Data-Races Undesired?

Programs that contain data-races usually demonstrate unexpected and even non-deterministic behavior.
The outcome might depend on the specific execution order (a.k.a. the threads' interleaving).
Re-running the program may not always produce the same results.
Thus, it is hard to debug and hard to write correct programs.
8
Why Are Data-Races Undesired? – Example

First Interleaving:
  Thread 1        Thread 2
  1. X = 0
  2.              T = X
  3. X++

Second Interleaving:
  Thread 1        Thread 2
  1. X = 0
  2. X++
  3.              T = X
T==0 or T==1?
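The two interleavings above can be enumerated mechanically. The sketch below (illustrative, not from the slides) generates every interleaving that respects each thread's program order, with X initialized to 0, Thread 1 performing X++ and Thread 2 performing T=X, and collects the possible final values of T:

```python
def interleave(t1, t2):
    """Yield all interleavings of two op lists that respect program order."""
    if not t1:
        yield list(t2); return
    if not t2:
        yield list(t1); return
    for rest in interleave(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleave(t1, t2[1:]):
        yield [t2[0]] + rest

results = set()
for order in interleave(["inc"], ["read"]):
    x, t = 0, None              # shared X initialized to 0
    for op in order:
        if op == "inc":
            x += 1              # Thread 1: X++
        else:
            t = x               # Thread 2: T=X
    results.add(t)

print(sorted(results))  # -> [0, 1]: the outcome depends on the interleaving
```

With only two operations there are two interleavings; real programs multiply these possibilities combinatorially.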
9
Execution Order
Each thread has a different execution speed, which may change over time.
For an external observer on the time axis, instruction execution is ordered in an execution order. Any such order is legal. The execution order of a single thread is called its program order.

[Figure: time axis with two threads, T1 and T2, executing their instructions at different speeds]
10
How Can Data-Races Be Prevented? – Explicit Synchronization

Idea: in order to prevent undesired concurrent accesses to shared locations, we must explicitly synchronize between threads.
The means for explicit synchronization are:
  Locks, Mutexes and Critical Sections
  Barriers
  Binary Semaphores and Counting Semaphores
  Monitors
  Single-Writer/Multiple-Readers (SWMR) Locks
  Others
11
Synchronization – "Bad" Bank Account Example

Thread 1:
Deposit( amount ) {
  balance += amount;
}

Thread 2:
Withdraw( amount ) {
  if (balance < amount)
    print( "Error" );
  else
    balance -= amount;
}
‘Deposit’ and ‘Withdraw’ are not “atomic”!!!
What is the final balance after a series of concurrent deposits and withdrawals?
12
Synchronization – "Good" Bank Account Example

Thread 1:
Deposit( amount ) {
  Lock( m );
  balance += amount;
  Unlock( m );
}

Thread 2:
Withdraw( amount ) {
  Lock( m );
  if (balance < amount)
    print( "Error" );
  else
    balance -= amount;
  Unlock( m );
}

Since the critical sections (the regions between Lock and Unlock) can never execute concurrently, this version exhibits no data-races.
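A runnable version of the locked account can be sketched as follows (Python's threading.Lock standing in for Lock(m)/Unlock(m); the Account class and the thread counts are illustrative). With the lock held around each update, no increment is lost regardless of scheduling:

```python
import threading

class Account:
    """Thread-safe account: a lock makes deposit/withdraw atomic."""
    def __init__(self):
        self.balance = 0
        self.m = threading.Lock()

    def deposit(self, amount):
        with self.m:                  # Lock( m ) ... Unlock( m )
            self.balance += amount

    def withdraw(self, amount):
        with self.m:
            if self.balance < amount:
                return False          # "Error"
            self.balance -= amount
            return True

acct = Account()
# 4 threads, each performing 10,000 deposits of 1
threads = [threading.Thread(
               target=lambda: [acct.deposit(1) for _ in range(10_000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acct.balance)  # -> 40000: no update is lost
```

Removing the lock reintroduces the read-modify-write race on balance from the "bad" version.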
13
Is This Enough?

Theoretically – YES. Practically – NO.
What if a programmer accidentally forgets to place the correct synchronization?
How can all such data-race bugs be detected in a large program?
14
Can Data-Races be Easily Detected? – No!
Unfortunately, the problem of deciding whether a given program contains potential data-races is computationally hard!!!
There are a lot of execution orders: for t threads of n instructions each, the number of possible orders is about t^(n·t).
In addition to all the different schedulings, all possible inputs should be tested as well.
To compound the problem, inserting detection code into a program can perturb its execution schedule enough to make all errors disappear.
15
Feasible Data-Races
Feasible Data-Races: races that are based on the possible behavior of the program (i.e. the semantics of the program's computation).
These are the actual (!) data-races that can possibly happen in any specific execution.
Locating feasible data-races requires fully analyzing the program's semantics to determine whether the execution could have allowed a and b (two accesses to the same shared variable) to execute concurrently.
16
Apparent Data-Races
Apparent Data-Races: approximations (!) of feasible data-races, based only on the behavior of the explicit synchronization performed in some feasible execution (and not on the semantics of the program's computation, i.e. ignoring all conditional statements).
Important, since data-races are usually a result of improper synchronization. Apparent races are thus easier to detect, but less accurate.
17
Apparent Data-Races Cont.
For example, a and b, two accesses to the same shared variable in some execution, are said to be ordered if there is a chain of corresponding explicit synchronization events between them.
Similarly, a and b are said to have potentially executed concurrently if no explicit synchronization prevented them from doing so.
18
Feasible vs. Apparent – Example 1

[Initially F = false]

Thread 1:          Thread 2:
X++;               while (F == false) {};
F = true;          X--;

Race 1: F=true (Thread 1) vs. the reads of F in the while loop (Thread 2).
Race 2: X++ (Thread 1) vs. X-- (Thread 2).

Apparent data-races in the execution above – both 1 and 2 (there is no synchronization chain between the racing accesses).
Feasible data-race – 1 only!!! No feasible execution exists in which 'X--' is performed before (or concurrently with) 'X++' (assuming F is false at start).
Note that protecting 'F' alone will protect X as well.
19
Feasible vs. Apparent – Example 2

[Initially F = false]

Thread 1:          Thread 2:
X++;               while( 1 ) {
Lock( m );           Lock( m );
F = true;            if ( F == true ) break;
Unlock( m );         Unlock( m );
                   }
                   X--;

No feasible or apparent data-races exist under any execution order!!!
F is protected by means of a lock. The accesses to X are always ordered and properly synchronized.
20
Complexity of Data-Race Detection
Exactly locating the feasible data-races is an NP-hard problem. Thus, the apparent races, which are simpler to locate, must be detected for debugging.
Fortunately, apparent data-races exist if and only if at least one feasible data-race exists somewhere in the execution.
Yet, the problem of exhaustively locating all apparent data-races still remains NP-hard.
21
Reminder: NP and Co-NP
NP is the class of problems for which: no polynomial-time solution is known, but an exponential-time solution exists.
A problem is NP-hard if there is a polynomial reduction from every problem in NP to it. A problem is NP-complete if, in addition, it resides in NP.
Intuitively, for such 'yes'/'no' problems we can either answer 'yes' and stop, or never stop (at least not in polynomial time).
22
Reminder: NP and Co-NP Cont.
There is also the class of Co-NP problems, which is complementary to the class of NP problems.
For a Co-NP-hard problem with answers 'yes' or 'no', we can only answer 'no'.
(Note: a problem that is in both NP and Co-NP is not thereby known to be in P; whether NP ∩ Co-NP = P is an open question.)
The problem of checking whether a boolean formula is satisfiable is NP-complete (answer 'yes' once a satisfying assignment for the variables is found).
The same problem, but asking whether the formula is not satisfiable, is Co-NP-complete.
23
Why is Data-Race Detection NP-Hard?

How can we know that, in a program P, two accesses a and b to the same shared variable are concurrent?
Intuitively, we must check all execution orders of P and see. If we discover an execution order in which a and b are concurrent, we can report a data-race and stop. Otherwise we should continue checking.
24
Program Execution Model
Consider a class of multi-threaded programs that synchronize by counting semaphores.
Program execution is described by a collection of events and two relations over the events.
Synchronization event – instance of some synchronization operation (e.g. signal, wait).
Computation event – instance of a group of statements in same thread, none of which are synchronization operations (e.g. x=x+1).
25
Program Execution Model – Events' Relations
Temporal ordering relation – a T→ b means that a completes before b begins (i.e. last action of a can affect first action of b).
Shared data dependence relation - a D→ b means that a accesses a shared variable that b later accesses and at least one of the accesses is a modification to variable. Indicates when one event causally affects another.
26
Program Execution Model – Program Execution

Program execution P – a triple <E,T→,D→>, where E is a finite set of events, and T→ and D→ are the above relations, satisfying the following axioms:
  A1: T→ is an irreflexive partial order (a T↛ a).
  A2: If a T→ b T↮ c T→ d then a T→ d.
  A3: If a D→ b then b T↛ a.

Notes: ↛ is shorthand for ¬(a→b); ↮ is shorthand for ¬(a→b)⋀¬(b→a). Notice that A1 and A2 imply the transitivity of the T→ relation.
27
Program Execution Model – Feasible Program Execution
Feasible program execution for P – execution of a program that performs exactly the same events as P, but may exhibit different temporal ordering.
Definition: P’=<E’,T’→,D’→> is a feasible program execution for P=<E,T→,D→> (potentially occurred) if F1: E’=E (i.e. exactly the same events), and F2: P’ satisfies the axioms A1 - A3 of the model, and F3: a D→ b ⇒ a D’→ b (i.e. same data dependencies)
Note: Any execution that exhibits the same shared-data dependencies as P will execute exactly the same events as P.
28
Program Execution Model – Ordering Relations

Given a program execution, P=<E,T→,D→>, and the set, F(P), of feasible program executions for P, the following relations (summarizing the temporal orderings present in the feasible program executions) are defined:

Happened-Before:
  Must-have:  a MHB→ b ⇔ ∀<E,T→,D→>∈F(P), a T→ b
  Could-have: a CHB→ b ⇔ ∃<E,T→,D→>∈F(P), a T→ b

Concurrent-With:
  Must-have:  a MCW↔ b ⇔ ∀<E,T→,D→>∈F(P), a T↮ b
  Could-have: a CCW↔ b ⇔ ∃<E,T→,D→>∈F(P), a T↮ b

Ordered-With:
  Must-have:  a MOW↔ b ⇔ ∀<E,T→,D→>∈F(P), ¬(a T↮ b)
  Could-have: a COW↔ b ⇔ ∃<E,T→,D→>∈F(P), ¬(a T↮ b)
29
Program Execution Model – Ordering Relations – Explanation

The must-have relations describe orderings that are guaranteed to be present in all feasible program executions in F(P).
The could-have relations describe orderings that could potentially occur in at least one of the feasible program executions in F(P).
The happened-before relations relate events that execute in a specific order; the concurrent-with relations relate events that execute concurrently; and the ordered-with relations relate events that execute in either order, but not concurrently.
30
Complexity of Computing Ordering Relations
The problem of computing any of the must-have ordering relations (MHB, MCW, MOW) is Co-NP-hard and the problem of computing any of the could-have relations (CHB, CCW, COW) is NP-hard.
Theorem 1: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a MHB→ b, a MCW↔ b or a MOW↔ b (any of the must-have orderings) is Co-NP-hard.
31
Proof of Theorem 1 – Notes

The proof presented here is only for the must-have-happened-before (MHB) relation. The proofs for the other relations are analogous.
The proof is a reduction from 3CNFSAT such that a given boolean formula is not satisfiable iff a MHB→ b for two events, a and b, defined in the reduction.
The problem of checking whether a 3CNFSAT formula is not satisfiable is Co-NP-complete.
The proof can also be extended to programs that use binary semaphores, event-style synchronization and other synchronization primitives (and even a single counting semaphore).
32
Proof of Theorem 1 – 3CNFSAT

An instance of 3CNFSAT is given by:
  A set of n variables, V={X1,X2,…,Xn}.
  A boolean formula B consisting of a conjunction of m clauses, B=C1⋀C2⋀…⋀Cm.
  Each clause Cj=(L1⋁L2⋁L3) is a disjunction of three literals.
  Each literal Lk is a variable from V or its negation: Lk=Xi or Lk=¬Xi.

Example: B=(X1⋁X2⋁¬X3)⋀(¬X2⋁¬X5⋁X6)⋀(X1⋁X4⋁¬X5)
33
Proof of Theorem 1 –Idea of the Proof
Given an instance of 3CNFSAT formula, B, we construct a program consisting of 3n+3m+2 threads which use 3n+m+1 semaphores (assumed to be initialized to 0).
The execution of this program simulates a nondeterministic evaluation of B.
Semaphores are used to represent the truth values of each variable and clause.
The execution exhibits certain orderings iff B is not satisfiable.
34
Proof of Theorem 1 – The Construction per Variable

For each variable, Xi, the following three threads are constructed:

wait( Ai )        wait( Ai )            signal( Ai )
signal( Xi )      signal( not-Xi )      wait( Pass2 )
 ...               ...                  signal( Ai )
signal( Xi )      signal( not-Xi )

"…" indicates as many signal(Xi) (or signal(not-Xi)) operations as the number of occurrences of the literal Xi (or ¬Xi) in the formula B.
35
Proof of Theorem 1 – The Construction per Variable

The semaphores Xi and not-Xi are used to represent the truth value of variable Xi.
Signaling the semaphore Xi (or not-Xi) represents the assignment of True (or False) to variable Xi.
The assignment is accomplished by allowing either signal(Xi) or signal(not-Xi) to proceed, but not both (due to the concurrent wait(Ai) operations in the two leftmost threads).
36
Proof of Theorem 1 – The Construction per Clause

For each clause, Cj, the following three threads are constructed:

wait( L1 )      wait( L2 )      wait( L3 )
signal( Cj )    signal( Cj )    signal( Cj )

L1, L2 and L3 are the semaphores corresponding to the literals in clause Cj (i.e. Xi or not-Xi).
The semaphore Cj represents the truth value of clause Cj. It is signaled iff the truth assignments to the variables cause the clause Cj to evaluate to True.
37
Proof of Theorem 1 – Explanation of Construction

The first 3n threads operate in two phases:
  The first pass is a non-deterministic guessing phase in which each variable used in the boolean formula B is assigned a unique truth value. Only one of the Xi and not-Xi semaphores is signaled.
  The second pass, which begins after the semaphore Pass2 is signaled, is used to ensure that the program doesn't deadlock – the semaphore operations that were not allowed to execute during the first pass are allowed to proceed.
38
Proof of Theorem 1 – The Final Construction

Two additional threads are created:

wait( C1 )        a: skip
 ...              signal( Pass2 )
wait( Cm )         ...
b: skip           signal( Pass2 )

There are m 'wait(Cj)' operations – one for each clause – and n 'signal(Pass2)' operations – one for each variable.
39
Proof of Theorem 1 – Putting It All Together
Event b is reached only after semaphore Cj, for each clause j, has been signaled.
Since the program contains no conditional statements or shared variables, every execution of the program executes the same events and exhibits the same shared-data dependencies (i.e. none).
Claim: For any execution a MHB→ b iff B is not satisfiable.
40
Proof of Theorem 1 – Proving the "if" Part

Assume that B is not satisfiable.
Then there is always some clause, Cj, that is not satisfied by the truth values guessed during the first pass. Thus, no signal(Cj) operation is performed during the first pass.
Event b can't execute until this signal(Cj) operation is performed, which can then only be done during the second pass.
The second pass doesn't occur until after event a executes, so event a must precede event b.
Therefore, a MHB→ b.
41
Proof of Theorem 1 – Proving the "only if" Part

Assume that a MHB→ b. This means that there is no execution in which b either precedes a or executes concurrently with a.
Assume, by way of contradiction, that B is satisfiable.
Then some truth assignment can be guessed during the first pass that satisfies all of the clauses.
Event b can then execute before event a, contradicting the assumption.
Therefore, B is not satisfiable.
42
Complexity of Computing Ordering Relations – Cont.
Since a MHB→ b iff B is not satisfiable, the problem of deciding a MHB→ b is Co-NP-hard.
By similar reductions, programs can be constructed such that the non-satisfiability of B can be determined from the MCW or MOW relations. The problem of deciding these relations is therefore also Co-NP-hard.
Theorem 2: Given a program execution, P=<E,T→,D→>, that uses counting semaphores, the problem of deciding whether a CHB→ b, a CCW↔ b or a COW↔ b (any of the could-have orderings) is NP-hard.
Proof by similar reductions …
43
Complexity of Race Detection – Conditions, Loops and Input

The presented model is too simplistic. What if conditional statements, like "if" and "while", are used? What if input from the user is allowed?

Thread 1:     Thread 2:
X++;          Y = ReadFromInput( );
              while ( Y < 0 ) Print( Y );
              X--;

If Y≥0 there is a data-race on X. Otherwise it is not possible, since 'X--' is never reached.
44
Complexity of Race Detection – "NP-Harder"?

The proof above does not use conditional statements, loops or input from outside.
This suggests that the problem of data-race detection may be even harder than deciding an NP-complete problem.
With loops and recursion, we do not know whether potentially concurrent accesses will indeed be executed, so the question becomes equivalent to the halting problem.
Thus, in the general case, race detection is undecidable.
45
So How Can Data-Races Be Detected? – Approximations

Since it is an intractable problem to decide whether a CHB→ b or a CCW↔ b (as needed to detect feasible data-races), the temporal ordering relation T→ is approximated, and apparent data-races are located instead.
Recall that apparent data-races exist if and only if at least one feasible race exists.
Yet, it remains a hard problem to locate all apparent data-races.
46
Approximation Example – Lamport's Happens-Before

The happens-before partial order, denoted hb→, is defined over the access events (reads, writes, releases and acquires) that happen in a specific execution, as follows:
  Program Order: if a and b are events performed by the same thread, with a preceding b in program order, then a hb→ b.
  Release and Acquire: let a be a release and b be an acquire. If a and b take part in the same synchronization event, then a hb→ b.
  Transitivity: if a hb→ b and b hb→ c, then a hb→ c.

Shared accesses a and b are concurrent (denoted a hb↮ b) if neither a hb→ b nor b hb→ a holds.
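A minimal sketch of this definition (the helper names and event labels are illustrative): build a graph with program-order edges and release→acquire edges; transitivity then becomes reachability, and two accesses are concurrent when neither reaches the other:

```python
from collections import defaultdict

def happens_before(events, sync_pairs):
    """events: list of (thread_id, label) in execution order.
    sync_pairs: (release_label, acquire_label) edges.
    Returns a predicate hb(a, b) computed by graph reachability."""
    succ = defaultdict(set)
    last = {}                            # last event seen per thread
    for tid, label in events:
        if tid in last:
            succ[last[tid]].add(label)   # program-order edge
        last[tid] = label
    for rel, acq in sync_pairs:
        succ[rel].add(acq)               # release -> acquire edge

    def hb(a, b):                        # transitivity = reachability
        stack, seen = [a], set()
        while stack:
            x = stack.pop()
            if x == b:
                return True
            if x in seen:
                continue
            seen.add(x)
            stack.extend(succ[x])
        return False
    return hb

# A write ordered before another by a release/acquire on lock m:
ev = [(1, "X++"), (1, "rel(m)"), (2, "acq(m)"), (2, "X--")]
hb = happens_before(ev, [("rel(m)", "acq(m)")])
print(hb("X++", "X--"), hb("X--", "X++"))  # -> True False: ordered, no race
```

Two accesses a and b would be reported concurrent exactly when both hb(a, b) and hb(b, a) are False.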
47
Approaches to Detection of Apparent Data-Races – Static

There are two main approaches to the detection of apparent data-races (sometimes a combination of both is used):

Static Methods – perform a compile-time analysis of the code.
  – Too conservative: they can't know or understand the semantics of the program, and so produce an excessive number of false alarms that hide the real data-races.
  + Test the program globally: they see the full code of the tested program and can warn about all possible errors in all possible executions.
48
Approaches to Detection of Apparent Data-Races – Dynamic

Dynamic Methods – use a tracing mechanism to detect whether a particular execution of a program actually exhibited data-races.
  + Detect only those apparent data-races that occur during a feasible execution.
  – Test the program locally: they consider only one specific execution path of the program each time.

Post-Mortem Methods – after the execution terminates, analyze the trace of the run and warn about possible data-races that were found.
On-The-Fly Methods – buffer partial trace information in memory, analyze it and detect races as they occur.
49
Approaches to Detection of Apparent Data-Races

No "silver bullet" exists.
Accuracy is of great importance (especially in large programs).
Yet, there is always a tradeoff between the number of false negatives (undetected races) and false positives (false alarms).
The space and time overheads imposed by the techniques are significant as well.
50
Closer Look at Dynamic Methods

We will see two dynamic methods for on-the-fly detection of apparent data-races in lock-based multi-threaded programs:
  DJIT – based on Lamport's happens-before partial order relation and Mattern's virtual time (vector clocks). Implemented in the Millipede and Multipage systems.
  Lockset – based on locking discipline and lockset refinement. Implemented in the Eraser tool.
51
DJIT (1) – Description

Detects the first apparent data-race in a program when it actually occurs.
It is enough to announce only the very first data-race, since later races can be after-effects of the first one.
After the race (or its cause) is fixed, the search for other races can proceed.
The main disadvantage of the technique is that it is highly dependent on the scheduling order.
52
DJIT (2) – Logical Token

Observation – each synchronization event involves some logical token.
The token is released by one set of threads that reach a certain point in their execution, and is acquired by another set of threads.
Once all the members of the corresponding releasing set have released their tokens, the members of the acquiring set are allowed to proceed with their execution.
53
DJIT (3) – Local Time Frames

The execution of each thread is split into a sequence of time frames.
A new time frame starts on each release.
Note that, according to the above observation concerning logical tokens: Lock ≈ Acquire (acq), Unlock ≈ Release (rel).

Thread          TF
X = 1           1
Lock( m1 )      1
Z = 2           1
Lock( m2 )      1
Y = 3           1
Unlock( m2 )    1
Z = 4           2
Unlock( m1 )    2
X = 5           3
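The time-frame rule in the table above can be sketched as a counter that advances after every release (the class and operation names are illustrative):

```python
class ThreadClock:
    """Local time frame: increments after every release (unlock)."""
    def __init__(self):
        self.tf = 1

    def release(self):
        self.tf += 1          # a new time frame starts on each release

c = ThreadClock()
frames = []
for op in ["X=1", "Lock(m1)", "Z=2", "Lock(m2)", "Y=3",
           "Unlock(m2)", "Z=4", "Unlock(m1)", "X=5"]:
    frames.append((op, c.tf))  # the op executes in the current frame
    if op.startswith("Unlock"):
        c.release()            # Unlock is a release: advance the frame

print(frames[-1])  # -> ('X=5', 3), as in the table
```

Acquires (Lock) do not advance the local frame; only releases do.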
54
DJIT(4) Local Time Frames
Claim 1: Let a in thread ta and b in thread tb be two accesses, where a occurs at time frame Ta, and the release in ta, corresponding to the latest acquire in tb which precedes b, occurs at time frame Tsync in ta. Then a hb→ b iff Ta < Tsync.
[Figure: a possible sequence of release-acquire pairs; access a occurs in thread ta at time frame Ta, and the release in ta corresponding to tb's latest acquire preceding b occurs at time frame Tsync (Trelease ≤ Tsync)]
55
DJIT (5) – Local Time Frames

Proof:
- If Ta < Tsync then a hb→ release, and since release hb→ acquire and acquire hb→ b, we get a hb→ b.
- If a hb→ b then, since a and b are in distinct threads, by definition there exists a pair of corresponding release and acquire such that a hb→ release and acquire hb→ b. It follows that Ta < Trelease ≤ Tsync.
56
DJIT (6) – Vector Time Frames (VTF)

For each thread t a vector stt[.] exists, whose size is the maximum number of threads (maxthreads).
stt[t] is the local time frame of thread t. It actually holds the number of 'releases' made by thread t.
stt[u] stores the latest local time frame of u whose release is known by t (to have happened before t's latest acquire).
If u is an acquirer of t's release, then u's vector is updated in the following way:

for k = 0 to maxthreads – 1:
    stu[k] = max( stu[k], stt[k] )
57
DJIT (7) – Vector Time Frames

In such a way, the vector of u is notified of:
  The latest time frame of t.
  The latest time frames of other threads, according to the knowledge of t.
Note that a thread can learn about a release performed by another thread through "gossip", when this information is transferred through a chain of corresponding release-acquire pairs.
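The max-merge update and the "gossip" propagation can be sketched as follows (class and method names are illustrative; here the releaser increments its own entry before publishing its vector, so that stt[t] reflects the frame the releaser has entered):

```python
MAXTHREADS = 3

class VTFThread:
    """One thread's vector of time frames: st[t] = latest known frame of t."""
    def __init__(self, tid):
        self.tid = tid
        self.st = [1] * MAXTHREADS    # all threads start in frame 1

    def release(self):
        """Release: enter a new local frame, then publish the vector."""
        self.st[self.tid] += 1
        return list(self.st)

    def acquire(self, snapshot):
        """Acquire: merge the releaser's knowledge (element-wise max)."""
        for k in range(MAXTHREADS):
            self.st[k] = max(self.st[k], snapshot[k])

t1, t2, t3 = VTFThread(0), VTFThread(1), VTFThread(2)
m1 = t1.release()    # t1: release( m1 )
t2.acquire(m1)       # t2: acquire( m1 ) -> learns t1 advanced
m2 = t2.release()    # t2: release( m2 )
t3.acquire(m2)       # t3: acquire( m2 ) -> learns about t1 via "gossip"
print(t1.st, t2.st, t3.st)  # -> [2, 1, 1] [2, 2, 1] [2, 2, 1]
```

Thread 3 never synchronized with thread 1 directly, yet its vector records thread 1's frame, carried through the release-acquire chain via thread 2.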
58
DJIT (8) – Vector Time Frames: Example

Thread 1           Thread 2           Thread 3
(1 1 1)            (1 1 1)            (1 1 1)
write X
release( m1 )                         read Z
(2 1 1)
                   acquire( m1 )
                   (2 1 1)
                   read Y
                   release( m2 )
write X            (2 2 1)
                                      acquire( m2 )
                                      (2 2 1)
                                      write X
59
DJIT(9) Vector Time Frames
Claim 2: Let a and b be two accesses in respective threads ta and tb, which happened during respective local time frames Ta and Tb. Let f denote the value of sttb[ta] at the time when b occurs. Then a hb→ b iff Ta < f.
[Figure: a chain of release-acquire pairs through an intermediate thread tc carries ta's local time frame (Ta, when a occurs) to tb before b occurs at time frame Tb]
60
DJIT (10) – Vector Time Frames

Proof:
- If a hb→ b then, since a and b are in distinct threads, there exists a chain of releases and corresponding acquires, with the first release in ta and the last acquire in tb, such that a hb→ first release and last acquire hb→ b. The information on ta's local time frame is transferred through that chain, reaches tb and is stored in sttb[ta] (= f). It follows that Ta < Tfirst release ≤ f.
- If Ta < f then there is a sequence of corresponding release-acquire pairs which transfers the local time frame from ta to tb, finally resulting in tb "hearing" that ta entered a time frame later than Ta. This same sequence can be used to transitively apply the hb→ relation from a to b.
61
DJIT (11) – Sequential Consistency

The proposed algorithm assumes a sequential consistency (SC) memory model, which is common in multi-threaded environments.
This means that there exists a global order, R, on all the events in the execution, where R conforms with the view of all processes, and all reads see the most recently written values.
The definition of the hb→ partial order is consistent with R, in the sense that if a hb→ b then a precedes b in R (otherwise an acquire could precede its corresponding release in the global order, contradicting the view of the acquirer).
62
DJIT (12) – Data-Race Detection Using VTF

Theorem 1: Let a and b be two accesses to the same shared variable in respective threads ta and tb during respective local time frames Ta and Tb. Suppose that at least one of a or b is a write. Assume that a is performed in the global order R prior to b and that it doesn't constitute a data-race with any of the preceding accesses in R. Then a and b form a data-race iff at the time when b occurs it holds that sttb[ta] ≤ Ta.
63
DJIT (13) – Data-Race Detection Using VTF

Proof:
- If sttb[ta] ≤ Ta then, by Claim 2, a hb→ b doesn't hold. Since a precedes b in R, it cannot hold that b hb→ a. Thus a and b are concurrent and form a data-race (since at least one of them is a write).
- If a and b form a data-race then a hb→ b doesn't hold. Thus, by Claim 2, sttb[ta] ≤ Ta.
64
DJIT (14) – Predicate for Data-Race Detection

The algorithmic aspect of Theorem 1 is encapsulated in the following predicate P:

P(a,b) ≜ ( a.type = write ⋁ b.type = write ) ⋀ ( a.time_frame ≥ st_{b.thread_id}[a.thread_id] )

P gets two accesses, a and b, to the same shared variable, where a occurred earlier (according to the global order R) and b has just been performed.
P returns True iff a and b form a data-race.
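Predicate P translates directly into code; a sketch (the field and variable names are illustrative):

```python
def P(a, b, st):
    """Predicate P(a, b): True iff accesses a and b form a data-race.
    a, b: dicts with 'type', 'thread_id', 'time_frame'; a precedes b in R.
    st[t][u]: thread t's vector of known time frames of thread u."""
    return ((a["type"] == "write" or b["type"] == "write")
            and a["time_frame"] >= st[b["thread_id"]][a["thread_id"]])

# Thread 0 wrote X in frame 1; thread 1 now writes X.
a = {"type": "write", "thread_id": 0, "time_frame": 1}
b = {"type": "write", "thread_id": 1, "time_frame": 1}

st_sync = {1: {0: 2}}    # thread 1 heard, via release/acquire, that
                         # thread 0 moved past frame 1
print(P(a, b, st_sync))  # -> False: a's frame 1 < known frame 2, so a hb-> b

st_unsync = {1: {0: 1}}  # no synchronization chain: still the initial value
print(P(a, b, st_unsync))  # -> True: concurrent conflicting writes
```

Had both accesses been reads, P would return False regardless of the vectors, since two reads never race.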
65
DJIT (15) – Which Accesses to Check?

We have assumed that there is a logging mechanism which records all accesses.
Logging all accesses in all threads and testing the predicate P for each pair of them would impose a great overhead on the system.
Actually, some of the accesses can be discarded.
66
DJIT (16) – Which Accesses to Check?

Claim 3: Consider an access a in thread ta during time frame Ta, and accesses b and c in thread tb=tc during time frame Tb=Tc. Assume that c precedes b in the program order. If a and b are concurrent, then a and c are concurrent as well.

[Figure: two scenarios placing access a in ta before or after accesses c and b, which share the same time frame Tb=Tc in tb]
67
DJIT (17) – Which Accesses to Check?

Proof:
- Let fb and fc denote the respective values of sttb[ta] when b and c happen. Since sttb[ta] is monotonically increasing, and c precedes b, we know that fb ≥ fc. Since a hb→ b does not hold, we know by Claim 2 that Ta ≥ fb. Thus, Ta ≥ fc, and again by Claim 2 we get that a hb→ c is false.
- Let fa denote the value of stta[tb] when a happens. Since b hb→ a does not hold, we know by Claim 2 that Tb ≥ fa. Since Tb=Tc we get that Tc ≥ fa. Thus, by Claim 2, c hb→ a is false.
68
DJIT (18) – Which Accesses to Check?
Recall that we are interested in recording only the first apparent data race which occurs during the execution.
Claim 3 implies that for this purpose, it is sufficient to record only the first read access and the first write access to a variable in each time frame.
In addition it’s sufficient to apply the predicate P to pairs of accesses which are the first in their respective time frames.
69
DJIT (19) – Which Accesses to Check?

Thread 1           Thread 2
acquire( m )
write X    !!!
read X     !!!
write X
release( m )
                   read X    !!!  <- DR
acquire( m )
write X    !!!
write X
release( m )
                   acquire( m )
                   read X    !!!
                   write X
                   write X
                   release( m )

Only the accesses marked with '!!!' (the first read and the first write of each time frame) are checked. Thread 2's unprotected read of X forms a data-race (DR) with Thread 1's writes.
70
DJIT (20) – Which Time Frames to Check?

Assume that in thread ta an access a occurs, and that thread tb=tc performed a previous (according to the global order R) access b in time frame Tb and another previous access c in time frame Tc, so that Tb < Tc.

[Figure: thread tb performs b at time frame Tb, then a release, and later c at time frame Tc; thread ta then performs a at time frame Ta]
71
DJIT (21) – Which Time Frames to Check?

We want to find only the very first data-race, when it actually occurs (assuming that all previous accesses didn't form a data-race).

Claim 4: If a is concurrent with b, then it is certainly concurrent with c.
Proof: Easy, since Tc > Tb ≥ stta[tb] = stta[tc].

Thus, either pair (a-b or a-c) can be considered to be the first apparent data-race (since there were no races until a occurred).
This also means that if there is no race between a and c, then there is also no data-race between a and b. Therefore, the pair a-b need not be checked.
72
DJIT (22) – Which Time Frames to Check?

We want to support the common SWMR (Single-Writer/Multiple-Readers) semantics, allowing concurrent reads but not concurrent writes.
Thus, developing the observation above, we need to check a current write access to a shared variable v against the last time frame in each of the other threads which recently read from v, and against the last time frame of the thread which recently wrote to v.
For a current read access to v, it is enough to check against the last time frame of the thread which recently wrote to v.
73
DJIT (23) – Which Time Frames to Check?

More formally, let a be a current access to a shared variable v in thread ta:
  If there was a prior write to v in ta, and since that write there were no accesses to v in other threads, then there is no need to check anything.
  If there was a prior write to v in another thread tb (according to the global order R), and since that write there were no accesses to v in threads other than ta and tb, then it is sufficient to check a only against the latest access to v in tb (since otherwise we would have found the race earlier, by Claims 3 & 4).
74
DJIT (24) – Which Time Frames to Check?

If there were prior reads from v in other threads t1, t2,…,tk (according to the global order R), and a is a write, then a should be checked against each of the most recent reads in t1, t2,…,tk.
If a is a read, then it should be checked against the most recent write to v (according to R).
75
DJIT (25) – Access History

Applying the above observations (concerning which accesses and which time frames to check), it is easy to see that the cost of checking whether a given access races with previous accesses is small.
For each variable v, we keep, per thread, the last time frame in which that thread read from v, together with the last time frame in which any of the threads wrote to v. The IDs of the accessing threads are saved as well.
76
DJIT (26) – Access History

On each first read and first write to v in a time frame, the thread updates the access history of v.
If the access to variable v is a read, the thread checks the most recent write to v.
If the access is a write, the thread checks all reads from v by other threads, as well as the most recent write to v.

[Figure: access history of v – tf1/id1, tf2/id2, …, tfn/idn: time frames of the recent reads from v, one per thread; tfk/idk: time frame of the recent write to v]
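The access history and its checks can be sketched as follows (class and field names are illustrative; "frame ≥ known frame ⇒ concurrent" is the condition from predicate P):

```python
class AccessHistory:
    """Per-variable access history (sketch): last read frame per thread,
    plus the single most recent write as (frame, thread)."""
    def __init__(self):
        self.reads = {}          # thread_id -> time frame of its last read
        self.write = None        # (time_frame, thread_id) of the last write

    def check(self, tid, tf, is_write, st):
        """Report races for access (tid, tf), then record it.
        st[t][u] = frame of u known to t; frame >= known => concurrent."""
        found = []

        def concurrent(prev):
            ptf, ptid = prev
            return ptid != tid and ptf >= st[tid].get(ptid, 1)

        if self.write and concurrent(self.write):
            found.append(self.write)          # races with the recent write
        if is_write:
            for ptid, ptf in self.reads.items():
                if concurrent((ptf, ptid)):
                    found.append((ptf, ptid))  # a write also races with reads
            self.write = (tf, tid)
        else:
            self.reads[tid] = tf
        return found

# Thread 0 writes X in frame 1; thread 1 then reads X without any
# intervening acquire, so its vector still holds the initial value 1.
h = AccessHistory()
st = {0: {}, 1: {0: 1}}
print(h.check(0, 1, True, st))    # -> []: first access, nothing to race with
print(h.check(1, 1, False, st))   # -> [(1, 0)]: unsynchronized read races
```

The history stores only one recent write, which is exactly why only the first race per variable is guaranteed to be caught.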
77
DJIT (27) – Coherency

Actually, the presented algorithm relies only on coherency guarantees.
Coherency means that for each variable v there is a global order, Rv, on all the operations performed on it.
Hence, the algorithm described above is correct also for coherent systems which are not necessarily sequentially consistent.
In fact, the algorithm may also be applied to systems with even more relaxed consistency (a.k.a. weakly ordered systems).

Thread 1          Thread 2
write v1, 1
write v2, 2
                  read v2, 2
                  read v1, 0

The history above is coherent, but not sequentially consistent.
78
DJIT (28) – "First" Apparent Data-Race

Note that if a and b race with each other, then a might also race with accesses that occurred in tb prior to b (as shown in the example of Claim 4).
It is impossible to find these data-races before a occurs.
By the definitions, although the corresponding accesses in tb precede b, their races with a occur simultaneously with the race of b and a, and thus are not considered "earlier".
The definitions can be refined, defining the first apparent data-race to be the first access in tb with which a apparently races.
This would clearly require a bigger access history.
79
DJIT(29)Why Only “First Data-Race”?
Where in the proofs we used the fact that there were no prior data-races?
Consider the following example:
Since the access history for eachvariable consists of only onerecent write, the data-race [1]-[3] is not detected (though the accesses are concurrent).
This is due to a prior race [1]-[2] and the fact that [2] and [3] are in the same thread.
Hint: In order to locate more than only first data-race for each variable, the write history should contain last time frames of all other threads (and not only the most recent).
Thread 1          Thread 2
write X   [1]
                  write X   [2]
                  release(m)
                  write X   [3]
80
DJIT (30) More Than One Data-Race
Actually, DJIT can be extended to detect more than one data-race in a program.
Still, there are some good reasons for not doing so:
 Later data-races can be after-effects of the first one (the program “goes crazy” after the first race).
 Only the first data-race is guaranteed to be feasible (though it’s not necessarily a crucial bug).
 Later races can be apparent and hence irrelevant:
Thread 1          Thread 2
X = 1;    [1]
F = true;
                  while( !F );
                  X = 2;    [2]

There is only one feasible data-race – on F (it is false at start).
Thus, if we announce all possible races, false alarms are inevitable.
81
DJIT (31) Results
The DJIT algorithm was implemented in several academic systems – Millipede and Multipage.
+ Currently DJIT detects the very first apparent data-race. After the race (or its cause) is fixed (or marked to ignore), the search for other races can proceed. The extended version of DJIT can detect all races that appear during the execution.
– Very sensitive to differences in threads’ interleaving. Thus it’s recommended to apply the algorithm every time the program executes (and not only in debug mode).
– Still requires an enormous number of runs to ensure that the tested program is race-free, yet it cannot prove it.
82
Lockset (1) Locking Discipline
A locking discipline is a programming policy that ensures the absence of data-races.
A simple, yet common locking discipline is to require that every shared variable is protected by a mutual-exclusion lock.
The Lockset algorithm detects violations of locking discipline.
The main drawback is a possibly excessive number of false alarms.
83
Lockset (2) What is the Difference?
[1] hb→ [2], yet there is a feasible data-race under a different scheduling:

Thread 1            Thread 2
Y = Y + 1;  [1]
Lock( m );
V = V + 1;
Unlock( m );
                    Lock( m );
                    V = V + 1;
                    Unlock( m );
                    Y = Y + 1;  [2]

There is no locking discipline on Y, yet [1] and [2] are ordered under all possible schedulings:

Thread 1            Thread 2
Y = Y + 1;  [1]
Lock( m );
Flag = true;
Unlock( m );
                    Lock( m );
                    T = Flag;
                    Unlock( m );
                    if ( T == true )
                        Y = Y + 1;  [2]
84
Lockset (3) The Basic Algorithm
For each shared variable v, let C(v) be the set of locks that have protected v in the computation so far.
Let locks_held(t) at any moment be the set of locks held by thread t at that moment.
The Lockset algorithm:
- for each v, initialize C(v) to the set of all possible locks
- on each access to v by thread t:
  - C(v) ← C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning
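A minimal runnable sketch of this refinement loop (illustrative, not the Eraser implementation; the universe of locks is fixed up front for simplicity):

```python
# Minimal sketch of the basic Lockset algorithm (illustrative names).

ALL_LOCKS = {"m1", "m2"}      # universe of locks, fixed for this sketch

C = {}                        # candidate lock set C(v), per variable
warnings = []

def access(v, locks_held):
    """Record an access to v by a thread currently holding locks_held."""
    C.setdefault(v, set(ALL_LOCKS))
    C[v] &= locks_held        # lockset refinement
    if not C[v]:
        warnings.append(v)    # no lock consistently protects v

# Replay the pattern from the slides: v is accessed once under m1
# and once under m2, so no single lock protects it consistently.
access("v", {"m1"})           # C(v) = {m1}
access("v", {"m2"})           # C(v) = {} -> warning
```

After the second access the candidate set is empty and a warning is issued, exactly the lockset-refinement behavior described above.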
85
Lockset (4)Explanation
Clearly, a lock m is in C(v) if, in the execution up to that point, every thread that accessed v was holding m at the moment of access.
The process, called lockset refinement, ensures that any lock that consistently protects v is contained in C(v).
If some lock m consistently protects v, it will remain in C(v) till the termination of the program.
86
Lockset (5) Example
The locking discipline for v is violated since no lock protects it consistently.

Program             locks_held      C(v)
                    { }             {m1, m2}
Lock( m1 );         {m1}
v = v + 1;                          {m1}
Unlock( m1 );       { }
Lock( m2 );         {m2}
v = v + 1;                          { }  – warning
Unlock( m2 );       { }
87
Lockset (6) Improving the Locking Discipline
The locking discipline described above is too strict. There are three very common programming practices that violate the discipline, yet are free from any data-races:
 Initialization: Shared variables are usually initialized without holding any locks.
 Read-Shared Data: Some shared variables are written during initialization only and are read-only thereafter.
 Read-Write Locks: Read-write locks allow multiple readers to access a shared variable, but allow only a single writer to do so.
88
Lockset (7) Initialization
When initializing newly allocated data there is no need to lock it, since other threads cannot hold a reference to it yet.
Unfortunately, there is no easy way of knowing when initialization is complete.
Therefore, a shared variable is considered initialized when it is first accessed by a second thread.
As long as a variable is accessed by a single thread, reads and writes don’t update C(v).
89
Lockset (8) Read-Shared Data
There is no need to protect a variable if it’s read-only.
To support unlocked read-sharing, races are reported only after an initialized variable has become write-shared by more than one thread.
90
Lockset (9) Initialization and Read-Sharing
Newly allocated variables begin in the Virgin state. As various threads read and write the variable, its state changes according to the transitions below.
Races are reported only for variables in the Shared-Modified state.
The algorithm becomes more dependent on the scheduler.

[State transition diagram:
 Virgin –(wr by first thr)→ Exclusive
 Exclusive –(rd/wr by first thr)→ Exclusive
 Exclusive –(rd by new thr)→ Shared
 Exclusive –(wr by new thr)→ Shared-Modified
 Shared –(rd by any thr)→ Shared
 Shared –(wr by any thr)→ Shared-Modified]
91
Lockset (10) Initialization and Read-Sharing
The states are:
 Virgin – Indicates that the data is new and has not been referenced by any other thread.
 Exclusive – Entered after the data is first accessed (by a single thread). Subsequent accesses by that thread don’t update C(v) (handles initialization).
 Shared – Entered after a read access by a new thread. C(v) is updated, but data-races are not reported. In this way, multiple threads can read the variable without causing a race to be reported (handles read-sharing).
 Shared-Modified – Entered when more than one thread accesses the variable and at least one access is for writing. C(v) is updated and races are reported as in the original algorithm.
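A minimal runnable sketch of these state transitions (the encoding and names are illustrative; the lockset refinement itself is omitted, and, following the state descriptions above, a Virgin variable enters Exclusive on its first access of either kind):

```python
# Sketch of the per-variable state machine:
# Virgin / Exclusive / Shared / Shared-Modified (illustrative encoding).

class VarState:
    def __init__(self):
        self.state = "Virgin"
        self.owner = None                     # the first accessing thread

    def on_access(self, tid, is_write):
        if self.state == "Virgin":            # first access ever
            self.state, self.owner = "Exclusive", tid
        elif self.state == "Exclusive" and tid != self.owner:
            # a second thread arrived: read-sharing or write-sharing
            self.state = "Shared-Modified" if is_write else "Shared"
        elif self.state == "Shared" and is_write:
            self.state = "Shared-Modified"    # races reported from here on
        return self.state
```

In the full algorithm, C(v) would be refined only in the Shared and Shared-Modified states, and warnings issued only in Shared-Modified.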
92
Lockset (11) Read-Write Locks
Many programs use Single Writer/Multiple Readers (SWMR) locks as well as simple locks.
The basic algorithm doesn’t correctly support this style of synchronization.
Definition: For a variable v, some lock m protects v if m is held in write mode for every write of v, and m is held in some mode (read or write) for every read of v.
93
Lockset (12) Read-Write Locks – Final Refinement
When the variable enters the Shared-Modified state, the checking is different:
Let locks_held(t) be the set of locks held in any mode by thread t.
Let write_locks_held(t) be the set of locks held in write mode by thread t.
94
Lockset (13) Read-Write Locks – Final Refinement
The refined algorithm (for Shared-Modified):
- for each v, initialize C(v) to the set of all locks
- on each read of v by thread t:
  - C(v) ← C(v) ∩ locks_held(t)
  - if C(v) = ∅, issue a warning
- on each write of v by thread t:
  - C(v) ← C(v) ∩ write_locks_held(t)
  - if C(v) = ∅, issue a warning
Since locks held purely in read mode don’t protect against data-races between the writer and other readers, they are not considered when a write occurs and are thus removed from C(v).
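A runnable sketch of this Shared-Modified checking for a single variable (illustrative; the two lock sets are passed in explicitly by the instrumentation):

```python
# Sketch of the read-write refinement for one variable in the
# Shared-Modified state (illustrative names).

ALL_LOCKS = {"rw"}           # universe of locks for this sketch

C = set(ALL_LOCKS)           # candidate set C(v)
warnings = []

def on_read(locks_held):
    """locks_held: locks held in ANY mode (read or write)."""
    global C
    C &= locks_held          # any mode protects a read
    if not C:
        warnings.append("read race")

def on_write(write_locks_held):
    """write_locks_held: locks held in WRITE mode only."""
    global C
    C &= write_locks_held    # only write-mode locks protect a write
    if not C:
        warnings.append("write race")

# A reader holding "rw" in read mode is fine...
on_read({"rw"})              # C stays {"rw"}
# ...but a writer holding "rw" only in read mode triggers a warning,
# since "rw" is absent from its write_locks_held set.
on_write(set())              # -> warning
```

This reproduces the point above: a lock held purely in read mode survives reads but is removed from C(v) on the first write performed without holding it in write mode.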
95
Lockset (14) Still False Alarms
The refined algorithm will still produce a false alarm in the following simple case:

Thread 1            Thread 2                    C(v)
                    Lock( m1 ); Lock( m2 );
                    v = v + 1;                  {m1, m2}
                    Unlock( m2 ); Unlock( m1 );
Lock( m1 );
v = v + 1;                                      {m1}
Unlock( m1 );
Lock( m2 );
v = v + 1;                                      { }  – false alarm
Unlock( m2 );
96
Lockset (15) Additional False Alarms
Additional possible false alarms are:
 A queue that implicitly protects its elements by accessing the queue through locked head and tail fields.
 A thread that passes arguments to a worker thread. Since the main thread and the worker thread never access the arguments concurrently, they do not use any locks to serialize their accesses.
 Privately implemented SWMR locks, which don’t communicate with Lockset.
 True data-races that don’t affect the correctness of the program (for example, “benign” races):

if ( f == 0 ) {
    lock( m );
    if ( f == 0 )
        f = 1;
    unlock( m );
}
97
Lockset (16) Results
Lockset was implemented in a full scale testing tool, called Eraser, which is used in industry (not “on paper only”).
+ Eraser was found to be quite insensitive to differences in threads’ interleaving (if applied to programs that are “deterministic enough”).
– Since a superset of apparent data-races is located, false alarms are inevitable.
– Still requires an enormous number of runs to ensure that the tested program is race-free, yet it cannot prove it.
– The measured slowdowns are by a factor of 10 to 30.
98
Dynamic Data-Race Detection Summary
There is no single, better solution.
 DJIT notifies on one apparent data-race, which is the very first in the execution.
 Lockset notifies on a bunch of apparent data-races, some or even all of which are false alarms.
 Maybe combine both techniques? Maybe combine with other known techniques? Maybe combine with some static analysis? Maybe better approximations can be found...?
99
Dynamic Data-Race Detection Summary – Cont.
The solutions are not universal.
 The data-races that are found are apparent, and not necessarily feasible.
 A large number of runs is still required to check as many execution paths as possible.
 Since slowdowns can be large, satisfactory testing can take months.
 Different (or new) types of synchronization require different detection techniques.
 Inserting detection code into a program can perturb the threads’ interleaving so that races disappear (Lockset is less sensitive to this).
100
References
S. Adve, M. Hill and R. Netzer. Detecting Data Races on Weak Memory Systems. In Proceedings of the 18th Annual International Symposium on Computer Architecture, pp. 234-243, May 1991.
A. Itzkovitz, A. Schuster, and O. Zeev-Ben-Mordechai. Towards Integration of Data Race Detection in DSM System. In The Journal of Parallel and Distributed Computing (JPDC), 59(2): pp. 180-203, Nov. 1999.
L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. In Communications of the ACM, 21(7): pp. 558-565, Jul. 1978.
F. Mattern. Virtual Time and Global States of Distributed Systems. In Parallel & Distributed Algorithms, pp. 215-226, 1989.
101
ReferencesCont.
R. H. B. Netzer and B. P. Miller. What Are Race Conditions? Some Issues and Formalizations. In ACM Letters on Programming Languages and Systems, 1(1): pp. 74-88, Mar. 1992.
R. H. B. Netzer and B. P. Miller. On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions. In 1990 International Conference on Parallel Processing, 2: pp. 93-97, Aug. 1990.
R. H. B. Netzer and B. P. Miller. Detecting Data Races in Parallel Program Executions. In Advances in Languages and Compilers for Parallel Processing, MIT Press, 1991, pp. 109-129.
102
ReferencesCont.
S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. E. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. In ACM Transactions on Computer Systems, 15(4): pp. 391-411, 1997.
O. Zeev-Ben-Mordehai. Efficient Integration of On-The-Fly Data Race Detection in Distributed Shared Memory and Symmetric Multiprocessor Environments. Research Thesis, May 2001.
103
The End