parallel and distributed systems laboratory paradise: a toolkit for building reliable concurrent...

26
Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay K. Garg Department of Electrical and Computer Engineering The University of Texas at Austin Austin, TX 78712 email: [email protected]

Upload: clarence-barrett

Post on 19-Jan-2016

222 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

Parallel and Distributed Systems Laboratory

Paradise: A Toolkit for Building Reliable Concurrent Systems

Trace Verification for Parallel Systems

Vijay K. GargDepartment of Electrical and Computer Engineering

The University of Texas at AustinAustin, TX 78712

email: [email protected]

Page 2: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

2

Talk Outline

Motivation and Overview

Instrumentation

– Clock : Tracking Dependency

Property Checking

– Sensor : Detecting Global Properties

– Slicer : Computation Slicing

Page 3: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

4

Motivation: Reliable System

Concurrent systems are prone to errors.

– Concurrency, nondeterminism, process and channel failures

Techniques to ensure correctness

Modeling: Model Checking and Formal Verification

Bug Hunting: Simulation, Debugging and Verification

Fault-Tolerance

Page 4: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

5

Paradise Environment

Program Monitor

Slicer

Predicate

Observe

Control

Page 5: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

6

Talk Outline

Motivation and Overview

Instrumentation

– Clock : Tracking Dependency

Property Checking

– Sensor : Detecting Global Properties

– Slicer : Computation Slicing

Page 6: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

7

Trace Model: Total Order vs Partial Order

Total order: interleaving of events in a trace

Partial order: Lamport’s happened-before model

f2e1

CS2 CS1

f1 e2

P1

P2

Partial Order Trace

CS2

CS1

e1 e2

f1 f2

e2e1

CS1CS2

f1 f2

Successful Trace

Specification:CS1 Λ CS2

¬CS2 ¬CS1

¬ CS1

¬CS2

¬CS1 ¬ CS2

Faulty Trace

Page 7: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

8

Tracking Dependency

computation: a set of events ordered by “happened before” relation

Problem: Timestamp events to answer

– e happened before f ?

– e concurrent with f ?

Page 8: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

9

Clocks in a Distributed System

Result: s happened before t i the vector at s is less than the vector at t.

Vector Clocks [Fidge 89, Mattern 89]

P1

(1,0,0) (2,1,0) (3,1,0)

P2

(0,1,0) (0,2,0)

P3

(0,0,1) (0,0,2) (2,1,3)

Page 9: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

10

Dynamic Chain Clocks

Problem with vector clocks: scalability, dynamic process structure

Idea: Computing the “chains” in an online fashion [Aggarwal and Garg PODC 05] for relevant events

a

f

eb

d

c

h

g

a b c d

e f g h

A computation with 4 processes The relevant subcomputation

P1

P2

P3

P4

Page 10: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

11

Experimental Results

Simulation of a computation with 1% relevant events

Measured

– number of components vs number of threads

– total time overhead vs number of threads

Page 11: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

12

Talk Outline

Motivation and Overview

Instrumentation

– Clock : Tracking Dependency

Property Checking

– Sensor : Detecting Global Properties

– Slicer : Computation Slicing

Page 12: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

13

Global Property Detection

Predicate: A global condition expressed using variables on processes

– e.g., more than one process is in critical section, there is no token in the system

Problem: find a global state that satisfies the given predicate

P1

P2

G1 G2

Critical section

Page 13: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

14

The Main Difficulty in Partial Order

Algorithm for general predicate [Cooper and Marzullo 91]

Too many global states : A computation may contain as many as O(kn) global states

• k: maximum number of events on a process• n: number of processes

e1 e2

f1 f2

T┴

P1

P2 {e1, ┴} {f1, ┴}

{e1, f1, ┴}

{e2, e1, f1, ┴}

{e2, e1, f2, f1, ┴

{e1, f2, f1, ┴}

{e2, e1, ┴}

{┴}

Page 14: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

15

Efficient Predicate Detection for Special Cases

stable predicate: [Chandy and Lamport 85] • once the predicate becomes true, it stays true e.g.,

deadlock

unstable predicate:• observer independent predicate [Charron-Bost et al 95]

occurs in one interleaving occurs in all interleavings e.g., any disjunction of local predicate

• linear predicate [Chase and Garg 95] e.g., conjunctive predicates such as there is no leader in

the system• relational predicate: x1 + x2 +…+ xn ≥ k [Chase and Garg

95] e.g., violation of k-mutual exclusion

Page 15: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

16

Algorithms for Conjunctive Predicates

Centralized Algorithm [Garg and Waldecker 92] Each non-checker process maintains its local vector and sends to the checker process the chain clock whenever

– local predicate is true

– at most once in each message interval.

Time complexity: Checker requires at most O(n2m) comparisons.

– token based algorithm [Garg and Chase 95]

– completely distributed algorithm [Garg and Chase 95]

– keeping queues shorter [Chiou and Korfhage 95]

– avoiding control messages [Hurfin, Mizuno, Raynal, Singhal 96]

Page 16: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

17

Other Special Classes of Predicates

Relational Predicates

– Let xi: number of token at Pi

– Σxi < k: loss of tokens

– Algorithms: max-flow techniques [Groselj 93, Chase and Garg 95, Wu and Chen 98]

– Dilworth's partition [Tomlinson and Garg 96]

Page 17: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

18

Talk Outline

Motivation and Overview

Instrumentation

– Clock : Tracking Dependency

Property Checking

– Sensor : Detecting Global Properties

– Slicer : Computation Slicing

Page 18: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

19

The Main Idea of Computation Slicing

Partial order trace

slice

state explosion

keep all red global statesslicing

Page 19: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

20

How does Computation Slicing Help?

Partial order trace

slice

retain all global states satisfying b1

slicing for b1

check b1 Λ b2

check b2

satisfy b1

Page 20: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

21

Example

Detect predicate (x*y + z < 5) Λ (x ≥1) Λ (z ≤3)

P1

P2

P3

x

y

z

a

1

b

2

c

-1

d

0

e

0

f

2

g

1

h

3

u

4

v

1

w

2

x

4

{a,e,f,u,v} {b}

{w} {g}

Computation

Slice with respect to (x ≥1) Λ (z ≤3)

Page 21: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

22

Computation Slice

computation slice: a sub-computation such that: [Mittal and Garg 01]

1. it contains all global states of the computation satisfying the given predicate, and

2. it contains the least number of global states

Page 22: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

23

POTA Architecture [Sen Dissertation 04]

Instrumentor

Specification

SlicerPredicate Detector

Trace

Slice

Predicate (Specification)

TranslatorExecuteProgram

ExecuteSPIN

Program

InstrumentedProgram

Promela

Trace Slice

yes/witnessno/counterexample

no/counterexample

yes

Analyzer

Page 23: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

24

Results

Efficient polynomial-time algorithms for computing the slice for:

– linear predicates: [Garg and Mittal 01]• time-complexity: O(n2m)

– general predicate:• Theorem: Given a computation, if a predicate b can be

detected efficiently then the slice for b can also be computed efficiently. [Mittal,Sen and Garg 03]

– combining slices: Boolean operators

– temporal logic operators: EF, AG, EG

– approximate slice: For arbitrary boolean expression

• n: number of processes• m: number of events

Page 24: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

25

Experiments: Dining Philosophers Trace Verification

POTA: Partial Order Trace Analyzer (based on slicing) [Sen and Garg 03]

SPIN: A widely used model checking tool [Holzmann 97]

– SPIN: 250 seconds for n = 6, runs out of memory for n > 6.

– POTA: can handle n= 200. Used 400 seconds.

Predicate: Two neighboring dining philosophers do not eat concurrently

Page 25: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

26

Conclusions

Bug-hunting in concurrent systems

Total order vs. Partial Order

Abstraction like slicing to combat state space explosion problem

Page 26: Parallel and Distributed Systems Laboratory Paradise: A Toolkit for Building Reliable Concurrent Systems Trace Verification for Parallel Systems Vijay

27

Questions

?