
DCatch: Automatically Detecting Distributed Concurrency Bugs

in Cloud Systems

Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li,

Shan Lu, Haryadi Gunawi, and Chen Tian*


Cloud systems


Distributed concurrency bugs (DCbugs)

• Unexpected timing among distributed operations

• Example: MapReduce-3274

[Figure: two message timelines among machines A, B, and C; the second ends in a hang]


DCbugs need to be tackled

• Common in distributed systems [1, 2, 3]

– 26% of failures are non-deterministic [1]

– 6% of software bugs in cloud systems [2]

• Difficult to avoid, expose and diagnose

“That is one monster of a race!”

Hadoop Map/Reduce / MAPREDUCE-3274

“There isn’t a week going by without new bugs about races.”

HBase / HBASE-4397

“We have already found and fix many cases, however it seems exist many other cases.”

HBase / HBASE-6147

“This has become quite messy, sigh.”

Hadoop Map/Reduce / MAPREDUCE-4819

“Great catch, Sid! Apologies for missing the race condition.”

Hadoop Map/Reduce / MAPREDUCE-4099

“We [prefer] debug crashes instead of hanging jobs.”

Hadoop Map/Reduce / MAPREDUCE-3634

Can we detect DCbugs before they manifest?

[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
[2] Gunawi. What Bugs Live in the Cloud? In SoCC’14
[3] Leesatapornwongsa. TaxDC. In ASPLOS’16


Previous work

• Model checking

– Works on abstracted models

– Faces the state-space explosion problem


Our idea

• Follow the philosophy of traditional concurrency bug detection

[Figure: Machine 1, Machine 2, Machine 3, Machine 4]


Example

[Figure: machines A, B, and C; the code below runs on machine B]

//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ...
}

//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}
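The outcome of the two handlers above depends on which one runs first. A minimal Java sketch of that dependence follows. It replays the two calls sequentially in each order rather than racing real threads, and `JobTable`, `OrderingDemo`, the `String` key and value types, and the `register` helper are all hypothetical names chosen for illustration, not Hadoop code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the job table on machine B.
final class JobTable {
    private final Map<String, String> jMap = new ConcurrentHashMap<>();

    // Assumed helper: the slide does not show how jobs enter jMap.
    void register(String jID, String task) { jMap.put(jID, task); }

    // Handler run by the UnReg thread.
    void unReg(String jID) { jMap.remove(jID); }

    // Handler run by the RPC thread; returns null once the job is gone.
    String getTask(String jID) { return jMap.get(jID); }
}

final class OrderingDemo {
    // Replays the two handlers in a fixed order and returns what getTask saw.
    static String run(boolean unRegFirst) {
        JobTable table = new JobTable();
        table.register("job-1", "task-1");
        if (unRegFirst) {
            table.unReg("job-1");
            return table.getTask("job-1");
        }
        String task = table.getTask("job-1");
        table.unReg("job-1");
        return task;
    }

    public static void main(String[] args) {
        System.out.println("getTask before unReg: " + run(false));
        System.out.println("getTask after unReg:  " + run(true));
    }
}
```

When `unReg` wins, `getTask` returns `null`, which is the value a remote caller would have to cope with.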


Local concurrency bug detection

Is the problem solved?


Local concurrency bug detection

[Figure: pipeline Trace → HB → Triage → Trigger over threads T1, T2, T3; a racing r/w pair, an assert(r) check, and an injected +sleep]

Challenges

C1 (Trace): How to handle the huge amount of memory accesses?

C2 (HB): What’s the happens-before model?

C3 (Triage): How to estimate the distributed impact of a race?

C4 (Trigger): How to trigger with distributed time manipulation?
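The HB stage of this pipeline can be pictured as a graph query: two accesses are a race candidate when they touch the same variable, at least one is a write, and neither reaches the other through happens-before edges. The Java sketch below is one way to express that check. The `HbRaceCheck` class, its `Access` fields, and the string output format are assumptions made for illustration and are not DCatch's trace format or implementation. Edges, including program order, must be added explicitly.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative happens-before (HB) race check; not DCatch's actual data structures.
final class HbRaceCheck {
    // One traced access. Field names are assumptions made for this sketch.
    static final class Access {
        final int id;
        final String thread;
        final String variable;
        final boolean isWrite;

        Access(int id, String thread, String variable, boolean isWrite) {
            this.id = id;
            this.thread = thread;
            this.variable = variable;
            this.isWrite = isWrite;
        }
    }

    private final List<Access> accesses = new ArrayList<>();
    private final List<List<Integer>> successors = new ArrayList<>();

    Access add(String thread, String variable, boolean isWrite) {
        Access a = new Access(accesses.size(), thread, variable, isWrite);
        accesses.add(a);
        successors.add(new ArrayList<>());
        return a;
    }

    // Records an HB edge: `before` is ordered ahead of `after`.
    void addEdge(Access before, Access after) {
        successors.get(before.id).add(after.id);
    }

    // Depth-first search over HB edges.
    boolean happensBefore(Access from, Access to) {
        Deque<Integer> stack = new ArrayDeque<>();
        Set<Integer> seen = new HashSet<>();
        stack.push(from.id);
        while (!stack.isEmpty()) {
            int current = stack.pop();
            if (!seen.add(current)) continue;
            for (int next : successors.get(current)) {
                if (next == to.id) return true;
                stack.push(next);
            }
        }
        return false;
    }

    // Conflicting accesses (same variable, at least one write) with no HB order either way.
    List<String> races() {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < accesses.size(); i++) {
            for (int j = i + 1; j < accesses.size(); j++) {
                Access a = accesses.get(i);
                Access b = accesses.get(j);
                boolean conflict = a.variable.equals(b.variable) && (a.isWrite || b.isWrite);
                if (conflict && !happensBefore(a, b) && !happensBefore(b, a)) {
                    found.add(a.id + "-" + b.id);
                }
            }
        }
        return found;
    }

    // T1 writes x, T2 and T3 read x; only the T1 -> T2 pair is ordered.
    static List<String> demo() {
        HbRaceCheck check = new HbRaceCheck();
        Access w = check.add("T1", "x", true);
        Access r1 = check.add("T2", "x", false);
        check.add("T3", "x", false);
        check.addEdge(w, r1);
        return check.races();
    }
}
```

In `demo()`, the write by T1 and the read by T3 are reported, while the two reads are not, because reads do not conflict with each other.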

Contribution

• A comprehensive HB Model for distributed systems

• DCatch tool detects DCbugs from correct runs

– Challenges C1 to C4 (Trace, HB, Triage, Trigger) are solved by DCatch

• Evaluate on 4 systems

• Report 32 DCbugs, with 20 of them being truly harmful


Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion


DCatch Happens-before Model

[Figure: HBase example with HMaster, HRegionServer, and ZK Coordinator; regular threads (Thd), an event thread handling event e, a write (w) and a read (r)]

Where is the HB model for distributed systems?


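DCatch Happens-before Model

[Figure: the HBase example (HMaster, HRegionServer, ZK Coordinator; threads, an event thread, w and r) annotated with operation categories: Distributed vs. Local, Asynchronous vs. Synchronous, Standard vs. Customized]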

Distributed rules

[Figure: rule taxonomy with three axes: Distributed vs. Local, Standard vs. Custom, Async. vs. Sync.]


Distributed rule #1

• Logical time clock (Leslie Lamport, 1978)

[Figure: Send on Machine1 happens before Recv on Machine2]

[Rule table (Standard/Customized × Asynchronous/Synchronous): Socket]
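The send/receive ordering in this rule is the one Lamport's logical clocks capture: a receive is always stamped later than the matching send. The Java sketch below shows the standard clock update under that scheme. `LamportClock` and `LamportDemo` are hypothetical names for illustration, and nothing here is taken from DCatch's implementation.

```java
// Minimal Lamport logical clock; one instance per machine.
final class LamportClock {
    private long time = 0;

    // Local event or message send: advance the clock and return the timestamp.
    synchronized long tick() {
        return ++time;
    }

    // Message receive: move past both the local time and the sender's timestamp.
    synchronized long receive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }
}

final class LamportDemo {
    // Returns {send timestamp on machine 1, receive timestamp on machine 2}.
    static long[] sendThenReceive() {
        LamportClock machine1 = new LamportClock();
        LamportClock machine2 = new LamportClock();

        machine1.tick();                 // local work on machine 1
        machine1.tick();
        machine2.tick();                 // unrelated local work on machine 2

        long sendTs = machine1.tick();   // Send on machine 1
        long recvTs = machine2.receive(sendTs); // Recv on machine 2
        return new long[] {sendTs, recvTs};
    }

    public static void main(String[] args) {
        long[] ts = sendThenReceive();
        System.out.println("send=" + ts[0] + " recv=" + ts[1]);
    }
}
```

Machine 2's clock jumps past the sender's timestamp on receipt, so the receive is ordered after the send even though machine 2 had done less local work.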


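Distributed rule #2

[Figure: RPC-call on Machine1 happens before RPC-begin on Machine2; RPC-end on Machine2 happens before RPC-rt (return) on Machine1; the caller is waiting in between]

[Rule table (Standard/Customized × Asynchronous/Synchronous): Socket, RPC]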


Distributed rule #3

In multi-threaded systems:

//Thread1
flag = true;

//Thread2
while(!flag){
}
...

[Rule table (Standard/Customized × Asynchronous/Synchronous): Socket, RPC, Dist. while-loop]


Distributed rule #3

In distributed systems:

Machine B:

//Thread1
flag = true;

//Thread2
bool getFlag(){
  return flag;
}

Machine A:

//Thread
while(!getFlag()){
}
...

[Rule table (Standard/Customized × Asynchronous/Synchronous): Socket, RPC, Dist. while-loop]
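The pull-based loop in this rule can be sketched in Java as follows. This is an in-process approximation only: the remote `getFlag()` RPC is modeled as a plain method call, and `MachineB`, `PullLoopDemo`, `finishWork`, and the `payload` field are hypothetical names introduced for illustration. The point is the ordering: whatever was written before the flag was set is visible once the loop exits.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for machine B: Thread1 sets the flag, Thread2 answers getFlag().
final class MachineB {
    private final AtomicBoolean flag = new AtomicBoolean(false);
    private int payload = 0; // plain field, published by the flag write below

    // Thread1: finish some work, then set the flag.
    void finishWork() {
        payload = 42;
        flag.set(true);
    }

    // Thread2: the handler that machine A polls (an RPC in the real system).
    boolean getFlag() {
        return flag.get();
    }

    int readPayload() {
        return payload;
    }
}

final class PullLoopDemo {
    // Returns the payload observed by the polling side after its loop exits.
    static int run() {
        MachineB b = new MachineB();
        int[] seen = new int[1];

        // Stand-in for the thread on machine A.
        Thread machineA = new Thread(() -> {
            while (!b.getFlag()) {
                Thread.onSpinWait();
            }
            seen[0] = b.readPayload();
        });

        machineA.start();
        b.finishWork();
        try {
            machineA.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("payload seen after loop exit: " + run());
    }
}
```

Within one JVM the ordering comes from the atomic flag. Across machines no such language-level guarantee exists, which is why a detector needs an explicit rule for this pattern.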


Distributed rule #4

[Figure: threads (Thd), an event thread (Eve thd), and the ZK Coordinator, with a write (w) and a read (r)]

[Rule table (Standard/Customized × Asynchronous/Synchronous): Socket, RPC, ZooKeeper Service]

Distributed rules

              Standard   Customized
Synchronous   RPC        Dist. while-loop
Asynchronous  Socket     ZooKeeper Service


Local rules

              Standard          Customized
Synchronous   Thread fork/join  While-loop
Asynchronous  Event-related     n/a


Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion


C1 (Trace): How to handle the huge amount of memory accesses?

• Selective tracing: only memory accesses in event/message handlers and their callees.
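One way to picture selective tracing is a per-thread counter that is non-zero only while a handler is on the stack, so accesses made by a handler or anything it calls are recorded and everything else is dropped. The Java sketch below follows that idea. `SelectiveTracer`, `runHandler`, `onAccess`, and the variable names in the demo are hypothetical and are not DCatch's instrumentation API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative tracer: records accesses only while a handler is running.
final class SelectiveTracer {
    private final ThreadLocal<Integer> handlerDepth = ThreadLocal.withInitial(() -> 0);
    private final List<String> trace = new ArrayList<>();

    // Wraps an event/message handler; its accesses and its callees' accesses are traced.
    void runHandler(Runnable handler) {
        handlerDepth.set(handlerDepth.get() + 1);
        try {
            handler.run();
        } finally {
            handlerDepth.set(handlerDepth.get() - 1);
        }
    }

    // Assumed to be called at every instrumented memory access.
    void onAccess(String kind, String variable) {
        if (handlerDepth.get() > 0) {
            synchronized (trace) {
                trace.add(kind + " " + variable);
            }
        }
    }

    List<String> trace() {
        synchronized (trace) {
            return new ArrayList<>(trace);
        }
    }
}

final class TracerDemo {
    static List<String> run() {
        SelectiveTracer tracer = new SelectiveTracer();
        tracer.onAccess("w", "startupConfig");   // outside any handler: dropped
        tracer.runHandler(() -> {
            tracer.onAccess("w", "jMap");        // inside a handler: kept
            helper(tracer);                      // callee of the handler: kept
        });
        tracer.onAccess("r", "metrics");         // outside again: dropped
        return tracer.trace();
    }

    private static void helper(SelectiveTracer tracer) {
        tracer.onAccess("r", "jMap");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Only the two `jMap` accesses survive in the trace, which is the reduction the slide describes.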


C2 (HB): What’s the happens-before model?

• HB model with Local rules and Distributed rules

[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13


C3 (Triage): How to estimate the distributed impact of a race?

Machine B:

//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}

//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ...
}

Machine A:

while(!getTask(jID)){
}

Result on Machine A: hang

Page 78: DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1 Machine2 Standard nch. Customize ch. Socket. 54 Distributed rule #1 •Logical time

78

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r

Page 79: DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1 Machine2 Standard nch. Customize ch. Socket. 54 Distributed rule #1 •Logical time

79

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r +sleep

Page 80: DCatch: Automatically Detecting Distributed Concurrency ...shanlu/paper/Dcatch-final.pdfMachine1 Machine2 Standard nch. Customize ch. Socket. 54 Distributed rule #1 •Logical time

80

C1: How to handle the hugeamount of mem accesses?

C2: What’s the happens-beforemodel?

C3: How to estimate thedistributed impact of a race?

C4: How to trigger withdistributed time manipulation?

Challenges

Triage TriggerTrace HB

Thd ThdEvent thread

e2

e1

w

r

+sleep

[Figure: threads (Thd) and an event thread spread across Machine A, Machine B, and Machine C; events e1 and e2; accesses w and r; +sleep inserted at multiple points]
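The "+sleep" idea in the figures can be illustrated with a small single-process sketch. It is not DCatch's triggering mechanism: the class `SleepTriggerSketch` and the parameters `delayWriteMs` and `delayReadMs` are hypothetical, and a sleep only makes one order very likely rather than guaranteed. In the distributed case the two accesses would sit on different machines, which this sketch does not model.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SleepTriggerSketch {
    // Runs a "w" access and an "r" access in two threads and records the order
    // in which they happen. The two delay parameters stand in for "+sleep".
    static List<String> run(long delayWriteMs, long delayReadMs) throws InterruptedException {
        List<String> order = new CopyOnWriteArrayList<>();
        Thread writer = new Thread(() -> { pause(delayWriteMs); order.add("w"); });
        Thread reader = new Thread(() -> { pause(delayReadMs); order.add("r"); });
        writer.start();
        reader.start();
        writer.join();
        reader.join();
        return order;
    }

    // Sleeps for ms milliseconds; a non-positive value means no delay.
    static void pause(long ms) {
        if (ms <= 0) {
            return;
        }
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(0, 300)); // delay r: w is expected to come first
        System.out.println(run(300, 0)); // delay w: r is expected to come first
    }
}
```

Running both orders is the point of the exercise: a reported race is only interesting if flipping the order is feasible and leads to a failure.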

Outline

• Motivation

• DCatch Happens-before Model

• DCatch tool

• Evaluation

• Conclusion

Methodology

• Benchmarks:

– 7 real-world DCbugs from TaxDC [1]

– 4 distributed systems

[1] Leesatapornwongsa et al. TaxDC. In ASPLOS ’16.

Overall results

BugID     Detected?   #Bugs   #Benign   #False-pos
CA-1011   ✔           3       0         0
HB-4539   ✔           3       0         1
HB-4729   ✔           4       1         0
MR-3274   ✔           2       0         4
MR-4637   ✔           1       2         4
ZK-1144   ✔           5       1         1
ZK-1270   ✔           6       2         0
Total                 20      5         7

#Bugs total: 20 = 12 + 8
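As a rough reading aid, the Total row can be turned into shares of all reports. This is a sketch under stated assumptions: it uses only the Total row as printed (20, 5, 7), does not sum the per-benchmark rows, and treats "false-pos / all reports" as one possible way to read the false-positive rate. The class name `ReportBreakdown` is hypothetical.

```java
public class ReportBreakdown {
    // Share of part in whole, as a percentage.
    static double percent(int part, int whole) {
        return 100.0 * part / whole;
    }

    public static void main(String[] args) {
        // Values copied from the Total row of the table.
        int bugs = 20;
        int benign = 5;
        int falsePos = 7;
        int reports = bugs + benign + falsePos;

        System.out.println("reports   = " + reports);
        System.out.println("bugs      = " + percent(bugs, reports) + "%");
        System.out.println("benign    = " + percent(benign, reports) + "%");
        System.out.println("false-pos = " + percent(falsePos, reports) + "%");
    }
}
```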

Other results in our paper

• Performance overhead

• Trace compositions

• HB model impact

– False positives

– False negatives

• …

Conclusion

• An HB model for distributed systems

• DCatch detects DCbugs from correct runs with low false positive rates.

[Figure: the four stages Trace, HB, Triage, Trigger, annotated with challenges C1 to C4; labels: Local, Distributed]

Thank you! Q&A