
Page 1

Rex: Replication at the Speed of Multi-core

Zhenyu Guo, Chuntao Hong, Dong Zhou*, Mao Yang, Lidong Zhou, Li Zhuang

Microsoft Research, CMU*

Page 2

Tension between Replication and Multi-core

• Most applications are multi-threaded
• But a replicated service can execute requests on only a single thread

Sacrifices performance for replication

[Figure: example services (Database, Lock Server, File Server, Key-value Stores), with the labels "Multi-core" and "Replication"]

Page 3

Rex: Replication at the Speed of Multi-core

[Figure labels: Replication, Multi-core]

Page 4

Outline

• Motivation
• System Overview
• Implementation
• Evaluation

Page 5

State Machine Replication

• To replicate a service:
  1. Model it as a deterministic state machine
  2. Order requests with a consensus protocol
  3. Execute requests on a single thread

[Figure: requests are ordered by consensus and delivered to the servers. Sequential execution yields consistent states; parallel execution on multi-core servers yields inconsistent states.]
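The three steps on this slide can be sketched in a few lines. This is a minimal illustration, not Rex code: the `KVStateMachine` class, the request tuples, and `run_replica` are hypothetical, and the consensus protocol is replaced by a fixed list that stands for the agreed order.

```python
# Hypothetical sketch of classic state machine replication.
# Consensus is stubbed out: `log` stands for the order the protocol agreed on.

class KVStateMachine:
    """A deterministic state machine: same requests, same order, same state."""

    def __init__(self):
        self.data = {}

    def apply(self, request):
        op, key, value = request
        if op == "put":
            self.data[key] = value
        elif op == "append":
            self.data[key] = self.data.get(key, "") + value
        return self.data.get(key)


def run_replica(log):
    """Execute the agreed log on a single thread, one request at a time."""
    sm = KVStateMachine()
    replies = [sm.apply(req) for req in log]
    return sm.data, replies


log = [("put", "x", "a"), ("append", "x", "b"), ("put", "y", "c")]
replicas = [run_replica(log) for _ in range(3)]
```

Because every replica applies the same log sequentially, all three end in the same state and produce the same replies. The price is that step 3 leaves the other cores idle.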

Page 6

Why Multi-thread Breaks State Machine Replication

• Non-deterministic decisions: locking order, etc.
• Replicas make these decisions independently

[Figure: Server 1 and Server 2; labels "Performance" and "Consistency"]
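A small example makes the divergence concrete. It is a sketch under stated assumptions: the two requests and the `run_server` helper are invented for illustration, and thread scheduling is simulated by an explicit lock acquisition order so that the result is repeatable.

```python
# Hypothetical sketch: two replicas receive the same requests, but their
# threads win the lock in different orders. Thread scheduling is simulated
# by an explicit acquisition order so the example is repeatable.

REQUESTS = {
    "r1": lambda x: x + 1,   # handled by thread 1
    "r2": lambda x: x * 2,   # handled by thread 2
}


def run_server(lock_order, x=1):
    """Apply the requests in the order in which their threads got the lock."""
    for name in lock_order:
        x = REQUESTS[name](x)
    return x


server1 = run_server(["r1", "r2"])   # thread 1 locks first: (1 + 1) * 2
server2 = run_server(["r2", "r1"])   # thread 2 locks first: (1 * 2) + 1
```

Both servers saw the same requests, yet `server1` ends at 4 and `server2` at 3, because each made the locking decision on its own.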

Page 7

Rex: Execute-Agree-Follow

[Figure: the primary executes requests and records traces (Execute); the traces go through consensus (Agree); the secondaries replay the agreed traces (Follow)]
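The execute-agree-follow flow can be sketched as three functions. Everything here is an assumption made for illustration: the trace is reduced to the order in which requests took effect, a seeded shuffle stands in for nondeterministic scheduling, and `agree` is a stub that simply hands every secondary the same list instead of running a consensus protocol.

```python
import random

# Hypothetical sketch of execute-agree-follow. The "trace" here is just the
# order in which requests took effect; consensus is stubbed as a function
# that hands every replica the same list.

OPS = {"inc": lambda x: x + 1, "double": lambda x: x * 2, "neg": lambda x: -x}


def execute_on_primary(requests, rng):
    """Execute: run requests in a scheduler-chosen order and record it."""
    order = list(requests)
    rng.shuffle(order)            # stands in for nondeterministic scheduling
    state = 1
    for name in order:
        state = OPS[name](state)
    return state, order           # `order` is the trace


def agree(trace, n_secondaries):
    """Agree: stub for consensus; every secondary gets the same trace."""
    return [list(trace) for _ in range(n_secondaries)]


def follow(trace):
    """Follow: replay the primary's decisions instead of making new ones."""
    state = 1
    for name in trace:
        state = OPS[name](state)
    return state


primary_state, trace = execute_on_primary(["inc", "double", "neg"], random.Random(7))
secondary_states = [follow(t) for t in agree(trace, 2)]
```

Whatever order the primary happened to pick, the secondaries reach the same state, because they follow the recorded decisions rather than deciding independently.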

Page 8

Programming With Rex

1. Model the app as a RexRSM
2. Use Rex to make non-deterministic decisions
   • RexLocks, RexCond, …
   • RexTimeStamp, RexRand, etc.
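The slide names the primitives but not their interfaces, so the following is not the Rex API. It is a hypothetical wrapper, `RecordedSource`, in the spirit of RexRand and RexTimeStamp: in record mode it makes the real decision and logs it, and in replay mode it returns the logged decision.

```python
import random
import time

# Hypothetical sketch, not the Rex API: a wrapper in the spirit of
# RexRand / RexTimeStamp. In record mode it makes the real decision and
# logs it; in replay mode it returns the logged decision instead.

class RecordedSource:
    def __init__(self, produce, log=None):
        self.produce = produce            # the real nondeterministic call
        self.replaying = log is not None
        self.log = list(log) if self.replaying else []
        self.cursor = 0

    def next(self):
        if self.replaying:
            value = self.log[self.cursor]
            self.cursor += 1
            return value
        value = self.produce()
        self.log.append(value)
        return value


def handle_request(rand, clock):
    """Application logic that depends on two nondeterministic inputs."""
    return {"token": rand.next(), "stamp": clock.next()}


# Primary: record mode.
p_rand = RecordedSource(lambda: random.randrange(1 << 32))
p_clock = RecordedSource(time.time)
primary_reply = handle_request(p_rand, p_clock)

# Secondary: replay mode, fed with the primary's logs.
s_rand = RecordedSource(lambda: random.randrange(1 << 32), log=p_rand.log)
s_clock = RecordedSource(time.time, log=p_clock.log)
secondary_reply = handle_request(s_rand, s_clock)
```

The application code is identical on both sides; only the mode of the wrapper differs, which is why routing every nondeterministic decision through such wrappers keeps replicas consistent.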

Page 9

Outline

• Motivation
• System Overview
• Implementation
• Evaluation

Page 10

Normal Execution: Primary

[Figure: on the primary, thread t1 handles request 1 and thread t2 handles request 2. Each thread runs four numbered events: 1 request, 2 lock A, 3 unlock A, 4 reply.]

Trace:
(t1, 1, request 1)
…
Causal edge ((t1, 3) -> (t2, 2))
…
(t1, 4, reply 1)
…
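One way to produce a trace of this shape is to number events per thread and, on every lock acquisition, add a causal edge from the previous unlock of that lock. The `TraceRecorder` class below is a hypothetical sketch, not Rex's recorder, and threads are simulated by explicit calls so that the interleaving matches the slide.

```python
# Hypothetical sketch of trace recording on the primary. Threads are
# simulated by explicit calls so that the interleaving matches the slide.

class TraceRecorder:
    def __init__(self):
        self.events = []        # (thread, per-thread event number, label)
        self.edges = []         # ((src thread, src no), (dst thread, dst no))
        self.counters = {}      # thread -> last event number
        self.last_unlock = {}   # lock name -> (thread, event no) of last unlock

    def _log(self, thread, label):
        n = self.counters.get(thread, 0) + 1
        self.counters[thread] = n
        self.events.append((thread, n, label))
        return (thread, n)

    def event(self, thread, label):
        return self._log(thread, label)

    def lock(self, thread, name):
        dst = self._log(thread, "lock " + name)
        src = self.last_unlock.get(name)
        if src is not None and src[0] != thread:
            self.edges.append((src, dst))   # this lock came after that unlock
        return dst

    def unlock(self, thread, name):
        self.last_unlock[name] = self._log(thread, "unlock " + name)


rec = TraceRecorder()
rec.event("t1", "request 1")
rec.event("t2", "request 2")
rec.lock("t1", "A")
rec.unlock("t1", "A")
rec.lock("t2", "A")       # t2 gets A only after t1 released it
rec.event("t1", "reply 1")
rec.unlock("t2", "A")
rec.event("t2", "reply 2")
```

With this interleaving the recorder holds the single causal edge `(("t1", 3), ("t2", 2))`, the same edge that appears in the slide's trace.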

Page 11

Normal Execution: Secondary

[Figure: the secondary replays the same events on t1 and t2. t2's lock A (event 2) is a waited event: it waits until t1's unlock A (event 3) has been replayed.]

Trace:
(t1, 1, request 1)
…
Causal edge ((t1, 3) -> (t2, 2))
…
(t1, 4, reply 1)
…
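Replay on the secondary can be sketched with real threads: before running an event that is the destination of a causal edge, a thread blocks until the source event has completed. The `Replayer` class is hypothetical. For brevity it assumes at most one incoming edge per event and runs each event while holding the condition's lock.

```python
import threading

# Hypothetical sketch of replay on a secondary: before running an event
# that is the destination of a causal edge, a thread waits until the
# source event has completed. Assumes at most one incoming edge per event.

class Replayer:
    def __init__(self, edges):
        self.waits = {dst: src for src, dst in edges}   # dst event -> src event
        self.done = set()
        self.cond = threading.Condition()
        self.order = []                                 # global completion order

    def run_event(self, thread, number, label):
        me = (thread, number)
        with self.cond:
            src = self.waits.get(me)
            if src is not None:
                self.cond.wait_for(lambda: src in self.done)   # waited event
            self.order.append((thread, number, label))
            self.done.add(me)
            self.cond.notify_all()


def replay_thread(replayer, thread, labels):
    for number, label in enumerate(labels, start=1):
        replayer.run_event(thread, number, label)


edges = [(("t1", 3), ("t2", 2))]          # taken from the primary's trace
replayer = Replayer(edges)
workers = [
    threading.Thread(target=replay_thread,
                     args=(replayer, "t2", ["request 2", "lock A", "unlock A", "reply 2"])),
    threading.Thread(target=replay_thread,
                     args=(replayer, "t1", ["request 1", "lock A", "unlock A", "reply 1"])),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Even though t2 is started first, its `lock A` never completes before t1's `unlock A`. Events without an incoming edge run without waiting, so the threads still execute in parallel wherever the trace allows it.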

Page 12

Primary Failover

• Primary
  • restart from checkpoint
  • rejoin
• Secondary
  • upgrade to primary
  • switch replay -> record

[Figure: trace timeline with committed and uncommitted portions at the point of the crash]

Page 13

Unique Challenges: Integrating Replication and Record/Replay

• Inconsistent cut
• “Holes” in logs
• Causal edge pruning
• Hybrid execution
• …

Page 14

The Inconsistent Cut Problem

• Logs are collected from each thread asynchronously
• An inconsistent cut contains the destination node of a causal edge without its source node
• Problem: the secondaries are not able to follow

[Figure: threads t1 and t2 with events A, B, C and Reply; an inconsistent cut is marked]
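The definition of an inconsistent cut translates directly into a check. In this sketch a cut is represented as a map from thread to the number of that thread's events collected so far. The representation and the function names are assumptions made for illustration.

```python
# Hypothetical sketch: a cut records, per thread, how many of its events
# have been collected. It is inconsistent if it contains the destination
# of a causal edge but not the source.

def in_cut(cut, event):
    thread, number = event
    return number <= cut.get(thread, 0)


def is_consistent(cut, edges):
    return all(in_cut(cut, src) for src, dst in edges if in_cut(cut, dst))


edges = [(("t1", 3), ("t2", 2))]       # t2's event 2 depends on t1's event 3

good_cut = {"t1": 3, "t2": 2}          # source and destination both included
bad_cut = {"t1": 2, "t2": 2}           # destination included, source missing
```

A secondary handed `bad_cut` would be asked to replay t2's event 2 while the event it must wait for is not in its log, which is why it cannot follow.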

Page 15

Solving the Inconsistent Cut Problem

• Define consensus on the last consistent cut
• Drop C1-C2 when the primary fails
• Reply only when the reply is contained in a committed consistent cut

Use a vector clock to track

[Figure: threads t1 and t2 with events A, B, C and Reply; cuts C1 and C2]
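One possible way to obtain the last consistent cut from whatever has been collected is to shrink the collected cut until no included event depends on an excluded one. This is a sketch under the same assumed representation (thread to event count), not necessarily how Rex computes it. The slide indicates that Rex tracks this with vector clocks.

```python
# Hypothetical sketch: shrink a collected cut to the largest consistent cut
# it contains, then release replies only if they fall inside that cut.

def in_cut(cut, event):
    thread, number = event
    return number <= cut.get(thread, 0)


def last_consistent_cut(cut, edges):
    """Shrink `cut` until no included event depends on an excluded one."""
    cut = dict(cut)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            if in_cut(cut, dst) and not in_cut(cut, src):
                cut[dst[0]] = dst[1] - 1     # drop dst and everything after it
                changed = True
    return cut


def can_reply(committed_cut, reply_event):
    """A reply is released only if it lies inside a committed consistent cut."""
    return in_cut(committed_cut, reply_event)


edges = [(("t1", 3), ("t2", 2)), (("t2", 3), ("t1", 5))]
collected = {"t1": 2, "t2": 4}               # t1's log lags behind t2's
committed = last_consistent_cut(collected, edges)
```

Here t2's events 2 to 4 are dropped from the proposal because they depend on t1's event 3, which has not been collected yet. A reply at `("t2", 4)` is therefore held back until a later cut includes it.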

Page 16

Outline

• Motivation
• System Overview
• Implementation
• Evaluation

Page 17

Experiment Setup

• Real-world applications

  App            Description
  Thumbnail      Generating and storing thumbnails
  XLock          Lock server similar to Chubby
  File Server    File server
  Kyoto Cabinet  Key-value store
  LevelDB        Local storage behind BigTable
  MemCached      Cache server

• Micro-benchmark: for lock contention ratio
• Servers: 12-core, 24-thread, 10GE network

Page 18

Performance Overview

[Figure: bar chart of max speedup (y-axis, 0 to 25) for thumbnail, xlock, fileserver, memcached, kyoto cabinet, and leveldb; series: serial, nonreplicated, Rex]

• Rex scales as well as the nonreplicated version
• Overhead below 24%

Page 19

LevelDB in Detail

[Figure: speedup (left axis, 0 to 6) of nonreplicated and Rex, and waited events in thousand events/sec (right axis, 0 to 70), against number of threads (1, 2, 4, 8, 16, 24, 32); a marker is labeled "# cores"]

Waited events grow with the number of threads, and so does overhead

Overhead drops when there are more threads to schedule

Page 20

Lock Conflict Ratio

[Figure: throughput in thousands (y-axis, 0 to 12) against lock conflict probability (0.001, 0.01, 0.05, 0.1, 0.2, 0.5, 1); series: native, Rex]

Overhead < 15%

Page 21

Summary

• Rex: execute-agree-follow
• Applied to six real-world applications
• Preserves scalability and low overhead

Page 22

Thanks! Q&A

Page 23

Backups

Page 24

Dealing with Data Races

• Reply logging & compare
• Resource version checking
• Lock-free data structures: NATIVE_EXEC
• Experience shows that getting rid of data races is doable

Page 25

Workloads

• Thumbnail:
  • 1 pic per request
• K-V stores:
  • 1M pairs
  • 16 byte key, 100 byte value
  • 10% write
• File system:
  • 16KB random requests
  • 20% write
• Xlock:
  • 90% lease renew
  • 100B – 5KB file

Page 26

Lock Granularity

[Figure: speedup (y-axis, 0 to 14) against conflict ratio (0.001, 0.01, 0.05, 0.1); series: 100%, 80%, 60%, 10%]

Page 27

Request Granularity

[Figure: throughput in thousands (y-axis, 0 to 14) against number of threads (1, 2, 4, 8, 12); series: 0.1ms, 1ms, 10ms]

• 10% computation in locks
• 1% conflict ratio

Page 28

Experimental Results: Scalability

Page 29

Causal Events & Performance

Page 30

Improving Performance: Causal Edge Pruning with Vector Clock

• More causal edges, more overhead
• Causal edge pruning: trades primary performance for secondary performance

Reduces causal edges by 58% to 99%

[Figure: threads t1 and t2 with Lock/UnLock events on locks L1 and L2, numbered 1 and 2]
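A sketch of how vector clocks can identify redundant edges: if the locking thread already knows, through an earlier edge, about the unlock it would depend on, the new edge is implied by transitivity and need not be recorded. The `PruningRecorder` class and the scenario below are assumptions for illustration. The scenario is chosen to resemble the figure's locks L1 and L2 but is not a reconstruction of it.

```python
# Hypothetical sketch of causal edge pruning with vector clocks.
# Each thread keeps a clock: for every thread, the latest event number
# known to precede the thread's current point of execution.

class PruningRecorder:
    def __init__(self):
        self.clock = {}         # thread -> {thread: latest known event no}
        self.last_unlock = {}   # lock -> (thread, event no, clock snapshot)
        self.edges = []
        self.pruned = 0

    def _tick(self, thread):
        vc = self.clock.setdefault(thread, {})
        vc[thread] = vc.get(thread, 0) + 1
        return vc

    def lock(self, thread, name):
        vc = self._tick(thread)
        prev = self.last_unlock.get(name)
        if prev is None or prev[0] == thread:
            return
        src_thread, src_no, src_vc = prev
        if vc.get(src_thread, 0) >= src_no:
            self.pruned += 1                      # implied by earlier edges
        else:
            self.edges.append(((src_thread, src_no), (thread, vc[thread])))
        for t, n in src_vc.items():               # merge what the unlocker knew
            vc[t] = max(vc.get(t, 0), n)

    def unlock(self, thread, name):
        vc = self._tick(thread)
        self.last_unlock[name] = (thread, vc[thread], dict(vc))


rec = PruningRecorder()
rec.lock("t1", "L1")      # t1 event 1
rec.lock("t1", "L2")      # t1 event 2
rec.unlock("t1", "L1")    # t1 event 3
rec.unlock("t1", "L2")    # t1 event 4
rec.lock("t2", "L2")      # t2 event 1: edge (t1, 4) -> (t2, 1) is recorded
rec.lock("t2", "L1")      # t2 event 2: edge from (t1, 3) is redundant
```

The second edge is pruned because t2 already waits for t1's event 4, which comes after event 3 on the same thread. The bookkeeping costs the primary some time on every lock operation, while the secondary has fewer events to wait on, which matches the trade-off named on the slide.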

Page 31

Replicated State Machine

Page 32

Rex: Causal Order Replication

Page 33

Correctness

• Correctness guaranteed by:
  1. Capturing all non-determinism with Rex
  2. Consensus on traces
  3. The agreed trace being a continuous sequence (no holes)
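The "no holes" condition can be illustrated with a small helper. The sketch assumes, purely for illustration, that the trace is agreed on in numbered segments that may be decided out of order. A secondary may then follow only the contiguous prefix of decided segments.

```python
# Hypothetical sketch: `decided` maps a segment number to an agreed trace
# segment. Only the gap-free prefix starting at 0 may be followed.

def contiguous_prefix(decided):
    """Return the segments 0..k that are all decided, stopping at the first hole."""
    prefix = []
    i = 0
    while i in decided:
        prefix.append(decided[i])
        i += 1
    return prefix


decided = {0: "seg-a", 1: "seg-b", 3: "seg-d"}   # segment 2 is not decided yet
followable = contiguous_prefix(decided)
```

Segment 3 is agreed but cannot be replayed yet: following it would skip the events of segment 2, and with them causal edges that later events may depend on.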

Page 34

Inconsistent Cut: Why Is It Bad?

Trace:  t1 unlock -> t2 lock -> t2 unlock -> t3 lock, reply: 0
Replay: t1 unlock -> t3 lock -> t3 unlock -> t2 lock, reply: 1

Should we reply 0 or 1?

[Figure: t1 runs Lock, X=0, Unlock; t2 runs Lock, Read(X), Unlock, Reply; t3 runs Lock, X=1, Unlock; the primary then crashes]
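The two orders on this slide can be replayed in a few lines. The `run` helper is hypothetical. Each list entry stands for one thread's critical section on the shared variable X.

```python
# Hypothetical sketch of the slide's scenario: each entry in `order` is one
# thread's critical section on the shared variable X.

def run(order):
    x = None
    reply = None
    for thread in order:
        if thread == "t1":
            x = 0              # t1: Lock, X=0, Unlock
        elif thread == "t2":
            reply = x          # t2: Lock, Read(X), Unlock, then Reply
        elif thread == "t3":
            x = 1              # t3: Lock, X=1, Unlock
    return reply


traced = run(["t1", "t2", "t3"])     # lock order recorded on the primary
replayed = run(["t1", "t3", "t2"])   # order a replica may pick without the edge
```

The primary's order yields reply 0, while the alternative order yields reply 1. If the primary had already sent 0 to the client and a new primary then replays the second order, the client has seen a reply that the replicated state no longer justifies.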

Page 35

Inconsistent Cut: Solving the Reply Problem

• Reply only when the reply and all its dependencies are committed
• Use a vector clock to detect this

[Figure: secondary with threads t1, t2, t3; events labeled with vector clocks (0, 0, 0), (1, 0, 0), (1, 1, 0); Reply (1, 1, 0); Cut1 (0, 2, 1); Cut2 (1, 2, 2)]
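The vector-clock test can be sketched with the tuples from the figure. The sketch reads each tuple as per-thread event counts for (t1, t2, t3), which is an assumption about the figure. The `covered` function and the `ReplyGate` class are hypothetical names.

```python
# Hypothetical sketch: a reply carries the vector clock of everything it
# depends on, and is released once a committed cut covers that clock.
# Tuples are read as per-thread event counts for (t1, t2, t3).

def covered(cut, clock):
    """True if the cut includes every event the clock depends on."""
    return all(c >= v for c, v in zip(cut, clock))


class ReplyGate:
    def __init__(self):
        self.pending = []      # (vector clock, reply payload)

    def submit(self, clock, reply):
        self.pending.append((clock, reply))

    def on_commit(self, cut):
        """Release every pending reply whose dependencies are all committed."""
        ready = [r for c, r in self.pending if covered(cut, c)]
        self.pending = [(c, r) for c, r in self.pending if not covered(cut, c)]
        return ready


gate = ReplyGate()
gate.submit((1, 1, 0), "reply")
after_cut1 = gate.on_commit((0, 2, 1))   # t1's event 1 is not committed yet
after_cut2 = gate.on_commit((1, 2, 2))   # now every dependency is covered
```

Under this reading, Cut1 includes the reply's own thread but misses the t1 event it depends on, so the reply is held. Cut2 dominates the reply's clock in every component, so the reply is released.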