C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection Lalith Suresh (TU Berlin) with Marco Canini (UCL), Stefan Schmid, Anja Feldmann (TU Berlin)


Page 1: C3: Cutting Tail Latency in Cloud Data Stores via Adaptive Replica Selection

Lalith Suresh (TU Berlin)

with Marco Canini (UCL), Stefan Schmid, Anja Feldmann (TU Berlin)

Page 2: Tail latency matters

One user request

Tens to thousands of data accesses

Page 3: Tail latency matters

One user request

Tens to thousands of data accesses

For 100 leaf servers, the 99th percentile latency will be reflected in 63% of user requests!
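The 63% figure on this slide can be checked with a short calculation. The sketch below assumes that each of the 100 accesses independently has a 1% chance of exceeding the per-server 99th percentile latency and that the user request waits for all of them. The function name `fraction_hitting_tail` is made up for this illustration.

```python
def fraction_hitting_tail(fan_out: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `fan_out` independent accesses is
    slower than the given per-server latency percentile."""
    # Chance that every access stays below the percentile: percentile ** fan_out.
    return 1.0 - percentile ** fan_out


if __name__ == "__main__":
    # 1 - 0.99 ** 100 is roughly 0.634, i.e. about 63% of user requests.
    print(round(fraction_hitting_tail(100), 3))
```

Under the same independence assumption, a larger fan-out pushes this fraction even closer to 1.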

Page 4: Server performance fluctuations are the norm

Queueing delays

Skewed access patterns

Resource contention

Background activities

[Figure: CDF plot]

Page 5: Effectiveness of replica selection in reducing tail latency?

[Diagram: a client with one request and three candidate servers]

Page 6: Replica Selection Challenges

Page 7: Replica Selection Challenges

• Service-time variations

[Diagram: a client with one request and three servers labeled 4 ms, 5 ms, and 30 ms]

Page 8: Replica Selection Challenges

• Herd behavior and load oscillations

[Diagram: three clients, each with one request, and three servers]
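A toy model can make the herd effect named on this slide concrete. Everything in the sketch is an assumption made for illustration and not taken from the deck: all clients share the same stale view of queue sizes from the previous round, each sends one request per round to the server that looked shortest, and servers drain fully between rounds.

```python
def simulate_herd(num_clients=30, num_servers=3, rounds=6):
    """Return the per-round queue sizes when every client greedily picks the
    server whose queue looked shortest at the end of the previous round."""
    last_seen = [0] * num_servers  # stale queue sizes shared by all clients
    history = []
    for _ in range(rounds):
        queues = [0] * num_servers
        # Every client sees the same stale data, so all make the same choice.
        target = min(range(num_servers), key=lambda s: last_seen[s])
        for _ in range(num_clients):
            queues[target] += 1
        history.append(queues)
        last_seen = queues  # toy assumption: queues drain before the next round
    return history


if __name__ == "__main__":
    for round_number, queues in enumerate(simulate_herd(), start=1):
        print(round_number, queues)
```

In this model the whole client population lands on one server in each round and then moves away together in the next, which is the oscillation pattern the slide warns about.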

Page 9: Impact of Replica Selection in Practice?

Dynamic Snitching

Uses history of read latencies and I/O load for replica selection

Page 10: Experimental Setup

• Cassandra cluster on Amazon EC2

• 15 nodes, m1.xlarge instances

• Read-heavy workload with YCSB (120 threads)

• 500M 1KB records (larger than memory)

• Zipfian key access pattern

Page 11: Cassandra Load Profile

Page 12: Cassandra Load Profile

Also observed that the 99.9th percentile latency is roughly 10x the median latency

Page 13: Load Conditioning in our Approach

Page 14: C3

Adaptive replica selection mechanism that is robust to service-time heterogeneity

Page 15: C3

• Replica Ranking

• Distributed Rate Control

Page 16: C3

• Replica Ranking

• Distributed Rate Control

Page 17:

[Diagram: three clients and two servers with service times $\mu^{-1}$ = 2 ms and $\mu^{-1}$ = 6 ms]

Page 18:

Balance the product of queue size and service time { $q \cdot \mu^{-1}$ }

[Diagram: three clients and two servers with service times $\mu^{-1}$ = 2 ms and $\mu^{-1}$ = 6 ms]

Page 19: Server-side Feedback

Servers piggyback { $q_s$ } and { $\mu_s^{-1}$ } in every response

[Diagram: the server returns { $q_s$, $\mu_s^{-1}$ } to the client]

Page 20: Server-side Feedback

Servers piggyback { $q_s$ } and { $\mu_s^{-1}$ } in every response

• Concurrency compensation

Page 21: Server-side Feedback

Servers piggyback { $q_s$ } and { $\mu_s^{-1}$ } in every response

• Concurrency compensation

$\hat{q}_s = 1 + os_s \cdot w + q_s$

Labels: $os_s$ = outstanding requests, $q_s$ = feedback
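The concurrency-compensation formula above can be written as a one-line function. This is only a sketch: the slide does not define the weight $w$, so it is passed in as a plain parameter here, and the function and argument names are invented for this example.

```python
def estimated_queue_size(outstanding: int, feedback_queue: float, w: float) -> float:
    """Client-side queue-size estimate: q_hat_s = 1 + os_s * w + q_s.

    outstanding    -- os_s, requests this client has in flight to server s
    feedback_queue -- q_s, the queue size the server piggybacked on a response
    w              -- weight on outstanding requests (not defined on the slide)
    """
    return 1 + outstanding * w + feedback_queue


if __name__ == "__main__":
    # 3 requests in flight, server last reported a queue of 10, w assumed to be 2.
    print(estimated_queue_size(outstanding=3, feedback_queue=10, w=2))  # 17
```

The more requests a client already has in flight to a server, the larger its estimate becomes, which makes the client less likely to pile further requests onto that server before fresh feedback arrives.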

Page 22:

Select server with min $\hat{q}_s \cdot \mu_s^{-1}$ ?

Page 23:

Select server with min $\hat{q}_s \cdot \mu_s^{-1}$ ?

[Diagram: two servers. The scores balance when the server with $\mu^{-1}$ = 4 ms holds 100 requests! and the server with $\mu^{-1}$ = 20 ms holds 20 requests, since 4 ms x 100 = 20 ms x 20]

• Potentially long queue sizes

• What if a GC pause happens?

Page 24: Penalizing Long Queues

Select server with min $(\hat{q}_s)^b \cdot \mu_s^{-1}$

b = 3

[Diagram: two servers. With b = 3 the scores roughly balance when the server with $\mu^{-1}$ = 4 ms holds 35 requests and the server with $\mu^{-1}$ = 20 ms holds 20 requests]
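The queue sizes quoted on the last two slides follow from equating the two servers' scores. The sketch below uses the scoring rule as written on the slide, queue estimate raised to the power b times service time. The helper names `score` and `balancing_queue` are invented for this illustration.

```python
def score(q_hat: float, service_time_ms: float, b: int = 3) -> float:
    """Replica score from the slide: q_hat ** b * service time (lower is better)."""
    return (q_hat ** b) * service_time_ms


def balancing_queue(slow_queue: float, slow_ms: float, fast_ms: float, b: int) -> float:
    """Queue size at which the fast server's score equals the slow server's.

    Solving q_fast ** b * fast_ms == slow_queue ** b * slow_ms for q_fast.
    """
    return slow_queue * (slow_ms / fast_ms) ** (1.0 / b)


if __name__ == "__main__":
    # Slow server: 20 ms service time, 20 queued requests. Fast server: 4 ms.
    print(balancing_queue(20, 20, 4, b=1))  # linear scoring: 100 requests
    print(balancing_queue(20, 20, 4, b=3))  # cubic scoring: about 34.2 requests
```

With b = 1 the fast server absorbs five times the slow server's queue before the client switches. With b = 3 the gap shrinks to a factor of about 1.7, which limits how much work is stranded if the fast server suddenly stalls.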

Page 25: C3

• Replica Ranking

• Distributed Rate Control

Page 26: Need for rate control

Replica ranking alone is insufficient

• Avoid saturating individual servers?

• Non-internal sources of performance fluctuations?

Page 27: Cubic Rate Control

• Clients adjust sending rates according to a cubic function

• If the receive rate isn't increasing further, multiplicatively decrease
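The two bullets above can be sketched as a small controller. The slide gives no formula, so the growth curve here borrows the shape of TCP CUBIC's window function as an assumption, and the constants `beta` and `gamma`, the class name, and the `update` interface are all hypothetical.

```python
def cubic_rate(elapsed: float, saturation_rate: float,
               beta: float = 0.2, gamma: float = 0.5) -> float:
    """Sending rate `elapsed` seconds after the last decrease.

    Starts at saturation_rate * (1 - beta), flattens out near
    saturation_rate, then probes above it.
    """
    k = (beta * saturation_rate / gamma) ** (1.0 / 3.0)  # time to reach the plateau
    return gamma * (elapsed - k) ** 3 + saturation_rate


class RateController:
    """Toy per-server sending-rate controller (assumed design, not from the deck)."""

    def __init__(self, initial_rate: float, beta: float = 0.2, gamma: float = 0.5):
        self.beta, self.gamma = beta, gamma
        # Start as if a decrease had just happened, so the curve begins at initial_rate.
        self.saturation_rate = initial_rate / (1.0 - beta)
        self.elapsed = 0.0
        self.rate = initial_rate

    def update(self, dt: float, receive_rate_increasing: bool) -> float:
        if receive_rate_increasing:
            # Keep climbing along the cubic curve.
            self.elapsed += dt
            self.rate = cubic_rate(self.elapsed, self.saturation_rate,
                                   self.beta, self.gamma)
        else:
            # Receive rate has stalled: remember this rate and back off.
            self.saturation_rate = self.rate
            self.rate *= (1.0 - self.beta)
            self.elapsed = 0.0
        return self.rate
```

The cubic shape means the client ramps up quickly when far below the last saturation point, slows down as it approaches it, and only then probes for more capacity.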

Page 28: Putting everything together

[Diagram: a C3 client containing a replica group scheduler that sorts replicas by score, plus rate limiters labeled 1000 req/s and 2000 req/s in front of two servers; the servers return { Feedback } to the client]
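The flow in this diagram, sorting replicas by score and then passing through a rate limiter, might look roughly like the sketch below. It is a simplified illustration and not the Cassandra implementation: the `Replica` class, the fixed one-second counting window, and returning `None` when every replica is at its limit are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    score: float             # ranking score, lower is better
    rate_limit: float        # allowed requests in the current window
    sent_in_window: int = 0  # requests already sent in the current window

    def within_limit(self) -> bool:
        return self.sent_in_window < self.rate_limit


def pick_replica(replica_group):
    """Return the best-scored replica that is still within its rate limit."""
    for replica in sorted(replica_group, key=lambda r: r.score):
        if replica.within_limit():
            replica.sent_in_window += 1
            return replica
    return None  # all replicas are rate-limited; the caller has to hold the request


if __name__ == "__main__":
    group = [Replica("a", score=5.0, rate_limit=1),
             Replica("b", score=9.0, rate_limit=2)]
    print([getattr(pick_replica(group), "name", None) for _ in range(4)])
    # ['a', 'b', 'b', None]
```

Ranking decides the preferred order, while the limiter caps how much any single client can send to one server, so a well-scored replica is not flooded.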

Page 29: Implementation in Cassandra

Details in the paper!

Page 30: Evaluation

• Amazon EC2

• Controlled Testbed

• Simulations

Page 31: Evaluation

Amazon EC2

• 15-node Cassandra cluster

• m1.xlarge instances

• Workloads generated using YCSB (120 threads)

• Read-heavy, update-heavy, read-only

• 500M 1KB records dataset (larger than memory)

• Compare against Cassandra's Dynamic Snitching (DS)

Page 32:

[Figure: lower is better]

Page 33:

2x to 3x improved 99.9th percentile latencies

Also improves median and mean latencies

Page 34:

2x to 3x improved 99.9th percentile latencies

26% to 43% improved throughput

Page 35:

Takeaway:

C3 does not trade off throughput for latency

Page 36: How does C3 react to dynamic workload changes?

• Begin with 80 read-heavy workload generators

• 40 update-heavy generators join the system after 640 s

• Observe latency profile with and without C3

Page 37:

Latency profile degrades gracefully with C3

Takeaway: C3 reacts effectively to dynamic workloads

Page 38: Summary of other results

Higher system load

Skewed record sizes

SSDs instead of HDDs

> 3x better 99.9th percentile latency

50% higher throughput than with DS

Page 39: Ongoing work

• Tests at SoundCloud and Spotify

• Stability analysis of C3

• Alternative rate adaptation algorithms

• Token-aware Cassandra clients

Page 40: Summary

C3: Replica Ranking + Dist. Rate Control

[Diagram: a client with a "?" and three candidate servers]