Predicting Replicated Database Scalability


TRANSCRIPT

Predicting Replicated Database Scalability

Sameh Elnikety, Microsoft Research
Steven Dropsho, Google Inc.
Emmanuel Cecchet, Univ. of Mass.
Willy Zwaenepoel, EPFL

Motivation

• Environment
– E-commerce website
– DB throughput is 500 tps
• Is 5000 tps achievable?
– Yes: use 10 replicas
– Yes: use 16 replicas
– No: faster machines needed
• How does a transaction workload scale on a replicated database?

[Figure: standalone DBMS]

Background: Multi-Master vs. Single-Master

[Figure: multi-master (replicas 1, 2, 3 as peers) vs. single-master (master with slaves 1 and 2)]

Background: Multi-Master

[Figure: each replica runs a standalone DBMS; a load balancer distributes client transactions across replicas 1, 2, 3]

Read Tx

[Figure: the load balancer sends a read-only transaction T to a single replica]

A read-only tx does not change the DB state, so it executes at one replica with no coordination among replicas.

Update Tx

[Figure: update transaction T executes at one replica; its writeset (ws) passes through the certifier (Cert) and is propagated to all other replicas]

An update tx changes the DB state: after local execution its writeset is certified and then applied at every other replica.

Additional Replica

[Figure: a fourth replica is added; the certifier forwards every committed writeset (ws) to all replicas, including the new one]

An additional replica adds read capacity, but it must also apply every writeset produced in the system.

Coming Up …

• Standalone DBMS
– Service demands
• Multi-master system
– Service demands
– Queuing model
• Experimental validation

Standalone DBMS

• Required throughput
– readonly tx: R
– update tx: W
• Transaction load actually submitted
– readonly tx: R
– update tx: W / (1 - A1)

[Figure: single DBMS]

The abort probability is A1, so committing W update txs requires submitting W / (1 - A1):
– Committed txs: W
– Aborted txs: W ∙ A1 / (1 - A1)

Standalone DBMS

Load(1) = R ∙ rc + (W / (1 - A1)) ∙ wc

(rc and wc are the costs of executing one readonly tx and one update tx, respectively)

Service Demand

• Required: readonly txs R, update txs W
• Transaction load: readonly txs R, update txs W / (1 - A1)

Load(1) = R ∙ rc + (W / (1 - A1)) ∙ wc

D(1) = Pr ∙ rc + (Pw / (1 - A1)) ∙ wc

(Pr and Pw are the fractions of readonly and update transactions in the workload; a code sketch of these formulas follows below)
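As a minimal sketch, the two standalone formulas translate directly into Python (Python, since the deck's own profiling script is Python). The function names standalone_load and standalone_demand are mine; R, W, Pr, Pw, rc, wc, and A1 follow the slides.

# rc, wc: cost of one readonly / update transaction (seconds)
# R, W:   required readonly / update throughput (tps)
# Pr, Pw: fractions of readonly / update transactions (Pr + Pw = 1)
# A1:     abort probability on the standalone system

def standalone_load(R, W, rc, wc, A1):
    """Offered load: aborted update txs are resubmitted, so committing
    W updates requires W / (1 - A1) submissions."""
    return R * rc + (W / (1 - A1)) * wc

def standalone_demand(Pr, Pw, rc, wc, A1):
    """Average service demand of a single transaction."""
    return Pr * rc + (Pw / (1 - A1)) * wc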

Multi-Master with N Replicas

• Required (whole system of N replicas)
– Readonly txs: N ∙ R
– Update txs: N ∙ W
• Transaction load per replica
– Readonly txs: R
– Update txs: W / (1 - AN)
– Writesets: W ∙ (N - 1)

LoadMM(N) = R ∙ rc + (W / (1 - AN)) ∙ wc + W ∙ (N - 1) ∙ ws

MM Service Demand

LoadMM(N) = R ∙ rc + (W / (1 - AN)) ∙ wc + W ∙ (N - 1) ∙ ws

DMM(N) = Pr ∙ rc + (Pw / (1 - AN)) ∙ wc + Pw ∙ (N - 1) ∙ ws

Explosive cost: the writeset term Pw ∙ (N - 1) ∙ ws grows linearly with the number of replicas. (A code sketch follows below.)
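As a companion to the standalone sketch, here is a hedged Python version of the multi-master formulas; again the function names are mine, while ws and AN follow the slides.

def mm_load(N, R, W, rc, wc, ws, A_N):
    """Per-replica load with N replicas: local readonly txs, local
    update txs (resubmitted on abort), plus applying the writesets
    produced by the other N - 1 replicas."""
    return R * rc + (W / (1 - A_N)) * wc + W * (N - 1) * ws

def mm_demand(N, Pr, Pw, rc, wc, ws, A_N):
    """Per-transaction service demand; Pw * (N - 1) * ws is the
    linearly growing ('explosive') writeset-application cost."""
    return Pr * rc + (Pw / (1 - A_N)) * wc + Pw * (N - 1) * ws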

Compare: Standalone vs MM

• Standalone: D(1) = Pr ∙ rc + (Pw / (1 - A1)) ∙ wc
• Multi-Master: DMM(N) = Pr ∙ rc + (Pw / (1 - AN)) ∙ wc + Pw ∙ (N - 1) ∙ ws (explosive cost!)

Readonly Workload

• Standalone: with Pw = 0, only the read term remains: D(1) = rc
• Multi-Master: DMM(N) = rc as well; there are no writesets to apply, so N replicas deliver close to N times the throughput

Update Workload

• Standalone: with Pr = 0, D(1) = wc / (1 - A1)
• Multi-Master: DMM(N) = wc / (1 - AN) + (N - 1) ∙ ws; the writeset term grows with every added replica, so scalability flattens

Closed-Loop Queuing Model

[Figure: closed queuing network over N replicas. Each replica i is modeled as two queueing centers (CPU, Disk). Delay centers capture client think time (TT), load balancer & network delay (LB), and certifier delay (Cert); the update fraction Pw of transactions passes through the certifier.]

Mean Value Analysis (MVA)

• Standard algorithm that iterates over the number of clients
• Inputs:
– Number of clients
– Service demand at each service center
– Delay time at each delay center
• Outputs:
– Response time
– Throughput
(a sketch follows below)
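The slides treat MVA as a black box, so what follows is only a textbook sketch of exact single-class MVA, not the authors' implementation: the queueing centers carry the service demands, and think time, load balancer, and certifier are lumped into one delay term.

def mva(num_clients, demands, delay):
    """Exact single-class Mean Value Analysis.
    demands: service demand (s) at each queueing center (CPU, disk, ...)
    delay:   total time (s) at delay centers (think time, LB, certifier)
    Returns (throughput, response time at the queueing centers)."""
    queue = [0.0] * len(demands)              # mean queue length per center
    throughput, response = 0.0, 0.0
    for n in range(1, num_clients + 1):
        # residence time per center: own service plus queueing behind others
        residence = [d * (1 + q) for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n / (response + delay)   # closed-system throughput
        queue = [throughput * r for r in residence]  # Little's law per center
    return throughput, response

With one client the queue terms vanish and throughput is 1 / (D + Z); as clients grow, throughput saturates at 1 / max(demands), the bottleneck bound.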

Using the Model

[Figure: the same closed queuing network as above; profiling supplies the service demands, and MVA solves the model]

Standalone Profiling (Offline)

• Take a copy of the database
• Log all txs and the mix (Pr : Pw)
• A Python script replays the txs to measure
– readonly tx cost (rc)
– update tx cost (wc)
• Writesets
– Instrument the db with triggers
– Play the txs to log their writesets
– Play the writesets to measure their cost (ws)
(a hedged replay sketch follows below)
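The deck names a Python replay script but shows no code, so the following is a guess at its skeleton: it assumes a psycopg2 connection and a pre-parsed log of (kind, sql) pairs, and it approximates rc and wc by wall-clock time, which stands in for service demand only if the replay runs at low load.

import time
import psycopg2  # any DB-API 2.0 driver would do

def replay(log, dsn):
    """Replay logged transactions against a database copy and estimate
    the average cost of readonly (rc) and update (wc) transactions.
    log: iterable of (kind, sql) pairs, kind in {'read', 'update'}."""
    conn = psycopg2.connect(dsn)
    costs = {"read": [], "update": []}
    for kind, sql in log:
        start = time.perf_counter()
        with conn.cursor() as cur:
            cur.execute(sql)
        conn.commit()
        costs[kind].append(time.perf_counter() - start)
    conn.close()
    rc = sum(costs["read"]) / len(costs["read"])
    wc = sum(costs["update"]) / len(costs["update"])
    return rc, wc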

MM Service Demand (recap)

DMM(N) = Pr ∙ rc + (Pw / (1 - AN)) ∙ wc + Pw ∙ (N - 1) ∙ ws (explosive cost!)

Profiling supplies rc, wc, and ws; the remaining unknown is the abort probability AN.

Abort Probability

AN = 1 - (1 - A1)^(CW(N) / CW(1)), with conflict window CW(N) = N ∙ L(N)

Each writeset certified during a transaction's conflict window is a chance to conflict, and the window grows with the number of replicas N and the update-tx lifetime L(N). (A code sketch follows.)
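Reading the reconstructed formula as code, with the caveat that both the formula's exact shape and the lifetime inputs are my reconstruction; the slides do not show where L(N) comes from (a natural choice is the update-tx response time produced by the queuing model).

def abort_probability(A1, N, L_N, L_1):
    """Scale the standalone abort probability A1 to N replicas.
    Each writeset certified during a tx's conflict window is an
    independent chance to conflict; the window CW(N) = N * L(N)
    grows with the replica count and the tx lifetime."""
    cw_ratio = (N * L_N) / (1 * L_1)
    return 1.0 - (1.0 - A1) ** cw_ratio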

Using the Model

[Figure: the closed queuing network again, annotated with its parameters]

• Number of clients and think time: taken from the workload
• Certifier delay: 1.5 ∙ fsync() time
• Load balancer & network delay: 1 ms

An end-to-end example of plugging these parameters in follows below.
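As a hypothetical illustration only, here is how the sketches defined earlier in this transcript (abort_probability, mm_demand, mva) could be combined to predict multi-master throughput. Every numeric value is invented, and each replica's CPU and disk are collapsed into a single queueing center for brevity.

# Profiled standalone costs (seconds per tx) -- made-up values
rc, wc, ws = 0.010, 0.030, 0.005
Pr, Pw, A1 = 0.80, 0.20, 0.01          # workload mix, standalone abort rate
clients, think = 200, 1.0              # closed-loop workload
lb_delay = 0.001                       # 1 ms load balancer & network
cert_delay = 1.5 * 0.008               # 1.5 * fsync(), assuming ~8 ms per fsync

N = 8                                  # replicas to evaluate
A_N = abort_probability(A1, N, L_N=0.05, L_1=0.05)   # assumed tx lifetimes
D = mm_demand(N, Pr, Pw, rc, wc, ws, A_N)
X, R = mva(clients // N, [D], think + lb_delay + cert_delay)
print(f"~{N * X:.0f} tps system throughput, ~{R * 1000:.1f} ms response")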

Experimental Validation

• Compare measured performance against model predictions
• Environment: Linux cluster running PostgreSQL
• TPC-W workload
– Browsing (5% update txs)
– Shopping (20% update txs)
– Ordering (50% update txs)
• RUBiS workload
– Browsing (0% update txs)
– Bidding (20% update txs)

Multi-Master TPC-W Performance

[Graphs: measured vs. predicted throughput and response time as replicas are added]

• Browsing (5% update txs): scales to 15.7x
• Ordering (50% update txs): scales to 6.7x (predictions within 15%)

Multi-Master RUBiS Performance

[Graphs: measured vs. predicted throughput and response time as replicas are added]

• Browsing (0% update txs): scales to 16x
• Bidding (20% update txs): scales to 3.4x

Model Assumptions

• Database system
– Snapshot isolation
– No hotspots
– Low abort rates
• Server system
– Scalable server (no thrashing)
• Queuing model & MVA
– Exponential distribution for service demands

Check Out the Paper

• Models
– Single-master
– Multi-master
• Experimental results
– TPC-W
– RUBiS
• Sensitivity analysis
– Abort rates
– Certifier delay

Related Work

Urgaonkar, Pacifici, Shenoy, Spreitzer, Tantawi. “An analytical model for multi-tier internet services and its applications.” SIGMETRICS 2005.

Conclusions

• Derived an analytical model that predicts workload scalability
• Implemented replicated systems
– Multi-master
– Single-master
• Experimental validation
– TPC-W
– RUBiS
– Throughput predictions match within 15%

• Questions?

Thank you!

Predicting Replicated Database Scalability