TRANSCRIPT
Adaptive Replication and Partitioning
in Data Systems
Brad Glasbergen, Michael Abebe,
Khuzaima Daudjee
Middleware 2018
Data Systems Group
PAGE 2
PAGE 3
PAGE 4
PAGE 5
S1
A
Single Node Architecture
B
Overloaded
PAGE 6
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 7
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 8
S1
A
Replicated Architecture
B
S2 S3
A B A B
Handle more requests PAGE 9
S1
A
Replicated Architecture
B
S2 S3
A B A B
Cost of coordination
PAGE 10
S1
A
Replicated Architecture
S2 S3
A B B
How many replicas?
PAGE 11
S1
A
Replicated Architecture
S2 S3
A B B
W[ A ] W[ A ]
PAGE 12
S1
A
Replicated Architecture
S2 S3
AB B
W[ A ] W[ A ]
Where to place replicas?
PAGE 13
S1
A
Replicated Architecture
S2 S3
AB B
How to propagate updates?
W[ A ]
PAGE 14
● (A)synchronous
● Consistency
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
PAGE 15
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 16
Partitioned Architecture
Distributes requests S1
A
S2
B
PAGE 17
Partitioned Architecture
S1 S2
BA
PAGE 18
Partitioned Architecture
How to form partitions? S1
A1
S2
BA2
PAGE 19
Partitioned Architecture
Where to place partitions? S1
A1
S2
BA2
PAGE 20
Partitioned Architecture
How to execute multi-partition operations?
S1
A
S2
B
PAGE 21
Partitioning Decisions
PAGE 22
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Where to place partitions?
S1
A1
S3
BA2
PAGE 23
How to make a partitioning or replication decision when access patterns change?
Static Decisions
Why do access patterns change?
PAGE 24
Why do accesses change?
Humans have follow-the-sun cycles
PAGE 25
Why do accesses change?
Load bursts
PAGE 26
Why do accesses change?
Shifting hot-spots
PAGE 27
How to make a partitioning or replication decision when access patterns change?
Static Decisions
Adaptively replicate and partition
PAGE 28
Where to place partitions?
S1
A1
S2
BA2
PAGE 29
Where to place partitions?
S1
A1
S2
A2B
PAGE 30
How many replicas?
S1
A1
S2
A2B
R[ B ]
PAGE 31
How many replicas?
S1
A1
S2
A2B
R[ B ]
B
PAGE 32
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 33
Adaptive Replication
PAGE 34
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
PAGE 35
Adaptive Replication
● Decentralized
● Geo-Distributed
● Caching
● Availability
PAGE 36
S2
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 37
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
S2 S1 S3
PAGE 38
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 39
S2
Local Read
(Wolfson et al., TODS 1997)
A
R[A] S1 S3
PAGE 40
Remote Read
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 41
R[A] S2
Remote Read
(Wolfson et al., TODS 1997)
A
S1 S3
PAGE 42
R[A] S2
Remote Read
(Wolfson et al., TODS 1997)
A
S1 S3
More Messages!
PAGE 43
R[A] S2
Replication for Reads
(Wolfson et al., TODS 1997)
A
S1 S3
A’
PAGE 44
W[A] S2
No Free Lunch
(Wolfson et al., TODS 1997)
A
S1 S3
A’
PAGE 45
W[A] S2
No Free Lunch
(Wolfson et al., TODS 1997)
A
S1 S3
A’
Reduces read cost, increases write cost
PAGE 46
When to Replicate?
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 47
S2
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 15
Writes: 5
Writes: 5
S1 S3
PAGE 48
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 0
Writes: 5
Writes: 5
A’
Writes: 10
S2 S1 S3
PAGE 49
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 0
Writes: 5
Writes: 5
A’
Writes: 10
5 Fewer Messages!
S2 S1 S3
PAGE 50
When to Stop Replicating?
(Wolfson et al., TODS 1997)
AA’
S2 S1 S3
PAGE 51
S2
When to Stop Replicating?
(Wolfson et al., TODS 1997)
AA’
Writes: 20, Reads: 5
Reads: 5
S1 S3
PAGE 52
S2
When to Stop Replicating?
(Wolfson et al., TODS 1997)
A
Writes: 20, Reads: 5
Reads: 5
10 Fewer Messages!
S1 S3
PAGE 53
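The expansion/contraction test on the preceding slides can be sketched as follows. This is a hypothetical simplification of the ADR rule (Wolfson et al., TODS 1997): compare message counts saved by a local replica against the write-propagation messages it would cost; the counter values are the ones from the slides.

```python
# Simplified ADR-style replication test, comparing message counts.
def should_replicate(remote_reads, incoming_writes):
    # Expansion: a local copy saves one message per remote read but
    # costs one message per propagated write.
    return remote_reads > incoming_writes

def should_drop(local_reads, propagated_writes):
    # Contraction: drop the copy when propagating writes to it costs
    # more messages than the reads it serves locally.
    return propagated_writes > local_reads

# Numbers from the slides: S2 sees 15 reads vs. 5 + 5 writes
print(should_replicate(remote_reads=15, incoming_writes=10))  # True: 5 fewer messages
print(should_drop(local_reads=10, propagated_writes=20))      # True: 10 fewer messages
```

Because each site evaluates these counters on its own traffic, the decision stays decentralized and re-runs as the workload shifts.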
Decentralized Decisions
(Wolfson et al., TODS 1997)
A
S2 S1 S3
Replicate?
PAGE 54
Adapts to Changing Workloads
(Wolfson et al., TODS 1997)
A
Replicate?
S2 S1 S3
More Reads
A’
PAGE 55
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 56
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 57
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
Form Tree
PAGE 58
Extensions (Blocks)
(Wolfson et al., TODS 1997)
S2 S1 S3
A
B
C
⅓ R[A] + ⅓ R[B] + ⅓ R[C] ⅓ W[A] + ⅓ W[B] + ⅓ W[C]
A’
B’
C’
PAGE 59
Peer-to-Peer File Systems
PAGE 60
Hash-Based P2P File System
h(s) = 275
PAGE 61
Hash-Based P2P File System
PAGE 62
Hash-Based P2P File System
PAGE 63
Hash-Based P2P File System
PAGE 64
Hash-Based P2P FS
Balanced load, but poor access locality!
PAGE 65
Hierarchical P2P File Systems
PAGE 66
Hierarchical P2P File Systems
PAGE 67
Hierarchical P2P File Systems
PAGE 68
Hierarchical P2P File Systems
PAGE 69
Hierarchical P2P File System
Good locality, but poor balance
PAGE 70
(Gopalakrishnan et al., ICDCS’04)
Overloaded
Moderate Load
Light Load
Locality and Load Balance
PAGE 71
(Gopalakrishnan et al., ICDCS’04)
Overloaded
Moderate Load
Light Load
Locality and Load Balance
PAGE 72
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 73
No replication
(Gopalakrishnan et al., ICDCS’04)
Replicate to balance load!
Locality and Load Balance
PAGE 74
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 75
Locality and Load Balance
(Gopalakrishnan et al., ICDCS’04)
Replicate to balance load!
PAGE 76
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 77
Decentralized Replicas
(Gopalakrishnan et al., ICDCS’04)
Writes?
PAGE 78
● How many replicas? Decentralized decision
● Where to place replicas? At the requester
● How to propagate updates? ADR: synchronous; P2P: read-only
Replication Decisions
(Gopalakrishnan et al., ICDCS’04) (Wolfson et al., TODS 1997)
PAGE 79
PAGE 80
Global Scale Replication
Average Latency: 31 ms
PAGE 81
PAGE 82
Global Scale Replication
Average Latency: 113 ms
PAGE 83
PAGE 84
Global Scale Replication
Average Latency: 95 ms
PAGE 85
Global Scale Replication
Average Latency: 64 ms
PAGE 86
Global Scale Replication
Place data around the world
Take me to your leader! (Sharov et al., VLDB 2015)
Minimize cost of access
GPlacer (Zakhary et al., EDBT 2018)
PAGE 87
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
A B
S2
A B
S3
A B
Replicas
PAGE 88
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
a b
S2
A B
S3
A B
Read-Write Replicas
Read Replicas
State changes
PAGE 89
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
a b
S2
A B
S3
A B
Read-Write Replicas
Read Replicas
Leader coordinates state changes
PAGE 90
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 91
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 92
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
PAGE 93
Write transaction cost
(Sharov et al., VLDB 2015)
S1: A B, S2: A B, S3: A B (read-write), S4: a b (read)
Send write to leader: RTT( S3, S1 )
+ quorum writes: median RTT( S1, {S1, S2, S3} )
PAGE 94
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
Client cost: RTT(client, replica) + cost( transaction )
PAGE 95
Weighting Client Cost
(Sharov et al., VLDB 2015)PAGE 96
Weighting Client Cost
(Sharov et al., VLDB 2015)PAGE 97
2 writes2 reads
10 writes20 reads
5 writes5 reads
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
Client cost: RTT(client, replica) + cost( transaction )
Cost: Weighted average of client costs
PAGE 98
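The leader rule above can be sketched as a small search: try each site as leader and pick the one minimizing the weighted average client cost, where a client's cost is its RTT to the leader plus the quorum round. This is an illustrative simplification of Sharov et al.; the RTT matrix and client weights below are made-up numbers.

```python
# Hypothetical leader assignment: minimize weighted average client cost.
import statistics

sites = ["S1", "S2", "S3"]
rtt = {("S1", "S1"): 0, ("S1", "S2"): 30, ("S1", "S3"): 50,
       ("S2", "S1"): 30, ("S2", "S2"): 0, ("S2", "S3"): 40,
       ("S3", "S1"): 50, ("S3", "S2"): 40, ("S3", "S3"): 0}
clients = {"S1": 4, "S2": 30, "S3": 10}  # client site -> transaction count (weight)

def write_cost(client, leader):
    # Send the write to the leader, then the leader runs a quorum round:
    # median RTT from the leader to the read-write replicas.
    quorum = statistics.median(rtt[(leader, s)] for s in sites)
    return rtt[(client, leader)] + quorum

def pick_leader():
    def avg_cost(leader):
        total = sum(w * write_cost(c, leader) for c, w in clients.items())
        return total / sum(clients.values())
    return min(sites, key=avg_cost)

print(pick_leader())  # S2: closest to the heavy client population
```

Weighting by transaction counts is what pulls the leader toward the busiest region, matching the "2 writes / 20 reads / ..." weighting on the slide.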
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 99
Assign Replica Roles
Leader: minimizes median RTT to read-write replicas
(Sharov et al., VLDB 2015)
Read-write replicas:
PAGE 100
Write transaction cost
(Sharov et al., VLDB 2015)
S1: A B, S2: A B, S3: A B (read-write), S4: a b (read)
Send write to leader: RTT( S3, S1 )
+ quorum writes: median RTT( S1, {S1, S2, S3} )
PAGE 101
Assign Replica Roles
Leader: minimizes median RTT to read-write replicas
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
PAGE 102
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
Global Replication Problem
(Sharov et al., VLDB 2015)PAGE 103
Replica selection
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
Read replicas:
Leader: minimizes median RTT to read-write replicas
PAGE 104
Replica selection
(Sharov et al., VLDB 2015)PAGE 105
Client cost: RTT(client, replica) + cost( transaction )
Replica selection
(Sharov et al., VLDB 2015)PAGE 106
Client cost: RTT(client, replica) + cost( transaction )
Read replicas: Lowest RTT to clients
Replica selection
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
Read replicas: Lowest RTT to clients
Leader: minimizes median RTT to read-write replicas
PAGE 107
K-Means Replica selection
(Sharov et al., VLDB 2015)PAGE 108
K-Means Replica selection
(Sharov et al., VLDB 2015)
Average Latency: 55 ms
PAGE 109
K-Means Replica selection
Select replicas
(Sharov et al., VLDB 2015)
Assign leader and read-write replicas
PAGE 110
Leaderless Protocols
(Zakhary et al., EDBT 2018)
S1
A B
S4
a b
S2
A B
S3
A B
Any quorum member can coordinate
PAGE 111
Hinted Handoff
(Zakhary et al., EDBT 2018)
PAGE 112
Hinted Handoff
295 < 301
(Zakhary et al., EDBT 2018)PAGE 113
Hinted Handoff
(Zakhary et al., EDBT 2018)
Hand off request from S1 to S2 if:
cost( S1 ) > RTT( S1, S2) + cost( S2 )
cost( S ) = cost of executing request at S
PAGE 114
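The handoff rule above is a one-line comparison; the sketch below just makes it executable, with the 301/136/159 figures taken from the slides (any other cost values are illustrative).

```python
# GPlacer-style hinted handoff: S1 hands a request to S2 only if executing
# remotely, including the extra hop, is cheaper than executing locally.
def should_hand_off(cost_s1, rtt_s1_s2, cost_s2):
    # Hand off from S1 to S2 if: cost(S1) > RTT(S1, S2) + cost(S2)
    return cost_s1 > rtt_s1_s2 + cost_s2

print(should_hand_off(301, 136, 159))  # True: 295 < 301, as on the slide
print(should_hand_off(200, 136, 159))  # False: local execution is cheaper
```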
● How many replicas? Centralized, given client workload
● Where to place replicas? Heuristic (clustering)
● How to propagate updates? Quorums / Leader-based (Sharov)
Replication Decisions
(Zakhary et al., EDBT 2018)(Sharov et al., VLDB 2015)PAGE 115
Intra-Region Latency
(Sivasubramanian et al., WWW 2005)
100+ ms
PAGE 116
Edge Nodes
(Sivasubramanian et al., WWW 2005)PAGE 117
GlobeDB
(Sivasubramanian et al., WWW 2005)
Supports static data. Dynamic data?
PAGE 118
Replication Granularity
(Sivasubramanian et al., WWW 2005)
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
PAGE 119
Replication Granularity
(Sivasubramanian et al., WWW 2005)
Per-Record? High overhead
PAGE 120
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Replication Granularity
(Sivasubramanian et al., WWW 2005)
Per-Table? Inflexible
PAGE 121
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
When would these be replicated together?
PAGE 122
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_bryan = <r1, ..., rn, w1, ..., wn>
PAGE 123
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
Sim(A_bryan, A_justin) ≥ τ?
PAGE 124
A_justin = <r1, ..., rn, w1, ..., wn>
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
Shared Replication Scheme
PAGE 125
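The grouping test above can be sketched as follows: each record carries a vector of per-edge-server read/write counts, and records whose vectors are similar enough (Sim ≥ τ) share one replication scheme. Cosine similarity is used here as an illustrative stand-in for the paper's metric, and the vectors are made-up counts.

```python
# GlobeDB-style access clustering: group records with similar access vectors.
import math

def sim(a, b):
    # Cosine similarity between two access vectors (assumed metric).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

TAU = 0.9
# <r1..rn, w1..wn> reads/writes per edge server, for three records
a_bryan = [10, 2, 0, 1, 0, 0]
a_justin = [9, 3, 0, 1, 0, 0]
a_avril = [0, 0, 8, 0, 0, 5]

print(sim(a_bryan, a_justin) >= TAU)  # True: similar pattern, share a scheme
print(sim(a_bryan, a_avril) >= TAU)   # False: dissimilar, keep separate
```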
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_p1 = <r1, ..., rn, w1, ..., wn>
PAGE 126
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_avril = <r1, ..., rn, w1, ..., wn>
PAGE 127
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Sim(A_p1, A_avril) ≥ τ?
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)PAGE 128
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)PAGE 129
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
4 Kanye West
5 Drake
6 David Guetta
7 Ed Sheeran
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Origin Server: Decide partitions, place replicas, place master
PAGE 130
Transaction Processing
(Sivasubramanian et al., WWW 2005)
W[B]
PAGE 131
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Push updates
PAGE 132
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
R[B]
PAGE 133
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Push updates R[B]
PAGE 134
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)PAGE 135
M
M
Replica and Master Placement
(Sivasubramanian et al., WWW 2005)PAGE 136
Read Latency vs. Bandwidth
(Sivasubramanian et al., WWW 2005)PAGE 137
M
Read Latency vs. Bandwidth
(Sivasubramanian et al., WWW 2005)PAGE 138
M
Read Latency: 𝜶
Bandwidth: 𝜷
Master?
Master Partition Placement
(Sivasubramanian et al., WWW 2005)PAGE 139
Master Partition Placement
(Sivasubramanian et al., WWW 2005)PAGE 140
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 141
M
100% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 142
M
95% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 143
M
30% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 144
M
5% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)
Minimize: 𝛼·r + 𝛽·b
PAGE 145
M
0% Threshold
● How many replicas? Cost-based given requests
● Where to place replicas? Cost-based given requests
● How to propagate updates? Single-master, eventual consistency
Replication Decisions
(Sivasubramanian et al., WWW 2005) PAGE 146
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)
Read-heavy
Cache misses are very painful
PAGE 147
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 148
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 149
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 150
Predictive Caching
(Glasbergen et al., EDBT 2018)
Q1: Forward(Q1), Result(Q1)
Predict(Q2), Cache(Q2), then Q2 arrives
PAGE 151
Query Patterns (TPC-W)
(Glasbergen et al., EDBT 2018)PAGE 152
PAGE 153
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 154
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 155
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 156
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
1
1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 157
Q1
Q2 Q3
5 3
5
1
All executed 5 times
100% probability Q2 follows Q1
20% probability Q1 follows Q3
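The transition model on these slides can be sketched as a first-order Markov model over the query stream: count query-to-query transitions and divide by how often the preceding query executed. This is a simplification; Apollo's full model also tracks parameter mappings, and the stream below is an illustrative one, not the slides' exact trace.

```python
# Count query transitions in a stream and read off follow probabilities.
from collections import defaultdict

transitions = defaultdict(int)  # (prev, next) -> count
executions = defaultdict(int)   # query -> times executed

def observe(stream):
    for q in stream:
        executions[q] += 1
    for prev, nxt in zip(stream, stream[1:]):
        transitions[(prev, nxt)] += 1

def follow_probability(prev, nxt):
    # P(nxt follows prev) = transition count / executions of prev
    return transitions[(prev, nxt)] / executions[prev] if executions[prev] else 0.0

observe(["Q1", "Q2", "Q3"] * 5)  # Q1 -> Q2 -> Q3, five times back to back
print(follow_probability("Q1", "Q2"))  # 1.0: Q2 always follows Q1
print(follow_probability("Q3", "Q1"))  # 0.8 here; separate sessions would lower it
```

High-probability edges in this graph are exactly the queries worth predictively caching.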
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 158
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 159
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 160
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 161
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 162
Q1
Q2 Q3
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 163
Q1
Q2 Q3
Predictively Cache Q2
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 164
Q1
Q2 Q3
Predictively Cache Q3
Apollo Deployment
(Glasbergen et al., EDBT 2018)
A
PAGE 165
A
AA
A
A A
AR[B]
Apollo Deployment
(Glasbergen et al., EDBT 2018)PAGE 166
AR[B]
R[C]
Invalidations
(Glasbergen et al., EDBT 2018)
W[B]
PAGE 167
Invalidations
(Glasbergen et al., EDBT 2018)
Invalidations Limit Cache Effectiveness
PAGE 168
Session Semantics
(Glasbergen et al., EDBT 2018)
R[B]
W[B]
PAGE 169
Session Semantics
(Glasbergen et al., EDBT 2018)
Good fit for web data!
R[B]
PAGE 170
● How many replicas? Predictively, based on requests
● Where to place replicas? Client edge cache, predictively
● How to propagate updates? Cache updates with sessions
Replication Decisions
(Glasbergen et al., EDBT 2018) PAGE 171
PAGE 172
PAGE 173
PAGE 174
Replication for Availability
Failures are common
Data systems must remain available
PAGE 175
Replication for Availability
S1
C
S3 S2 S4 S5
DA
CA B
CDB
DB
CA
Tolerating r faults requires r + 1 replicas
B B B
PAGE 176
Lower Overhead
PAGE 177
Data: B1, B2
Replicas: B1, B2
XOR: B1 xor B2
B2 xor (B1 xor B2) = B1
Erasure Coding
Data: k partitions 1, 2, ..., k
Parity: r partitions k + 1, ..., k + r
Tolerating r faults requires (k + r)/k space
PAGE 178
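The XOR identity above is the whole trick for the simplest code (k = 2, r = 1): store B1, B2, and their XOR, and any single lost block is recoverable from the other two. The byte strings below are arbitrary example data.

```python
# Minimal XOR erasure code: k = 2 data blocks, r = 1 parity block.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

b1, b2 = b"hello world!", b"data blocks!"
parity = xor_bytes(b1, b2)        # the single parity partition

# Lose b1: recover it from b2 and the parity block.
recovered = xor_bytes(b2, parity)  # b2 xor (b1 xor b2) = b1
print(recovered == b1)             # True

# Space overhead: (k + r) / k = 3/2 the data size, vs. 2x for full replication.
```

Real systems use Reed-Solomon codes for general k and r, but the storage accounting is the same (k + r)/k ratio.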
Erasure Coding
S1 S3 S2 S4 S5
A1A2 A3 A4
W[ A ]: Encode and store (k = 2, r = 2)
PAGE 179
Erasure Coding
S1 S3 S2 S4 S5
A1A2 A3 A4
R[ A ]: Read and decode (k = 2, r = 2)
PAGE 180
Erasure Coding
Reduces storage overhead
Requires parallel retrieval
PAGE 181
Erasure Coded Storage
Where to place data, how to access data
EC-Store (Abebe et al., ICDCS 2018)
PAGE 182
EC-Store Data Access
(Abebe et al., ICDCS 2018)
S1 S3 S2 S4
A1 A2 A3, B1 B2 B3, C1
R[ A, B ]
R[ A, B ] Load aware
S5
R[ C ] (k = 2, r = 1)
PAGE 183
EC-Store Data Access
(Abebe et al., ICDCS 2018)
Access Strategy: Minimize cost of access
Cost of site access: load at site + I/O at site
PAGE 184
EC-Store Data Movement
(Abebe et al., ICDCS 2018)
S1 S3 S2 S4
A1 A2 A3, B1 B2 B3, C1
R[ A, B ] R[ C ]
R[ A, B ] Load aware
S5
A3
k = 2, r = 1
PAGE 185
EC-Store Data Movement
(Abebe et al., ICDCS 2018)
Move data to minimize cost of future accesses and balance system load
Model access patterns to predict future accesses
PAGE 186
● How many replicas? Fault tolerance requirements
● Where to place replicas? Dynamic movement, using access costs
● How to propagate updates? Synchronous updates
Replication Decisions
(Abebe et al., ICDCS 2018) PAGE 187
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 188
Adaptive Partitioning
PAGE 189
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Partitioning Decisions
PAGE 190
Adaptive Partitioning
● Iterative improvements
● Partitioning per request
● Considering the overall workload
○ Heuristics
PAGE 191
PAGE 192
PAGE 193
Physical Database Design
A B C D
1 2 4 2
2 4 6 8
3 6 7 5
PAGE 194
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8
Physical Database Design
PAGE 195
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8 2 4 6 3
Physical Database Design
PAGE 196
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8 2 4 6 3 3 6 7 10
Physical Database Design
PAGE 197
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
SELECT AVG(C) FROM R WHERE R.D > 5;
Physical Database Design
PAGE 198
1 2 4 8 2 4 6 3 3 6 7 10
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
SELECT AVG(C) FROM R WHERE R.D > 5;
AAA
...
...
CCC
DDD
Scan
Analytic Database Design
PAGE 199
SELECT AVG(C) FROM R WHERE R.D > 5;
AAA...
...
CCC
DDD
Need to know what to index upfront
Index on D
Analytic Database Design
PAGE 200
Adaptive Range Indexing
SELECT ... FROM R WHERE R.D > 5;
SELECT … FROM R WHERE R.D > 5 AND R.D < 10;
SELECT … FROM R WHERE R.D > 10 AND R.D < 20;
PAGE 201
Database Cracking
Column D
Copy
Cracked Column D
(Idreos et al., CIDR 2007)PAGE 202
Indexes via Partitioning
Cracked Column D
SELECT … WHERE R.D > 5
(Idreos et al., CIDR 2007)PAGE 203
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 204
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 205
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 206
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 207
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 208
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 209
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 210
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5 AND R.D < 10
Only need to consider these
Cracked Column D
PAGE 211
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5 AND R.D < 10
Cracked Column D
PAGE 212
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Only need to consider these
SELECT … WHERE R.D > 10 AND R.D < 20
Cracked Column D
PAGE 213
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 10 AND R.D < 20
Cracked Column D
Iterative Partitioning for Indexing
PAGE 214
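A single cracking step, in the spirit of the slides above, is one in-place partitioning pass over the cracked column around the query's predicate value, like one quicksort partition step. This is a sketch of the idea from Idreos et al.; the cracker-index bookkeeping and the slides' exact column values are omitted.

```python
# One cracking step: partition col[lo:hi] in place around a pivot.
def crack(col, lo, hi, pivot):
    """Reorder col[lo:hi] so values <= pivot precede values > pivot.
    Returns the split point."""
    i, j = lo, hi - 1
    while i <= j:
        if col[i] <= pivot:
            i += 1
        else:
            col[i], col[j] = col[j], col[i]  # swap, as on the slides
            j -= 1
    return i

d = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3]   # illustrative column D
split = crack(d, 0, len(d), 5)           # first query: WHERE R.D > 5
print(d[:split], d[split:])              # all <= 5 left, all > 5 right
```

Later queries crack only the pieces their predicates touch, so the column becomes increasingly sorted exactly where the workload needs it.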
Cracked Column D
Advanced Cracking Methods
Distribution
(Idreos et al., CIDR 2007)
Cracking: Extensions
PAGE 215
● How to form partitions? Iteratively, based on queries
● Where to place partitions? Sorted in memory
● How to execute multi-partition operations? N/A
Partitioning Decisions
(Idreos et al., CIDR 2007) PAGE 216
Exploratory Workloads?
App Usage Time
PAGE 217
Exploratory Workloads?
Usage By Device
PAGE 218
Exploratory Workloads?
Revenue By Device
PAGE 219
Exploratory Workloads?
Devices By Country
No upfront information, need generic partitioning!
PAGE 220
Initial Partitioning (KD-Tree)
Depth Limits Division
64 MB
128 MB
256 MB
512 MB
(Shanbhag et al., SoCC 2017)PAGE 221
Heterogeneous Tree
(Shanbhag et al., SoCC 2017)
Contains more attributes!
PAGE 222
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 0.0, B: 0.0, C: 0.0, D: 0.0
PAGE 223
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.0, C: 0.0, D: 0.0
PAGE 224
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.0, D: 0.0
PAGE 225
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.5, D: 0.0
PAGE 226
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.5, D: 0.5
PAGE 227
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.75, C: 0.5, D: 0.5
PAGE 228
Q1 = σ(D ≤ 45)
Adaptive Partitioning
(Shanbhag et al., SoCC 2017)PAGE 229
Adaptive Partitioning
(Shanbhag et al., SoCC 2017)
Q2 = σ(A ≥ 125)
Refine partitioning per the workload!
PAGE 230
Adaptive Partitioning: When?
(Shanbhag et al., SoCC 2017)
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
PAGE 231
Swap Operation
(Shanbhag et al., SoCC 2017)
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
PAGE 232
Rewrite Tree
Push Up Operation
(Shanbhag et al., SoCC 2017)PAGE 233
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
Push up
Push Up Operation
(Shanbhag et al., SoCC 2017)PAGE 234
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
Logical Movement
Divide and Conquer
Q1 = σ(D ≤ 45)
Get Best Subtree
Get Best Subtree
(Shanbhag et al., SoCC 2017)
PAGE 235
● How to form partitions? Upfront then iteratively, based on queries
● Where to place partitions? Rely on HDFS
● How to execute multi-partition operations? Rely on HDFS
Partitioning Decisions
(Shanbhag et al., SoCC 2017) PAGE 236
PAGE 237
Exploiting Workloads
● Known ahead of time
● Parameterized
● Repetitive
PAGE 238
Exploiting Workloads - OLTP
PAGE 239
Warehouse
District
Customer
Partitioning OLTP
Write [ W1, D1, C1 ]
S1
W1C1
D1
S2
W2C2
D2
PAGE 240
Partitioning OLTP
S1
W1C1
D1
S2
W2C2
D2
Write [ W1, D1, C2 ]
prepare to commit
commit
PAGE 241
Two phase commit
Per transaction partitioning
Workload based repartitioning
Partitioning OLTP
G-Store(Das et al., SoCC 2010)
L-Store (Lin et al., SIGMOD 2016)
Later in the tutorial
PAGE 242
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
PAGE 243
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
PAGE 244
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
PAGE 245
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
Delete group
PAGE 246
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
Delete group
Free
PAGE 247
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
Joined
Propagate
Txn ops
Delete group
FreePAGE 248
Write[ W1, D1, C2 ]
On demand transactional partitioning
Key Grouping
Works best when groups are small and transactions contain multiple operations
(Das et al., SoCC 2010)
But groups are transient
PAGE 249
Localizing Execution
(Lin et al., SIGMOD 2016)
Repartition data via localization for single site execution
Dynamic partitioning based on transaction patterns
PAGE 250
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Ownership information
D2 S2
C1 S1
C2 S2
W1 S1
W2 S2
D1 S1
PAGE 251
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Ownership information
D2 S2
C1 S1
C2 S2
W1 S1
W2 S2
D1 S1
PAGE 252
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Write [ W1, D1, C1 ]
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 253
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Owner request
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 254
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Transfer
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 255
Owner request
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2 D2
Transfer
Response
C2
Txn ops
C2
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S1
PAGE 256
Owner request
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2 D2C2
Txn ops
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S1
PAGE 257
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
Dynamic partitioning based on per transaction patterns
Does not consider workload overall
PAGE 258
● How to form partitions? Transaction localization
● Where to place partitions? At requester
● How to execute multi-partition operations? L-Store protocol
Partitioning Decisions
(Lin et al., SIGMOD 2016) PAGE 259
● How to form partitions? Key groups, temporarily
● Where to place partitions? Key group leader
● How to execute multi-partition operations? Key group protocol
Partitioning Decisions
(Das et al., SoCC 2010) PAGE 260
Localizing Transactions
A, B
W[A,B]
Commit
Commit Locally Without Synchronization!
PAGE 261
C, D
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
From a workload trace
(Curino et al., VLDB 2010)PAGE 262
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Add traced transactions: R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)PAGE 263
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Add node weights (size, load)
(Curino et al., VLDB 2010)PAGE 264
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 265
High Edge Cuts!
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 266
Imbalanced!
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 267
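The graph-construction step on the preceding slides can be sketched as follows: tuples are nodes, and each traced transaction adds weighted edges between the tuples it co-accesses. A real system (Schism, Curino et al.) would hand this graph to a min-cut partitioner such as METIS; the sketch only builds the graph, with the trace taken from the slides.

```python
# Build a Schism-style co-access graph from a workload trace.
from collections import Counter
from itertools import combinations

edges = Counter()       # frozenset({u, v}) -> co-access weight
node_load = Counter()   # tuple -> access count (node weight)

def add_transaction(tuples):
    for t in set(tuples):
        node_load[t] += 1
    for u, v in combinations(sorted(set(tuples)), 2):
        edges[frozenset((u, v))] += 1

# Trace from the slides: one R[A,B] and three W[A,C]
add_transaction(["A", "B"])
for _ in range(3):
    add_transaction(["A", "C"])

print(edges[frozenset(("A", "B"))])  # 1: cutting A-B breaks one transaction
print(edges[frozenset(("A", "C"))])  # 3: cutting A-C is three times worse
print(node_load["A"])                # 4: A's load, used for balance constraints
```

Min-cut over these edge weights minimizes distributed transactions, while the node weights feed the balance constraint.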
Adding Replica Support
R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)PAGE 268
Adding Replica Support
R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)
Holistic Partitioning/Replication
Offline and Periodic
PAGE 269
Access Patterns Change!
(Nicoara et al., EDBT 2015)
W[A,B], 3x W[A,C], W[A,D], W[C,D] 3x R[B,C]
PAGE 270
W(P1)=4, W(P2)=10, EC=8
Two Phases
(Nicoara et al., EDBT 2015)
Phase 1
PAGE 271
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
Rule:
- Movement doesn’t overload
- Move best-gain candidates
- If overloaded, must move!
PAGE 272
Logical Movement, then Migrate
Two Phases
(Nicoara et al., EDBT 2015)
Already Overloaded
Phase 1
W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)
PAGE 273
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)
Gain=0
PAGE 274
Two Phases
(Nicoara et al., EDBT 2015)
Already Overloaded
Phase 1
W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)
PAGE 275
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)
Gain=-1
PAGE 276
Convergence
(Nicoara et al., EDBT 2015)
W(P1)=6, W(P2)=8, EC=9, Bounds: (6,8)
PAGE 277
Stable
● How to form partitions? Graph partitioning
● Where to place partitions? Based on partitioning
● How to execute multi-partition operations? 2PC
Partitioning Decisions
PAGE 278(Curino et al., VLDB 2010) (Nicoara et al., EDBT 2015)
Graph Partitioning
X
A
B C
D
Minimizes total number of distributed transactions
Ignores per node involvement
PAGE 279
Balance load and minimize distributed transactions
Adaptive Database Partitioning
PAGE 280
Database elasticity
Clay (Serafini et al., VLDB 2016)
P-Store (Taft et al., SIGMOD 2018)
E-Store (Taft et al., VLDB 2014)
Considering Distributed Cost
(Serafini et al., VLDB 2016)
General Graph Partitioning: minimize # edge cuts
such that: load(Si) < (1 + ε) avg load( S )  (load balanced)
General: load(Si) = ∑ w(v)  (v at Si)
Clay: load(Si) = ∑ w(v) + k ∑ w( uv )  (v at Si, u not at Si)  (distributed cost)
PAGE 281
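The two load definitions above differ in one term, which the sketch below makes concrete: vertices carry access weights w(v), and Clay additionally charges k times the weight of every edge crossing the partition boundary (the distributed-transaction cost). The weights and the k value are illustrative.

```python
# Compare the general and Clay load definitions for one site.
def load_general(site_vertices, w):
    # load(Si) = sum of w(v) for v at Si
    return sum(w[v] for v in site_vertices)

def load_clay(site_vertices, w, edge_w, k):
    # load(Si) = sum w(v) + k * sum w(uv) over edges crossing the boundary
    local = sum(w[v] for v in site_vertices)
    cross = sum(weight for (u, v), weight in edge_w.items()
                if (u in site_vertices) != (v in site_vertices))
    return local + k * cross

w = {"A": 5, "B": 3, "C": 2, "D": 4}
edge_w = {("A", "B"): 2, ("A", "C"): 6, ("B", "D"): 1}
s1 = {"A", "B"}   # C and D live on other sites

print(load_general(s1, w))            # 8
print(load_clay(s1, w, edge_w, k=2))  # 8 + 2 * (6 + 1) = 22
```

Under Clay's definition, moving C's clump onto S1 can lower S1's load even though it adds local work, because it removes the heavy crossing edge.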
Repartitioning Cost
(Serafini et al., VLDB 2016)
General Graph Partitioning: minimize # edge cuts
Clay: minimize # edge cuts and cost of repartitioning
(# of vertices mapped to new partitions)
PAGE 282
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
Low Med High
Clumping
PAGE 283
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G
Clumping
PAGE 284
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Form clump
Clumping
PAGE 285
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Migrate clump
Increases cost
Clumping
PAGE 286
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G
Clumping
PAGE 287
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Expand clump
Clumping
PAGE 288
(Serafini et al., VLDB 2016)
EF
D
C
B
A
S1
S2
S3
GMigrate clump
Clumping
PAGE 289
(Serafini et al., VLDB 2016)
FE
D
C
B
A
S1
S2
S3
GExpand clump
Clumping
PAGE 290
(Serafini et al., VLDB 2016)
FE
D C
B
A
S1
S2
S3
GMigrate clump
Clumping
PAGE 291
(Serafini et al., VLDB 2016)
FE
D
C
B
A
S1
S2
S3
G
Clumping
PAGE 292
Termination
Clumping
(Serafini et al., VLDB 2016)
Expands clumps to frequently accessed neighbours
Consider moving clump to lightly loaded sites
Considers both re-partitioning and load costs
PAGE 293
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
Low Med High
PAGE 294
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
PAGE 295
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
S4
PAGE 296
Elasticity
(Taft et al., VLDB 2015)
Repartition to elastically add or remove nodes
PAGE 297
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
S4
PAGE 298
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S4
PAGE 299
Elasticity Decisions
(Taft et al., VLDB 2015)
When the average load:
increases: add nodes
decreases: remove nodes
PAGE 300
Two Tier Data Placement
(Taft et al., VLDB 2015)
Identify hot data
Evenly distribute hot data
Distribute cold data over remaining capacity
PAGE 301
Identifying Hot Data
(Taft et al., VLDB 2015)
Monitor partition-level access frequency
If a partition is hot, enable tuple-level monitoring
Reacts to changes in load
PAGE 302
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 303
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 304
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 305
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 306
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 307
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 308
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 309
Ideal Elasticity
(Taft et al., SIGMOD 2018)
Load
Capacity
Time
predict the function
PAGE 310
Periodic Workloads
(Taft et al., SIGMOD 2018)
Daily load variations
Seasonal load spikes
PAGE 311
How to Predict Load
(Taft et al., SIGMOD 2018)
load(t) = avg_load( t - p_i ) + change_in_load( t - j_i )
Load = Periodicity + Trend
SPAR: Sparse Periodic Auto-Regression
PAGE 312
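The decomposition above can be sketched as follows: predicted load at time t combines the average load at the same phase of past periods with the recent change in load. This is a hedged simplification of SPAR (the real model fits regression coefficients); the window sizes and the load history below are illustrative.

```python
# SPAR-style load prediction: periodicity term + trend term.
def predict_load(history, t, period, trend_lag=1):
    # Periodicity: average load at t - period, t - 2*period, ...
    phases = [history[t - k * period] for k in range(1, 4) if t - k * period >= 0]
    periodic = sum(phases) / len(phases)
    # Trend: recent change in load
    trend = history[t - 1] - history[t - 1 - trend_lag]
    return periodic + trend

# Two and a half "days" of load samples with a period of 4 ticks
history = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24]
print(predict_load(history, t=10, period=4))  # (32 + 30) / 2 + (24 - 14) = 41.0
```

The predicted load then feeds the node-count formula on the next slide: # of nodes = predicted load / load per server.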
decide the # of nodes
Ideal Elasticity
(Taft et al., SIGMOD 2018)
Load
Capacity
Time
PAGE 313
Number of Nodes
(Taft et al., SIGMOD 2018)
# of nodes = Predicted Load / Load per Server
Assuming partitionable
PAGE 314
(Serafini et al., VLDB 2016)
● How to form partitions? Heuristically (Clumping versus 2 Tier)
● Where to place partitions? React or predict based on load
● How to execute multi-partition operations? 2PC
Partitioning Decisions
(Taft et al., VLDB 2015)(Taft et al., SIGMOD 2018) PAGE 315
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 316
Outlook
PAGE 317
How to make a partitioning or replication decision when access patterns change?
Adaptive Systems
Adaptively replicate and partition
PAGE 318
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Partitioning Decisions
Iterative, Temporarily, Graph partitioning, Heuristic
Sorted, Leader, At requester, Graph partitioning, Reactively, Predictively
Novel protocols, 2PC
PAGE 319
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
Decentralized, Client workload, Cost-based, Predictive, Fault tolerance
At requester, Heuristic, Cost-based, Predictive, Dynamic
Synchronous, Quorums, Single-master, Cache
PAGE 320
● How many replicas?
● Where to place replicas?
● Where to place partitions?
Decisions Predictively
Predictively
Predictively
PAGE 321
How to make a partitioning or replication decision when access patterns change?
Adaptive & Predictive Systems
Adaptively and predictively replicate and partition
PAGE 322
Predicting the Future
How can your system predict its future workload?
Apollo: Predict future queries (Markov Model)
P-Store: Predict future load (SPAR)
PAGE 323
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Pre-process: remove parameters, creating templates
SELECT * FROM C WHERE id = “C1”
SELECT * FROM C WHERE id = $
PAGE 324
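The pre-processing step above can be sketched with a few substitutions: strip constants out of the query text so queries differing only in parameters map to one template. The regexes below are a simplification of QB5000's real templatizer, which understands SQL syntax.

```python
# Templatize queries by replacing literal constants with a placeholder.
import re

def templatize(sql):
    sql = re.sub(r"'[^']*'", "$", sql)          # single-quoted string literals
    sql = re.sub(r'"[^"]*"', "$", sql)          # double-quoted literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "$", sql)  # numeric literals
    return sql

q1 = 'SELECT * FROM C WHERE id = "C1"'
q2 = 'SELECT * FROM C WHERE id = "C7"'
print(templatize(q1))                    # SELECT * FROM C WHERE id = $
print(templatize(q1) == templatize(q2))  # True: both map to one template
```

Each template's arrival-rate history is what the clustering and forecasting stages then operate on.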
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Cluster: group templates by arrival rate
T1 T2 T3
PAGE 325
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Forecast: predict clusters' arrival rate (ensemble of RNN, LR, KR)
PAGE 326
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Pre-process: remove parameters, creating templates
Cluster: group templates by arrival rate
Forecast: predict clusters' arrival rate (ensemble of RNN, LR, KR)
PAGE 327
Predicting the Future
How can your system predict its future workload?
Apollo: Predict future queries (Markov Model)
P-Store: Predict future load (SPAR)
QB5000: Predict query workloads (Ensemble of RNN, LR, KR)
PAGE 328
If your system knew the future workload, how could it partition and replicate data?
Predicting the Future
How can your system predict its future workload?
PAGE 329