TRANSCRIPT
Adaptive Replication and Partitioning
in Data Systems
Brad Glasbergen, Michael Abebe,
Khuzaima Daudjee
Middleware 2018
Data Systems Group
PAGE 2
PAGE 3
PAGE 4
PAGE 5
S1
A
Single Node Architecture
B
Overloaded
PAGE 6
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 7
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 8
S1
A
Replicated Architecture
B
S2 S3
A B A B
Handle more requests PAGE 9
S1
A
Replicated Architecture
B
S2 S3
A B A B
Cost of coordination
PAGE 10
S1
A
Replicated Architecture
S2 S3
A B B
How many replicas?
PAGE 11
S1
A
Replicated Architecture
S2 S3
A B B
W[ A ] W[ A ]
PAGE 12
S1
A
Replicated Architecture
S2 S3
AB B
W[ A ] W[ A ]
Where to place replicas?
PAGE 13
S1
A
Replicated Architecture
S2 S3
AB B
How to propagate updates?
W[ A ]
PAGE 14
● (A)synchronous
● Consistency
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
PAGE 15
Single Node Architecture
How to scale beyond a single node?
Replicate and partition
PAGE 16
Partitioned Architecture
Distributes requests S1
A
S2
B
PAGE 17
Partitioned Architecture
S1 S2
BA
PAGE 18
Partitioned Architecture
How to form partitions? S1
A1
S2
BA2
PAGE 19
Partitioned Architecture
Where to place partitions? S1
A1
S2
BA2
PAGE 20
Partitioned Architecture
How to execute multi-partition operations?
S1
A
S2
B
PAGE 21
Partitioning Decisions
PAGE 22
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Where to place partitions?
S1
A1
S3
BA2
PAGE 23
How to make a partitioning or replication decision when access patterns change?
Static Decisions
Why do access patterns change?
PAGE 24
Why do accesses change?
Humans have follow-the-sun cycles
PAGE 25
Why do accesses change?
Load bursts
PAGE 26
Why do accesses change?
Shifting hot-spots
PAGE 27
How to make a partitioning or replication decision when access patterns change?
Static Decisions
Adaptively replicate and partition
PAGE 28
Where to place partitions?
S1
A1
S2
BA2
PAGE 29
Where to place partitions?
S1
A1
S2
A2B
PAGE 30
How many replicas?
S1
A1
S2
A2B
R[ B ]
PAGE 31
How many replicas?
S1
A1
S2
A2B
R[ B ]
B
PAGE 32
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 33
Adaptive Replication
PAGE 34
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
PAGE 35
Adaptive Replication
● Decentralized
● Geo-Distributed
● Caching
● Availability
PAGE 36
S2
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 37
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
S2 S1 S3
PAGE 38
Adaptive Replication (ADR)
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 39
S2
Local Read
(Wolfson et al., TODS 1997)
A
R[A] S1 S3
PAGE 40
Remote Read
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 41
R[A] S2
Remote Read
(Wolfson et al., TODS 1997)
A
S1 S3
PAGE 42
R[A] S2
Remote Read
(Wolfson et al., TODS 1997)
A
S1 S3
More Messages!
PAGE 43
R[A] S2
Replication for Reads
(Wolfson et al., TODS 1997)
A
S1 S3
A’
PAGE 44
W[A] S2
No Free Lunch
(Wolfson et al., TODS 1997)
A
S1 S3
A’
PAGE 45
W[A] S2
No Free Lunch
(Wolfson et al., TODS 1997)
A
S1 S3
A’
Reduces read cost, increases write cost
PAGE 46
When to Replicate?
(Wolfson et al., TODS 1997)
A
S2 S1 S3
PAGE 47
S2
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 15
Writes: 5
Writes: 5
S1 S3
PAGE 48
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 0
Writes: 5
Writes: 5
A’
Writes: 10
S2 S1 S3
PAGE 49
When to Replicate?
(Wolfson et al., TODS 1997)
A
Reads: 0
Writes: 5
Writes: 5
A’
Writes: 10
5 Fewer Messages!
S2 S1 S3
PAGE 50
When to Stop Replicating?
(Wolfson et al., TODS 1997)
AA’
S2 S1 S3
PAGE 51
S2
When to Stop Replicating?
(Wolfson et al., TODS 1997)
AA’
Writes: 20, Reads: 5
Reads: 5
S1 S3
PAGE 52
S2
When to Stop Replicating?
(Wolfson et al., TODS 1997)
A
Writes: 20, Reads: 5
Reads: 5
10 Fewer Messages!
S1 S3
PAGE 53
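The expansion/contraction test on the preceding slides can be sketched as follows. This is a hypothetical simplification of the ADR rule (Wolfson et al., TODS 1997): compare message counts saved by a local replica against the write-propagation messages it would cost; the counter values are the ones from the slides.

```python
# Simplified ADR-style replication test, comparing message counts.
def should_replicate(remote_reads, incoming_writes):
    # Expansion: a local copy saves one message per remote read but
    # costs one message per propagated write.
    return remote_reads > incoming_writes

def should_drop(local_reads, propagated_writes):
    # Contraction: drop the copy when propagating writes to it costs
    # more messages than the reads it serves locally.
    return propagated_writes > local_reads

# Numbers from the slides: S2 sees 15 reads vs. 5 + 5 writes
print(should_replicate(remote_reads=15, incoming_writes=10))  # True: 5 fewer messages
print(should_drop(local_reads=10, propagated_writes=20))      # True: 10 fewer messages
```

Because each site evaluates these counters on its own traffic, the decision stays decentralized and re-runs as the workload shifts.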
Decentralized Decisions
(Wolfson et al., TODS 1997)
A
S2 S1 S3
Replicate?
PAGE 54
Adapts to Changing Workloads
(Wolfson et al., TODS 1997)
A
Replicate?
S2 S1 S3
More Reads
A’
PAGE 55
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 56
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
PAGE 57
S2
Extensions (Network Topology)
(Wolfson et al., TODS 1997)
S1 S3
S4 S5 S6
Form Tree
PAGE 58
Extensions (Blocks)
(Wolfson et al., TODS 1997)
S2 S1 S3
A
B
C
⅓ R[A] + ⅓ R[B] + ⅓ R[C] ⅓ W[A] + ⅓ W[B] + ⅓ W[C]
A’
B’
C’
PAGE 59
Peer-to-Peer File Systems
PAGE 60
Hash-Based P2P File System
h(s) = 275
PAGE 61
Hash-Based P2P File System
PAGE 62
Hash-Based P2P File System
PAGE 63
Hash-Based P2P File System
PAGE 64
Hash-Based P2P FS
Balanced load, but poor access locality!
PAGE 65
Hierarchical P2P File Systems
PAGE 66
Hierarchical P2P File Systems
PAGE 67
Hierarchical P2P File Systems
PAGE 68
Hierarchical P2P File Systems
PAGE 69
Hierarchical P2P File System
Good locality, but poor balance
PAGE 70
(Gopalakrishnan et al., ICDCS’04)
Overloaded
Moderate Load
Light Load
Locality and Load Balance
PAGE 71
(Gopalakrishnan et al., ICDCS’04)
Overloaded
Moderate Load
Light Load
Locality and Load Balance
PAGE 72
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 73
No replication
(Gopalakrishnan et al., ICDCS’04)
Replicate to balance load!
Locality and Load Balance
PAGE 74
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 75
Locality and Load Balance
(Gopalakrishnan et al., ICDCS’04)
Replicate to balance load!
PAGE 76
(Gopalakrishnan et al., ICDCS’04)
Locality and Load Balance
PAGE 77
Decentralized Replicas
(Gopalakrishnan et al., ICDCS’04)
Writes?
PAGE 78
● How many replicas? Decentralized decision
● Where to place replicas? At the requester
● How to propagate updates? ADR: synchronous; P2P: read-only
Replication Decisions
(Gopalakrishnan et al., ICDCS’04) (Wolfson et al., TODS 1997)
PAGE 79
PAGE 80
Global Scale Replication
Average Latency: 31 ms
PAGE 81
PAGE 82
Global Scale Replication
Average Latency: 113 ms
PAGE 83
PAGE 84
Global Scale Replication
Average Latency: 95 ms
PAGE 85
Global Scale Replication
Average Latency: 64 ms
PAGE 86
Global Scale Replication
Place data around the world
Take me to your leader! (Sharov et al., VLDB 2015)
Minimize cost of access
GPlacer (Zakhary et al., EDBT 2018)
PAGE 87
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
A B
S2
A B
S3
A B
Replicas
PAGE 88
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
a b
S2
A B
S3
A B
Read-Write Replicas
Read Replicas
State changes
PAGE 89
Data Distribution Model
(Sharov et al., VLDB 2015)
S1
A B
S4
a b
S2
A B
S3
A B
Read-Write Replicas
Read Replicas
Leader coordinates state changes
PAGE 90
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 91
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 92
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
PAGE 93
Write transaction cost
(Sharov et al., VLDB 2015)
S1: A B, S2: A B, S3: A B (read-write), S4: a b (read)
Send write to leader: RTT( S3, S1 )
+ quorum writes: median RTT( S1, {S1, S2, S3} )
PAGE 94
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
Client cost: RTT(client, replica) + cost( transaction )
PAGE 95
Weighting Client Cost
(Sharov et al., VLDB 2015)PAGE 96
Weighting Client Cost
(Sharov et al., VLDB 2015)PAGE 97
2 writes2 reads
10 writes20 reads
5 writes5 reads
Assign Leader
(Sharov et al., VLDB 2015)
Leader: site that minimizes access costs
Client cost: RTT(client, replica) + cost( transaction )
Cost: Weighted average of client costs
PAGE 98
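The leader rule above can be sketched as a small search: try each site as leader and pick the one minimizing the weighted average client cost, where a client's cost is its RTT to the leader plus the quorum round. This is an illustrative simplification of Sharov et al.; the RTT matrix and client weights below are made-up numbers.

```python
# Hypothetical leader assignment: minimize weighted average client cost.
import statistics

sites = ["S1", "S2", "S3"]
rtt = {("S1", "S1"): 0, ("S1", "S2"): 30, ("S1", "S3"): 50,
       ("S2", "S1"): 30, ("S2", "S2"): 0, ("S2", "S3"): 40,
       ("S3", "S1"): 50, ("S3", "S2"): 40, ("S3", "S3"): 0}
clients = {"S1": 4, "S2": 30, "S3": 10}  # client site -> transaction count (weight)

def write_cost(client, leader):
    # Send the write to the leader, then the leader runs a quorum round:
    # median RTT from the leader to the read-write replicas.
    quorum = statistics.median(rtt[(leader, s)] for s in sites)
    return rtt[(client, leader)] + quorum

def pick_leader():
    def avg_cost(leader):
        total = sum(w * write_cost(c, leader) for c, w in clients.items())
        return total / sum(clients.values())
    return min(sites, key=avg_cost)

print(pick_leader())  # S2: closest to the heavy client population
```

Weighting by transaction counts is what pulls the leader toward the busiest region, matching the "2 writes / 20 reads / ..." weighting on the slide.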
Global Replication Problem
(Sharov et al., VLDB 2015)
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
PAGE 99
Assign Replica Roles
Leader: minimizes median RTT to read-write replicas
(Sharov et al., VLDB 2015)
Read-write replicas:
PAGE 100
Write transaction cost
(Sharov et al., VLDB 2015)
S1: A B, S2: A B, S3: A B (read-write), S4: a b (read)
Send write to leader: RTT( S3, S1 )
+ quorum writes: median RTT( S1, {S1, S2, S3} )
PAGE 101
Assign Replica Roles
Leader: minimizes median RTT to read-write replicas
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
PAGE 102
● Select replicas
● Assign replica roles (read or read-write)
● Assign leader
Global Replication Problem
(Sharov et al., VLDB 2015)PAGE 103
Replica selection
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
Read replicas:
Leader: minimizes median RTT to read-write replicas
PAGE 104
Replica selection
(Sharov et al., VLDB 2015)PAGE 105
Client cost: RTT(client, replica) + cost( transaction )
Replica selection
(Sharov et al., VLDB 2015)PAGE 106
Client cost: RTT(client, replica) + cost( transaction )
Read replicas: Lowest RTT to clients
Replica selection
(Sharov et al., VLDB 2015)
Read-write replicas: Lowest RTT to leader
Read replicas: Lowest RTT to clients
Leader: minimizes median RTT to read-write replicas
PAGE 107
K-Means Replica selection
(Sharov et al., VLDB 2015)PAGE 108
K-Means Replica selection
(Sharov et al., VLDB 2015)
Average Latency: 55 ms
PAGE 109
K-Means Replica selection
Select replicas
(Sharov et al., VLDB 2015)
Assign leader and read-write replicas
PAGE 110
Leaderless Protocols
(Zakhary et al., EDBT 2018)
S1
A B
S4
a b
S2
A B
S3
A B
Any quorum member can coordinate
PAGE 111
Hinted Handoff
(Zakhary et al., EDBT 2018)
PAGE 112
Hinted Handoff
295 < 301
(Zakhary et al., EDBT 2018)PAGE 113
Hinted Handoff
(Zakhary et al., EDBT 2018)
Hand off request from S1 to S2 if:
cost( S1 ) > RTT( S1, S2) + cost( S2 )
cost( S ) = cost of executing request at S
PAGE 114
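The handoff rule above is a one-line comparison; the sketch below just makes it executable, with the 301/136/159 figures taken from the slides (any other cost values are illustrative).

```python
# GPlacer-style hinted handoff: S1 hands a request to S2 only if executing
# remotely, including the extra hop, is cheaper than executing locally.
def should_hand_off(cost_s1, rtt_s1_s2, cost_s2):
    # Hand off from S1 to S2 if: cost(S1) > RTT(S1, S2) + cost(S2)
    return cost_s1 > rtt_s1_s2 + cost_s2

print(should_hand_off(301, 136, 159))  # True: 295 < 301, as on the slide
print(should_hand_off(200, 136, 159))  # False: local execution is cheaper
```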
● How many replicas? Centralized, given client workload
● Where to place replicas? Heuristic (clustering)
● How to propagate updates? Quorums / Leader-based (Sharov)
Replication Decisions
(Zakhary et al., EDBT 2018)(Sharov et al., VLDB 2015)PAGE 115
Intra-Region Latency
(Sivasubramanian et al., WWW 2005)
100+ ms
PAGE 116
Edge Nodes
(Sivasubramanian et al., WWW 2005)PAGE 117
GlobeDB
(Sivasubramanian et al., WWW 2005)
Supports static data. Dynamic data?
PAGE 118
Replication Granularity
(Sivasubramanian et al., WWW 2005)
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
PAGE 119
Replication Granularity
(Sivasubramanian et al., WWW 2005)
Per-Record? High overhead
PAGE 120
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Replication Granularity
(Sivasubramanian et al., WWW 2005)
Per-Table? Inflexible
PAGE 121
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
When would these be replicated together?
PAGE 122
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_bryan = <r1, ..., rn, w1, ..., wn>
PAGE 123
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
Sim(A_bryan, A_justin) ≥ τ?
PAGE 124
A_justin = <r1, ..., rn, w1, ..., wn>
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
Shared Replication Scheme
PAGE 125
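The grouping test above can be sketched as follows: each record carries a vector of per-edge-server read/write counts, and records whose vectors are similar enough (Sim ≥ τ) share one replication scheme. Cosine similarity is used here as an illustrative stand-in for the paper's metric, and the vectors are made-up counts.

```python
# GlobeDB-style access clustering: group records with similar access vectors.
import math

def sim(a, b):
    # Cosine similarity between two access vectors (assumed metric).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

TAU = 0.9
# <r1..rn, w1..wn> reads/writes per edge server, for three records
a_bryan = [10, 2, 0, 1, 0, 0]
a_justin = [9, 3, 0, 1, 0, 0]
a_avril = [0, 0, 8, 0, 0, 5]

print(sim(a_bryan, a_justin) >= TAU)  # True: similar pattern, share a scheme
print(sim(a_bryan, a_avril) >= TAU)   # False: dissimilar, keep separate
```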
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_p1 = <r1, ..., rn, w1, ..., wn>
PAGE 126
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)
A_avril = <r1, ..., rn, w1, ..., wn>
PAGE 127
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Sim(A_p1, A_avril) ≥ τ?
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)PAGE 128
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
Access-Driven Replicas
(Sivasubramanian et al., WWW 2005)PAGE 129
ID ARTIST
1 Bryan Adams
2 Justin Bieber
3 Avril Lavigne
4 Kanye West
5 Drake
6 David Guetta
7 Ed Sheeran
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Origin Server: Decide partitions, place replicas, place master
PAGE 130
Transaction Processing
(Sivasubramanian et al., WWW 2005)
W[B]
PAGE 131
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Push updates
PAGE 132
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
R[B]
PAGE 133
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)
Push updates R[B]
PAGE 134
M
Transaction Processing
(Sivasubramanian et al., WWW 2005)PAGE 135
M
M
Replica and Master Placement
(Sivasubramanian et al., WWW 2005)PAGE 136
Read Latency vs. Bandwidth
(Sivasubramanian et al., WWW 2005)PAGE 137
M
Read Latency vs. Bandwidth
(Sivasubramanian et al., WWW 2005)PAGE 138
M
Read Latency: 𝜶
Bandwidth: 𝜷
Master?
Master Partition Placement
(Sivasubramanian et al., WWW 2005)PAGE 139
Master Partition Placement
(Sivasubramanian et al., WWW 2005)PAGE 140
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 141
M
100% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 142
M
95% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 143
M
30% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)PAGE 144
M
5% Threshold
Placement Heuristic
(Sivasubramanian et al., WWW 2005)
Minimize: 𝛼·r + 𝛽·b
PAGE 145
M
0% Threshold
● How many replicas? Cost-based given requests
● Where to place replicas? Cost-based given requests
● How to propagate updates? Single-master, eventual consistency
Replication Decisions
(Sivasubramanian et al., WWW 2005) PAGE 146
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)
Read-heavy
Cache misses are very painful
PAGE 147
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 148
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 149
Web Workload Characteristics
(Glasbergen et al., EDBT 2018)PAGE 150
Predictive Caching
(Glasbergen et al., EDBT 2018)
Q1: Forward(Q1), Result(Q1)
Predict(Q2), Cache(Q2), then Q2 arrives
PAGE 151
Query Patterns (TPC-W)
(Glasbergen et al., EDBT 2018)PAGE 152
PAGE 153
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 154
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 155
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 156
Time
Q1 Q2 Q3 Q1 Q2 Q3
Q1
Q2 Q3
1 1
1
1
Building a Predictive Model
(Glasbergen et al., EDBT 2018)PAGE 157
Q1
Q2 Q3
5 3
5
1
All executed 5 times
100% probability Q2 follows Q1
20% probability Q1 follows Q3
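The transition model on these slides can be sketched as a first-order Markov model over the query stream: count query-to-query transitions and divide by how often the preceding query executed. This is a simplification; Apollo's full model also tracks parameter mappings, and the stream below is an illustrative one, not the slides' exact trace.

```python
# Count query transitions in a stream and read off follow probabilities.
from collections import defaultdict

transitions = defaultdict(int)  # (prev, next) -> count
executions = defaultdict(int)   # query -> times executed

def observe(stream):
    for q in stream:
        executions[q] += 1
    for prev, nxt in zip(stream, stream[1:]):
        transitions[(prev, nxt)] += 1

def follow_probability(prev, nxt):
    # P(nxt follows prev) = transition count / executions of prev
    return transitions[(prev, nxt)] / executions[prev] if executions[prev] else 0.0

observe(["Q1", "Q2", "Q3"] * 5)  # Q1 -> Q2 -> Q3, five times back to back
print(follow_probability("Q1", "Q2"))  # 1.0: Q2 always follows Q1
print(follow_probability("Q3", "Q1"))  # 0.8 here; separate sessions would lower it
```

High-probability edges in this graph are exactly the queries worth predictively caching.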
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 158
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 159
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 160
Finding Parameter Mappings
(Glasbergen et al., EDBT 2018)PAGE 161
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 162
Q1
Q2 Q3
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 163
Q1
Q2 Q3
Predictively Cache Q2
Predictive Caching
(Glasbergen et al., EDBT 2018)PAGE 164
Q1
Q2 Q3
Predictively Cache Q3
Apollo Deployment
(Glasbergen et al., EDBT 2018)
A
PAGE 165
A
AA
A
A A
AR[B]
Apollo Deployment
(Glasbergen et al., EDBT 2018)PAGE 166
AR[B]
R[C]
Invalidations
(Glasbergen et al., EDBT 2018)
W[B]
PAGE 167
Invalidations
(Glasbergen et al., EDBT 2018)
Invalidations Limit Cache Effectiveness
PAGE 168
Session Semantics
(Glasbergen et al., EDBT 2018)
R[B]
W[B]
PAGE 169
Session Semantics
(Glasbergen et al., EDBT 2018)
Good fit for web data!
R[B]
PAGE 170
● How many replicas? Predictively, based on requests
● Where to place replicas? Client edge cache, predictively
● How to propagate updates? Cache updates with sessions
Replication Decisions
(Glasbergen et al., EDBT 2018) PAGE 171
PAGE 172
PAGE 173
PAGE 174
Replication for Availability
Failures are common
Data systems must remain available
PAGE 175
Replication for Availability
S1
C
S3 S2 S4 S5
DA
CA B
CDB
DB
CA
Tolerating r faults requires r + 1 replicas
B B B
PAGE 176
Lower Overhead
PAGE 177
Data: B1, B2
Replicas: B1, B2
XOR: B1 xor B2
B2 xor (B1 xor B2) = B1
Erasure Coding
Data: k partitions 1, 2, ..., k
Parity: r partitions k + 1, ..., k + r
Tolerating r faults requires (k + r)/k space
PAGE 178
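The XOR identity above is the whole trick for the simplest code (k = 2, r = 1): store B1, B2, and their XOR, and any single lost block is recoverable from the other two. The byte strings below are arbitrary example data.

```python
# Minimal XOR erasure code: k = 2 data blocks, r = 1 parity block.
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

b1, b2 = b"hello world!", b"data blocks!"
parity = xor_bytes(b1, b2)        # the single parity partition

# Lose b1: recover it from b2 and the parity block.
recovered = xor_bytes(b2, parity)  # b2 xor (b1 xor b2) = b1
print(recovered == b1)             # True

# Space overhead: (k + r) / k = 3/2 the data size, vs. 2x for full replication.
```

Real systems use Reed-Solomon codes for general k and r, but the storage accounting is the same (k + r)/k ratio.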
Erasure Coding
S1 S3 S2 S4 S5
A1A2 A3 A4
W[ A ]: Encode and store (k = 2, r = 2)
PAGE 179
Erasure Coding
S1 S3 S2 S4 S5
A1A2 A3 A4
R[ A ]: Read and decode (k = 2, r = 2)
PAGE 180
Erasure Coding
Reduces storage overhead
Requires parallel retrieval
PAGE 181
Erasure Coded Storage
Where to place data, how to access data
EC-Store (Abebe et al., ICDCS 2018)
PAGE 182
EC-Store Data Access
(Abebe et al., ICDCS 2018)
S1 S3 S2 S4
A1 A2 A3, B1 B2 B3, C1
R[ A, B ]
R[ A, B ] Load aware
S5
R[ C ] (k = 2, r = 1)
PAGE 183
EC-Store Data Access
(Abebe et al., ICDCS 2018)
Access Strategy: Minimize cost of access
Cost of site access: load at site + I/O at site
PAGE 184
EC-Store Data Movement
(Abebe et al., ICDCS 2018)
S1 S3 S2 S4
A1 A2 A3, B1 B2 B3, C1
R[ A, B ] R[ C ]
R[ A, B ] Load aware
S5
A3
k = 2, r = 1
PAGE 185
EC-Store Data Movement
(Abebe et al., ICDCS 2018)
Move data to minimize cost of future accesses and balance system load
Model access patterns to predict future accesses
PAGE 186
● How many replicas? Fault tolerance requirements
● Where to place replicas? Dynamic movement, using access costs
● How to propagate updates? Synchronous updates
Replication Decisions
(Abebe et al., ICDCS 2018) PAGE 187
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 188
Adaptive Partitioning
PAGE 189
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Partitioning Decisions
PAGE 190
Adaptive Partitioning
● Iterative improvements
● Partitioning per request
● Considering the overall workload
○ Heuristics
PAGE 191
PAGE 192
PAGE 193
Physical Database Design
A B C D
1 2 4 2
2 4 6 8
3 6 7 5
PAGE 194
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8
Physical Database Design
PAGE 195
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8 2 4 6 3
Physical Database Design
PAGE 196
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
1 2 4 8 2 4 6 3 3 6 7 10
Physical Database Design
PAGE 197
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
SELECT AVG(C) FROM R WHERE R.D > 5;
Physical Database Design
PAGE 198
1 2 4 8 2 4 6 3 3 6 7 10
A B C D
1 2 4 8
2 4 6 3
3 6 7 10
SELECT AVG(C) FROM R WHERE R.D > 5;
AAA
...
...
CCC
DDD
Scan
Analytic Database Design
PAGE 199
SELECT AVG(C) FROM R WHERE R.D > 5;
AAA...
...
CCC
DDD
Need to know what to index upfront
Index on D
Analytic Database Design
PAGE 200
Adaptive Range Indexing
SELECT ... FROM R WHERE R.D > 5;
SELECT … FROM R WHERE R.D > 5 AND R.D < 10;
SELECT … FROM R WHERE R.D > 10 AND R.D < 20;
PAGE 201
Database Cracking
Column D
Copy
Cracked Column D
(Idreos et al., CIDR 2007)PAGE 202
Indexes via Partitioning
Cracked Column D
SELECT … WHERE R.D > 5
(Idreos et al., CIDR 2007)PAGE 203
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 204
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 205
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 206
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 207
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 208
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Swap
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 209
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5
Cracked Column D
PAGE 210
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5 AND R.D < 10
Only need to consider these
Cracked Column D
PAGE 211
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 5 AND R.D < 10
Cracked Column D
PAGE 212
Indexes via Partitioning
(Idreos et al., CIDR 2007)
Only need to consider these
SELECT … WHERE R.D > 10 AND R.D < 20
Cracked Column D
PAGE 213
Indexes via Partitioning
(Idreos et al., CIDR 2007)
SELECT … WHERE R.D > 10 AND R.D < 20
Cracked Column D
Iterative Partitioning for Indexing
PAGE 214
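A single cracking step, in the spirit of the slides above, is one in-place partitioning pass over the cracked column around the query's predicate value, like one quicksort partition step. This is a sketch of the idea from Idreos et al.; the cracker-index bookkeeping and the slides' exact column values are omitted.

```python
# One cracking step: partition col[lo:hi] in place around a pivot.
def crack(col, lo, hi, pivot):
    """Reorder col[lo:hi] so values <= pivot precede values > pivot.
    Returns the split point."""
    i, j = lo, hi - 1
    while i <= j:
        if col[i] <= pivot:
            i += 1
        else:
            col[i], col[j] = col[j], col[i]  # swap, as on the slides
            j -= 1
    return i

d = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3]   # illustrative column D
split = crack(d, 0, len(d), 5)           # first query: WHERE R.D > 5
print(d[:split], d[split:])              # all <= 5 left, all > 5 right
```

Later queries crack only the pieces their predicates touch, so the column becomes increasingly sorted exactly where the workload needs it.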
Cracked Column D
Advanced Cracking Methods
Distribution
(Idreos et al., CIDR 2007)
Cracking: Extensions
PAGE 215
● How to form partitions? Iteratively, based on queries
● Where to place partitions? Sorted in memory
● How to execute multi-partition operations? N/A
Partitioning Decisions
(Idreos et al., CIDR 2007) PAGE 216
Exploratory Workloads?
App Usage Time
PAGE 217
Exploratory Workloads?
Usage By Device
PAGE 218
Exploratory Workloads?
Revenue By Device
PAGE 219
Exploratory Workloads?
Devices By Country
No upfront information, need generic partitioning!
PAGE 220
Initial Partitioning (KD-Tree)
Depth Limits Division
64 MB
128 MB
256 MB
512 MB
(Shanbhag et al., SoCC 2017)PAGE 221
Heterogeneous Tree
(Shanbhag et al., SoCC 2017)
Contains more attributes!
PAGE 222
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 0.0, B: 0.0, C: 0.0, D: 0.0
PAGE 223
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.0, C: 0.0, D: 0.0
PAGE 224
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.0, D: 0.0
PAGE 225
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.5, D: 0.0
PAGE 226
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.5, C: 0.5, D: 0.5
PAGE 227
Building the Partitioning
(Shanbhag et al., SoCC 2017)
A: 1.0, B: 0.75, C: 0.5, D: 0.5
PAGE 228
Q1 = σ(D ≤ 45)
Adaptive Partitioning
(Shanbhag et al., SoCC 2017)PAGE 229
Adaptive Partitioning
(Shanbhag et al., SoCC 2017)
Q2 = σ(A ≥ 125)
Refine partitioning per the workload!
PAGE 230
Adaptive Partitioning: When?
(Shanbhag et al., SoCC 2017)
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
PAGE 231
Swap Operation
(Shanbhag et al., SoCC 2017)
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
PAGE 232
Rewrite Tree
Push Up Operation
(Shanbhag et al., SoCC 2017)PAGE 233
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
Push up
Push Up Operation
(Shanbhag et al., SoCC 2017)PAGE 234
Q1 = σ(D ≤ 45)
Workload: Q1, Q2, Q3, Q1, ...
Logical Movement
Divide and Conquer
Q1 = σ(D ≤ 45)
Get Best Subtree
Get Best Subtree
(Shanbhag et al., SoCC 2017)
PAGE 235
● How to form partitions? Upfront then iteratively, based on queries
● Where to place partitions? Rely on HDFS
● How to execute multi-partition operations? Rely on HDFS
Partitioning Decisions
(Shanbhag et al., SoCC 2017) PAGE 236
PAGE 237
Exploiting Workloads
● Known ahead of time
● Parameterized
● Repetitive
PAGE 238
Exploiting Workloads - OLTP
PAGE 239
Warehouse
District
Customer
Partitioning OLTP
Write [ W1, D1, C1 ]
S1
W1C1
D1
S2
W2C2
D2
PAGE 240
Partitioning OLTP
S1
W1C1
D1
S2
W2C2
D2
Write [ W1, D1, C2 ]
prepare to commit
commit
PAGE 241
Two phase commit
Per transaction partitioning
Workload based repartitioning
Partitioning OLTP
G-Store(Das et al., SoCC 2010)
L-Store (Lin et al., SIGMOD 2016)
Later in the tutorial
PAGE 242
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
PAGE 243
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
PAGE 244
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
PAGE 245
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
Delete group
PAGE 246
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
JoinedC2 Propagate
Txn ops
Delete group
Free
PAGE 247
Write[ W1, D1, C2 ]
Create group
Key Grouping
S1
C1
D1
S2
W2C2
D2
(Das et al., SoCC 2010)
W1
Join request
Joined
Propagate
Txn ops
Delete group
FreePAGE 248
Write[ W1, D1, C2 ]
On demand transactional partitioning
Key Grouping
Works best when groups are small and transactions contain multiple operations
(Das et al., SoCC 2010)
But groups are transient
PAGE 249
Localizing Execution
(Lin et al., SIGMOD 2016)
Repartition data via localization for single site execution
Dynamic partitioning based on transaction patterns
PAGE 250
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Ownership information
D2 S2
C1 S1
C2 S2
W1 S1
W2 S2
D1 S1
PAGE 251
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Ownership information
D2 S2
C1 S1
C2 S2
W1 S1
W2 S2
D1 S1
PAGE 252
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Write [ W1, D1, C1 ]
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 253
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Owner request
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 254
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2C2
D2
Transfer
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S2
PAGE 255
Owner request
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2 D2
Transfer
Response
C2
Txn ops
C2
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S1
PAGE 256
Owner request
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
S1
W1C1
D1
S2
W2 D2C2
Txn ops
W1 S1
W2 S2
D1 S1
D2 S2
C1 S1
C2 S1
PAGE 257
Write [ W1, D1, C2 ]
Localizing Execution
(Lin et al., SIGMOD 2016)
Dynamic partitioning based on per transaction patterns
Does not consider workload overall
PAGE 258
● How to form partitions? Transaction localization
● Where to place partitions? At requester
● How to execute multi-partition operations? L-Store protocol
Partitioning Decisions
(Lin et al., SIGMOD 2016) PAGE 259
● How to form partitions? Key groups, temporarily
● Where to place partitions? Key group leader
● How to execute multi-partition operations? Key group protocol
Partitioning Decisions
(Das et al., SoCC 2010) PAGE 260
Localizing Transactions
A, B
W[A,B]
Commit
Commit Locally Without Synchronization!
PAGE 261
C, D
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
From a workload trace
(Curino et al., VLDB 2010)PAGE 262
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Add traced transactions: R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)PAGE 263
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Add node weights (size, load)
(Curino et al., VLDB 2010)PAGE 264
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 265
High Edge Cuts!
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 266
Imbalanced!
Constructing the Graph
ID Name
A Alice
B Bob
C Carol
Min-cut edges subject to weight imbalance
k=2(Curino et al., VLDB 2010)
PAGE 267
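The graph-construction step on the preceding slides can be sketched as follows: tuples are nodes, and each traced transaction adds weighted edges between the tuples it co-accesses. A real system (Schism, Curino et al.) would hand this graph to a min-cut partitioner such as METIS; the sketch only builds the graph, with the trace taken from the slides.

```python
# Build a Schism-style co-access graph from a workload trace.
from collections import Counter
from itertools import combinations

edges = Counter()       # frozenset({u, v}) -> co-access weight
node_load = Counter()   # tuple -> access count (node weight)

def add_transaction(tuples):
    for t in set(tuples):
        node_load[t] += 1
    for u, v in combinations(sorted(set(tuples)), 2):
        edges[frozenset((u, v))] += 1

# Trace from the slides: one R[A,B] and three W[A,C]
add_transaction(["A", "B"])
for _ in range(3):
    add_transaction(["A", "C"])

print(edges[frozenset(("A", "B"))])  # 1: cutting A-B breaks one transaction
print(edges[frozenset(("A", "C"))])  # 3: cutting A-C is three times worse
print(node_load["A"])                # 4: A's load, used for balance constraints
```

Min-cut over these edge weights minimizes distributed transactions, while the node weights feed the balance constraint.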
Adding Replica Support
R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)PAGE 268
Adding Replica Support
R[A,B], 3x W[A,C]
(Curino et al., VLDB 2010)
Holistic Partitioning/Replication
Offline and Periodic
PAGE 269
Access Patterns Change!
(Nicoara et al., EDBT 2015)
W[A,B], 3x W[A,C], W[A,D], W[C,D] 3x R[B,C]
PAGE 270
W(P1)=4, W(P2)=10, EC=8
Two Phases
(Nicoara et al., EDBT 2015)
Phase 1
PAGE 271
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
Rule:
- Movement doesn’t overload
- Move best-gain candidates
- If overloaded, must move!
PAGE 272
Logical Movement, then Migrate
Two Phases
(Nicoara et al., EDBT 2015)
Already Overloaded
Phase 1
W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)
PAGE 273
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)
Gain=0
PAGE 274
Two Phases
(Nicoara et al., EDBT 2015)
Already Overloaded
Phase 1
W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)
PAGE 275
Two Phases
(Nicoara et al., EDBT 2015)
Phase 2
W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)
Gain=-1
PAGE 276
Convergence
(Nicoara et al., EDBT 2015)
W(P1)=6, W(P2)=8, EC=9, Bounds: (6,8)
PAGE 277
Stable
● How to form partitions? Graph partitioning
● Where to place partitions? Based on partitioning
● How to execute multi-partition operations? 2PC
Partitioning Decisions
PAGE 278(Curino et al., VLDB 2010) (Nicoara et al., EDBT 2015)
Graph Partitioning
X
A
B C
D
Minimizes total number of distributed transactions
Ignores per node involvement
PAGE 279
Balance load and minimize distributed transactions
Adaptive Database Partitioning
PAGE 280
Database elasticity
Clay (Serafini et al., VLDB 2016)
P-Store (Taft et al., SIGMOD 2018)
E-Store (Taft et al., VLDB 2014)
Considering Distributed Cost
(Serafini et al., VLDB 2016)
General Graph Partitioning: minimize # edge cuts
such that: load(Si) < (1 + ε) avg load( S )  (load balanced)
General: load(Si) = ∑ w(v)  (v at Si)
Clay: load(Si) = ∑ w(v) + k ∑ w( uv )  (v at Si, u not at Si)  (distributed cost)
PAGE 281
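The two load definitions above differ in one term, which the sketch below makes concrete: vertices carry access weights w(v), and Clay additionally charges k times the weight of every edge crossing the partition boundary (the distributed-transaction cost). The weights and the k value are illustrative.

```python
# Compare the general and Clay load definitions for one site.
def load_general(site_vertices, w):
    # load(Si) = sum of w(v) for v at Si
    return sum(w[v] for v in site_vertices)

def load_clay(site_vertices, w, edge_w, k):
    # load(Si) = sum w(v) + k * sum w(uv) over edges crossing the boundary
    local = sum(w[v] for v in site_vertices)
    cross = sum(weight for (u, v), weight in edge_w.items()
                if (u in site_vertices) != (v in site_vertices))
    return local + k * cross

w = {"A": 5, "B": 3, "C": 2, "D": 4}
edge_w = {("A", "B"): 2, ("A", "C"): 6, ("B", "D"): 1}
s1 = {"A", "B"}   # C and D live on other sites

print(load_general(s1, w))            # 8
print(load_clay(s1, w, edge_w, k=2))  # 8 + 2 * (6 + 1) = 22
```

Under Clay's definition, moving C's clump onto S1 can lower S1's load even though it adds local work, because it removes the heavy crossing edge.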
Repartitioning Cost
(Serafini et al., VLDB 2016)
General Graph Partitioning: minimize # edge cuts
Clay: minimize # edge cuts and cost of repartitioning
(# of vertices mapped to new partitions)
PAGE 282
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
Low Med High
Clumping
PAGE 283
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G
Clumping
PAGE 284
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Form clump
Clumping
PAGE 285
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Migrate clump
Increases cost
Clumping
PAGE 286
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G
Clumping
PAGE 287
(Serafini et al., VLDB 2016)
F
E
D
C
B
A
S1 S2
S3
G Expand clump
Clumping
PAGE 288
(Serafini et al., VLDB 2016)
EF
D
C
B
A
S1
S2
S3
GMigrate clump
Clumping
PAGE 289
(Serafini et al., VLDB 2016)
FE
D
C
B
A
S1
S2
S3
GExpand clump
Clumping
PAGE 290
(Serafini et al., VLDB 2016)
FE
D C
B
A
S1
S2
S3
GMigrate clump
Clumping
PAGE 291
(Serafini et al., VLDB 2016)
FE
D
C
B
A
S1
S2
S3
G
Clumping
PAGE 292
Termination
Clumping
(Serafini et al., VLDB 2016)
Expands clumps to frequently accessed neighbours
Consider moving clump to lightly loaded sites
Considers both re-partitioning and load costs
PAGE 293
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
Low Med High
PAGE 294
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
PAGE 295
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
S4
PAGE 296
Elasticity
(Taft et al., VLDB 2015)
Repartition to elastically add or remove nodes
PAGE 297
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S3
S4
PAGE 298
Elasticity
(Taft et al., VLDB 2015)
F
E
D
C
B
A
S1 S2
S4
PAGE 299
Elasticity Decisions
(Taft et al., VLDB 2015)
When the average load:
increases: add nodes
decreases: remove nodes
PAGE 300
Two Tier Data Placement
(Taft et al., VLDB 2015)
Identify hot data
Evenly distribute hot data
Distribute cold data over remaining capacity
PAGE 301
Identifying Hot Data
(Taft et al., VLDB 2015)
Monitor partition-level access frequency
If a partition is hot, enable tuple-level monitoring
Reacts to changes in load
PAGE 302
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 303
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 304
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 305
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 306
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 307
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 308
Reactive Elasticity
(Taft et al., VLDB 2015)
Load
Capacity
Time
PAGE 309
Ideal Elasticity
(Taft et al., SIGMOD 2018)
Load
Capacity
Time
predict the function
PAGE 310
Periodic Workloads
(Taft et al., SIGMOD 2018)
Daily load variations
Seasonal load spikes
PAGE 311
How to Predict Load
(Taft et al., SIGMOD 2018)
load(t) = avg_load( t - p_i ) + change_in_load( t - j_i )
Load = Periodicity + Trend
SPAR: Sparse Periodic Auto-Regression
PAGE 312
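The decomposition above can be sketched as follows: predicted load at time t combines the average load at the same phase of past periods with the recent change in load. This is a hedged simplification of SPAR (the real model fits regression coefficients); the window sizes and the load history below are illustrative.

```python
# SPAR-style load prediction: periodicity term + trend term.
def predict_load(history, t, period, trend_lag=1):
    # Periodicity: average load at t - period, t - 2*period, ...
    phases = [history[t - k * period] for k in range(1, 4) if t - k * period >= 0]
    periodic = sum(phases) / len(phases)
    # Trend: recent change in load
    trend = history[t - 1] - history[t - 1 - trend_lag]
    return periodic + trend

# Two and a half "days" of load samples with a period of 4 ticks
history = [10, 20, 30, 20, 12, 22, 32, 22, 14, 24]
print(predict_load(history, t=10, period=4))  # (32 + 30) / 2 + (24 - 14) = 41.0
```

The predicted load then feeds the node-count formula on the next slide: # of nodes = predicted load / load per server.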
decide the # of nodes
Ideal Elasticity
(Taft et al., SIGMOD 2018)
Load
Capacity
Time
PAGE 313
Number of Nodes
(Taft et al., SIGMOD 2018)
# of nodes = Predicted Load / Load per Server
Assuming partitionable
PAGE 314
(Serafini et al., VLDB 2016)
● How to form partitions? Heuristically (Clumping versus 2 Tier)
● Where to place partitions? React or predict based on load
● How to execute multi-partition operations? 2PC
Partitioning Decisions
(Taft et al., VLDB 2015)(Taft et al., SIGMOD 2018) PAGE 315
● Adaptive Replication
● Adaptive Partitioning
● Outlook
Road Map
PAGE 316
Outlook
PAGE 317
How to make a partitioning or replication decision when access patterns change?
Adaptive Systems
Adaptively replicate and partition
PAGE 318
● How to form partitions?
● Where to place partitions?
● How to execute multi-partition operations?
Partitioning Decisions
Iterative, Temporarily, Graph partitioning, Heuristic
Sorted, Leader, At requester, Graph partitioning, Reactively, Predictively
Novel protocols, 2PC
PAGE 319
● How many replicas?
● Where to place replicas?
● How to propagate updates?
Replication Decisions
Decentralized, Client workload, Cost-based, Predictive, Fault tolerance
At requester, Heuristic, Cost-based, Predictive, Dynamic
Synchronous, Quorums, Single-master, Cache
PAGE 320
● How many replicas?
● Where to place replicas?
● Where to place partitions?
Decisions Predictively
Predictively
Predictively
PAGE 321
How to make a partitioning or replication decision when access patterns change?
Adaptive & Predictive Systems
Adaptively and predictively replicate and partition
PAGE 322
Predicting the Future
How can your system predict its future workload?
Apollo: Predict future queries (Markov Model)
P-Store: Predict future load (SPAR)
PAGE 323
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Pre-process: remove parameters, creating templates
SELECT * FROM C WHERE id = “C1”
SELECT * FROM C WHERE id = $
PAGE 324
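The pre-processing step above can be sketched with a few substitutions: strip constants out of the query text so queries differing only in parameters map to one template. The regexes below are a simplification of QB5000's real templatizer, which understands SQL syntax.

```python
# Templatize queries by replacing literal constants with a placeholder.
import re

def templatize(sql):
    sql = re.sub(r"'[^']*'", "$", sql)          # single-quoted string literals
    sql = re.sub(r'"[^"]*"', "$", sql)          # double-quoted literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "$", sql)  # numeric literals
    return sql

q1 = 'SELECT * FROM C WHERE id = "C1"'
q2 = 'SELECT * FROM C WHERE id = "C7"'
print(templatize(q1))                    # SELECT * FROM C WHERE id = $
print(templatize(q1) == templatize(q2))  # True: both map to one template
```

Each template's arrival-rate history is what the clustering and forecasting stages then operate on.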
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Cluster: group templates by arrival rate
T1 T2 T3
PAGE 325
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Forecast: predict clusters' arrival rate (ensemble of RNN, LR, KR)
PAGE 326
Predicting the Future: QB5000
When, how many, and what queries will arrive?
(Ma et al., SIGMOD 2018)
Pre-process: remove parameters, creating templates
Cluster: group templates by arrival rate
Forecast: predict clusters' arrival rate (ensemble of RNN, LR, KR)
PAGE 327
Predicting the Future
How can your system predict its future workload?
Apollo: Predict future queries (Markov Model)
P-Store: Predict future load (SPAR)
QB5000: Predict query workloads (Ensemble of RNN, LR, KR)
PAGE 328
If your system knew the future workload, how could it partition and replicate data?
Predicting the Future
How can your system predict its future workload?
PAGE 329