Preventive Replication in Database Cluster
Esther Pacitti, Cedric Coulon, Patrick Valduriez, M. Tamer Özsu*
LINA / INRIA – Atlas Group, University of Nantes, France
* University of Waterloo, Canada
July 2005
Outline
Motivations
Cluster Architecture
Preventive Replication
Multi-Master
Partially Replicated configurations
Replication Manager Architecture
Optimizations
RepDB* Prototype
Experiments
Conclusions, Current and Future Work
Motivations
Applications and data are asynchronously replicated among a set of cluster nodes connected by a fast and reliable network, to improve response times for user requests
Use of lazy preventive replication to enforce data consistency
[Figure: external user requests sent to a cluster of n PC nodes]
Cluster Architecture
[Figure: cluster system architecture. Nodes 1 to n are connected by a fast network. Components shown: Request Router, Application Manager, Transaction Load Balancer, Replication Manager, DBMS, Current Load Monitor, Global User Directory, Global Data Placement.]
Preventive Replication (1)
Properties:
  Strong consistency
  Non-blocking
  Scale-up and speed-up
  High data availability
Preventive Replication (2)
Assumptions:
  The network interface provides FIFO reliable multicast
  Max is the upper bound on the time needed to multicast a message from a node i until it is received at a node j
  Clocks are ε-synchronized
  Each transaction has a timestamp value C (its arrival time)
Preventive Replication (3)
Consistency criterion: total order enforcement
  Transactions are received in the same order at all involved nodes, and this order corresponds to the execution order
  To enforce total order, transactions are chronologically ordered at each node using their delivery_time value:
  delivery_time = C + Max + ε
[Figure: T is received at node i and sent to node j; wait until delivery_time]
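The delivery_time rule can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not taken from RepDB*; all values are assumed to be in milliseconds, and the numbers in `main` are purely illustrative.

```java
// Sketch only: hypothetical names, not RepDB* code.
final class DeliveryTime {
    /**
     * delivery_time = C + Max + epsilon.
     * c: transaction timestamp (arrival time at the origin node)
     * max: upper bound on the multicast delay between two nodes
     * epsilon: clock synchronization precision
     */
    static long deliveryTime(long c, long max, long epsilon) {
        return c + max + epsilon;
    }

    /** True once the local clock has reached the transaction's delivery time. */
    static boolean mayDeliver(long now, long c, long max, long epsilon) {
        return now >= deliveryTime(c, max, epsilon);
    }

    public static void main(String[] args) {
        // Illustrative values: C = 1000, Max = 20, epsilon = 5.
        System.out.println(deliveryTime(1000, 20, 5)); // prints 1025
    }
}
```

Because every node applies the same formula to the same timestamp C, all nodes agree on the relative order of any two transactions.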
Preventive Replication (4)
Whenever a node i receives T:
  Propagation: it multicasts T to all nodes, including itself
  Scheduling: at each node, T's delivery time expires only when T is the oldest pending transaction
  Execution: when T's delivery time expires, T is entirely executed
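The scheduling and execution steps can be sketched as a timestamp-ordered queue. This is a simplified illustration under stated assumptions, not the authors' implementation: `PreventiveScheduler`, `Txn`, and the method names are hypothetical, time is passed in explicitly as `now` instead of using timers, and ties on the timestamp are broken by transaction id.

```java
import java.util.PriorityQueue;

// Sketch only: hypothetical names, not RepDB* code.
final class PreventiveScheduler {
    /** A pending transaction: an id plus its timestamp C (arrival time at the origin node). */
    static final class Txn {
        final String id;
        final long c;
        Txn(String id, long c) { this.id = id; this.c = c; }
    }

    private final long max;      // upper bound on the multicast delay
    private final long epsilon;  // clock synchronization precision
    // Oldest timestamp first; ties are broken by id so every node uses the same order.
    private final PriorityQueue<Txn> pending = new PriorityQueue<>(
        (a, b) -> a.c != b.c ? Long.compare(a.c, b.c) : a.id.compareTo(b.id));

    PreventiveScheduler(long max, long epsilon) {
        this.max = max;
        this.epsilon = epsilon;
    }

    /** Called for every multicast transaction received, including the node's own. */
    void receive(Txn t) {
        pending.add(t);
    }

    /** Returns the oldest pending transaction if its delivery time has expired, else null. */
    Txn nextToExecute(long now) {
        Txn oldest = pending.peek();
        if (oldest != null && now >= oldest.c + max + epsilon) {
            return pending.poll();
        }
        return null;
    }
}
```

A node would call `receive` from its network listener and poll `nextToExecute` with the local clock value; a returned transaction is then executed entirely before the next one is considered.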
Partial Architecture
Preventive Replication (4)
[Figure: example configurations.
  Bowtie: copies R, S, r', s', r'', s''
  Fully replicated: copies R1, S1; R2, S2; R3, S3; R4, S4
  Partially replicated: copies R1, S1; R2; S2
  Partially replicated: copies R1, S; R2, s'; R3; s'']
Primary copies (R): can be updated only on the master node
Secondary copies (r): read-only
Multi-master copies (R1): can be updated on more than one node
Preventive Replication (5)
Introduces a Max + ε delay
  Negligible in cluster networks
  Critical in bursty workloads
Data placement restrictions
  Lazy-master, fully replicated
In fully replicated configurations
  Overhead of message exchanges
  Not all nodes may have enough space to store all replicas
=> Free data placement
Partially Replicated Configurations (1)
When data are not fully replicated, some transactions cannot be executed on target nodes
Example: UPDATE r SET c1 WHERE c2 IN (SELECT c3 FROM s);
[Figure: T1(R, S) submitted in a partially replicated configuration with nodes N1, N2, N3 and copies R1, S1, R2, S2]
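The problem can be made concrete with a small check: a node can run a transaction locally only if it holds a copy of every relation the transaction accesses. The sketch below is illustrative; the class name, method name, and the placement used in `main` (N1 holding R and S, N2 holding only R, N3 holding only S) are assumptions, not taken from the slides.

```java
import java.util.Set;

// Sketch only: hypothetical names and placement.
final class LocalExecution {
    /** True if the node holds a copy of every relation the transaction reads or writes. */
    static boolean canExecuteLocally(Set<String> heldRelations, Set<String> accessedRelations) {
        return heldRelations.containsAll(accessedRelations);
    }

    public static void main(String[] args) {
        // T1 writes R and reads S, as in the UPDATE example.
        Set<String> t1 = Set.of("R", "S");
        System.out.println("N1: " + canExecuteLocally(Set.of("R", "S"), t1));
        System.out.println("N2: " + canExecuteLocally(Set.of("R"), t1));
        System.out.println("N3: " + canExecuteLocally(Set.of("S"), t1));
    }
}
```

A node that holds R but not S cannot evaluate the subquery, so it has to wait for the result of the execution performed elsewhere.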
Partially Replicated Configurations (2)
On target nodes, T1 waits after its selection (Step 3)
At the end of the execution on the origin node, a Refresh Transaction (RT1) is multicast to target nodes (Step 4)
RT1 is executed to update replicated data
[Figure: five panels (Step 1 to Step 5) showing nodes N1, N2, N3 with copies R1, S1, R2, S2. Labels: Client, T1(rS, wR), Perform, Standby, RT1(wR), Answer T1.]
Data Placement
Tables must have a primary key
A node i cannot hold a primary copy of a table whose foreign keys reference other tables that are not held by node i
[Figure: example placement over N1, N2, N3 with copies ITEM, ORDER, and order. On N3, an order can be placed on an item that does not exist.]
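The foreign-key restriction can be checked mechanically for each node. The following is a minimal sketch under stated assumptions: `PlacementCheck` and `isValidNode` are hypothetical names, and the caller is assumed to supply the foreign-key map (table name to referenced tables).

```java
import java.util.Map;
import java.util.Set;

// Sketch only: hypothetical names, not RepDB* code.
final class PlacementCheck {
    /**
     * primaries: tables for which the node holds a primary (updatable) copy
     * held: all tables held by the node, whatever the kind of copy
     * references: table -> tables its foreign keys point to
     */
    static boolean isValidNode(Set<String> primaries, Set<String> held,
                               Map<String, Set<String>> references) {
        for (String table : primaries) {
            Set<String> referenced = references.getOrDefault(table, Set.of());
            if (!held.containsAll(referenced)) {
                return false; // an update could reference a row the node cannot see
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> refs = Map.of("ORDER", Set.of("ITEM"));
        System.out.println(isValidNode(Set.of("ORDER"), Set.of("ITEM", "ORDER"), refs));
        System.out.println(isValidNode(Set.of("ORDER"), Set.of("ORDER"), refs));
    }
}
```

The second call models a node that holds ORDER as a primary copy without ITEM, which is the situation the rule forbids.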
Replication Manager Architecture
Optimization: Eliminating delay times (1)
In a cluster network, messages are naturally totally ordered
Schedule a transaction in parallel with its execution
  Submit a transaction for execution as soon as it is received
  Schedule the commit order of the transactions: a transaction can be committed only after Max + ε
  Abort and re-execute all younger transactions when a transaction is received out of order
Concurrent execution of non-conflicting transactions
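The out-of-order rule can be sketched as a selection over the timestamps of transactions that have already started executing. This is an illustration only: the class and method names are hypothetical, transactions are reduced to their timestamps, and conflict detection between transactions is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical names, not RepDB* code.
final class OutOfOrder {
    /**
     * Returns the timestamps of already started transactions that are younger
     * than a newly received one and therefore have to be aborted and re-executed.
     */
    static List<Long> toAbort(List<Long> startedTimestamps, long receivedTimestamp) {
        List<Long> younger = new ArrayList<>();
        for (long ts : startedTimestamps) {
            if (ts > receivedTimestamp) {
                younger.add(ts);
            }
        }
        return younger;
    }

    /** A transaction with timestamp c may commit only once Max + epsilon has elapsed. */
    static boolean mayCommit(long now, long c, long max, long epsilon) {
        return now >= c + max + epsilon;
    }
}
```

If messages arrive in timestamp order, `toAbort` always returns an empty list and the optimization costs nothing.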
Optimization: Eliminating delay times (2)
[Figure: processing of T under preventive replication and under optimized preventive replication. Stages shown: Scheduling, Execution, Validation, Abort.]
Optimization: Example (3)
Optimization: Eliminating delay times (4)
Without the optimization, the refreshment time of a transaction T is always Max + ε + t, where t is the time spent executing T
With the optimization, the refreshment time of a transaction T is max(Max + ε, t)
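A short numeric sketch makes the difference concrete. The class and method names are hypothetical, and the values in `main` (Max = 20, ε = 5, t = 40, all in milliseconds) are illustrative and not measurements from the experiments.

```java
// Sketch only: hypothetical names and illustrative values.
final class RefreshmentTime {
    /** Without the optimization: wait Max + epsilon, then execute for t. */
    static long withoutOptimization(long max, long epsilon, long t) {
        return max + epsilon + t;
    }

    /** With the optimization: waiting and execution overlap. */
    static long withOptimization(long max, long epsilon, long t) {
        return Math.max(max + epsilon, t);
    }

    public static void main(String[] args) {
        System.out.println(withoutOptimization(20, 5, 40)); // prints 65
        System.out.println(withOptimization(20, 5, 40));    // prints 40
    }
}
```

When t exceeds Max + ε, the delay is hidden entirely behind the execution; when t is shorter, only the execution time is hidden.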
RepDB* Prototype: Architecture
[Figure: RepDB* architecture. Components shown: Clients, Replica Interface (JDBC server), Log Monitor (DBMS specific), Propagator, Receiver, Refresher, Deliver, Network, DBMS. Clients and the DBMS are reached through JDBC.]
RepDB* Prototype: Implementation
Java (around 10,000 lines)
DBMS is a black box
JDBC interface (RMI-JDBC)
Use of the Spread toolkit to manage the network (Center for Networking and Distributed Systems, CNDS)
Simulation version (SimJava)
http://www.sciences.univnantes.fr/ATLAS/RepDB
Replicas definition (1)
A file contains the replica placement specification:
<NODE name='node1'>
  <MASTER>R</MASTER>
  <MASTER>S</MASTER>
  <SLAVE>T</SLAVE>
</NODE>
<NODE name='node2'>
  <MASTER>R</MASTER>
  <SLAVE>S</SLAVE>
</NODE>
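A placement file of this shape can be read with the standard Java DOM parser. The sketch below is not the RepDB* loader: `PlacementParser` is a hypothetical name, and the `<PLACEMENT>` root element used in `main` is an assumption added because an XML document needs a single root.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch only: hypothetical names, not RepDB* code.
final class PlacementParser {
    /** Returns node name -> entries such as "MASTER:R" or "SLAVE:T", in document order. */
    static Map<String, List<String>> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> placement = new LinkedHashMap<>();
            NodeList nodes = doc.getElementsByTagName("NODE");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                List<String> copies = new ArrayList<>();
                NodeList children = node.getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    if (children.item(j) instanceof Element) { // skip whitespace text nodes
                        Element copy = (Element) children.item(j);
                        copies.add(copy.getTagName() + ":" + copy.getTextContent().trim());
                    }
                }
                placement.put(node.getAttribute("name"), copies);
            }
            return placement;
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid placement specification", e);
        }
    }

    public static void main(String[] args) {
        // The <PLACEMENT> wrapper is an assumption, not part of the slide.
        String xml = "<PLACEMENT>"
            + "<NODE name='node1'><MASTER>R</MASTER><MASTER>S</MASTER><SLAVE>T</SLAVE></NODE>"
            + "<NODE name='node2'><MASTER>R</MASTER><SLAVE>S</SLAVE></NODE>"
            + "</PLACEMENT>";
        System.out.println(parse(xml));
    }
}
```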
Interface: Applications / RepDB* (2)
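import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

Connection c;
Statement s;
// Load the RepDB* JDBC driver and connect to a cluster node
Class.forName("org.atlas.repdb.jdbc.Driver");
c = DriverManager.getConnection("jdbc:repdb://node0:4444/", "login", "password");
s = c.createStatement();
// The leading tags name the relations written (R, S) and read (T)
s.executeUpdate(
    "<WRITE>R, S</WRITE><READ>T</READ>"
    + "UPDATE R SET att2 = 1 WHERE att1 IN (SELECT att3 FROM T); "
    + "UPDATE S SET att2 = 1 WHERE att1 NOT IN (SELECT att3 FROM T);");
s.close();
c.close();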
Experiments (1): TPC-C benchmark
1 / 5 / 10 warehouses
10 clients per warehouse
Transactions' arrival rate is 1 s / 200 ms / 100 ms
4 types of transactions:
  New-order: read-write, high frequency (45%)
  Payment: read-write, high frequency (45%)
  Order-status: read, low frequency (5%)
  Stock-level: read, low frequency (5%)
Experiments (2)
Cluster of 64 nodes
PostgreSQL 7.3.2
1 Gb/s network
2 configurations
  Fully Replicated (FR)
  Partially Replicated (PR): each type of TPC-C transaction runs using ¼ of the nodes
Experiments (3): Scale up
[Figure: scale-up results. a) Fully Replicated (FR); b) Partially Replicated (PR)]
Experiments (4): Speed up
+ Launch 128 clients that submit Order-status transactions (read-only)
[Figure: speed-up results. a) Fully Replicated (FR); b) Partially Replicated (PR)]
Experiments (5): Unordered messages
[Figure: unordered-message results. a) Fully Replicated (FR); b) Partially Replicated (PR)]
Experiments (6): Delay x Trans. size
Conclusions
Preventive replication
  Strong consistency
  Prevents conflicts for partially replicated databases
  Full node autonomy
  Scale-up and speed-up
Experiments show that the configuration and the placement of the copies should be tuned to selected types of transactions
Current and Future Work
Preventive replication for P2P systems
  Small and dynamic multi-master groups
  Max is computed dynamically
  Small and dynamic slave groups
Optimistic replication
  Distributed semantic reconciliation
Thanks !
Merci !
Obrigado !
Questions ?