diy: a distributed database cluster, or: mysql cluster


DESCRIPTION

Live from the International PHP Conference 2013: MySQL Cluster is a distributed, auto-sharding database offering 99.999% high availability. It runs on a Raspberry Pi as well as on a cluster of multi-core machines. A 30 node cluster was able to deliver 4.3 billion (not million) read transactions per minute in 2012. Take a deeper look into the theory behind all the MySQL replication/clustering solutions (including 3rd party) and learn how they differ.

TRANSCRIPT

Page 1: DIY: A distributed database cluster, or: MySQL Cluster

MySQL Cluster talk

DIY. No Best Practices. No Product Presentation… you have been warned.

No marketing fluff

Page 2: DIY: A distributed database cluster, or: MySQL Cluster

Foreword and disclaimer

Do it yourself, become a maker, get famous!

In this course you will learn how to create an eager update anywhere cluster. You need:

● A soldering iron, solder
● Wires (multiple colors recommended)
● A collection of computers

By the end of the talk you can either challenge MySQL, or get MySQL Cluster for free – it's Open Source, as it has always been. Get armed with the distributed systems theory you, as a developer, need to master any distributed database.

Page 3: DIY: A distributed database cluster, or: MySQL Cluster

DIY – Distributed Database Cluster, or: MySQL Cluster

Ulf Wendel, MySQL/Oracle

No marketing fluff

Page 4: DIY: A distributed database cluster, or: MySQL Cluster

Live on stage: Making a Cluster

Page 5: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Beautiful work, but unfortunately the DIY troubles begin before the first message has been delivered in our cluster.

Long before we can speak about the latest hat fashion, we have to fix wiring and communication! Communication should be:
• Fast
• Reliable (loss, retransmission, checksum, ordering)
• Secure

Network performance is a limiting factor for distributed systems. Hmm, we better go back to the drawing board before we mess up more computers...

Page 6: DIY: A distributed database cluster, or: MySQL Cluster

Back to the beginning: goals

Availability
• Cluster as a whole unaffected by loss of nodes

Scalability
• Geographic distribution
• Scale size in terms of users and data
• Database specific: read and/or write load

Distribution Transparency
• Access, Location, Migration, Relocation (while in use)
• Replication
• Concurrency, Failure

Page 7: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

A distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency.

MySQL Cluster has a shared-nothing design good enough for 99.999% availability (five minutes downtime per year). It scales from a Raspberry Pi run in a briefcase to 1.2 billion write transactions per minute on a 30 data node cluster (if using possibly unsupported bleeding edge APIs). It offers full distribution transparency, with the exception of partition relocation, which has to be triggered manually but is performed transparently by the cluster. That's the bar to beat. Let's learn what kinds of clusters exist, how they tick and what the best algorithms are.

Page 8: DIY: A distributed database cluster, or: MySQL Cluster

What kind of cluster?

Two questions classify a cluster: where are transactions run, and when does synchronization happen?

                 Primary Copy               Update Anywhere
  Eager          Not available for MySQL    MySQL Cluster; 3rd party
  Lazy           MySQL Replication;         MySQL Cluster Replication
                 3rd party

Page 9: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

A wide range of clusters can be categorized by asking where transactions are run and when replicas synchronize their data. Any eager solution ensures that all replicas are synchronized at any time: it offers strong consistency. A transaction cannot commit before synchronization is done. Please note what this means for transaction rates (rough numbers follow below):

• Single computer tx rate ~ disk/fsync rate
• Lazy cluster tx rate ~ disk/fsync rate
• Eager cluster tx rate ~ network round-trip time (RTT)

Test: Would you deploy MySQL Cluster on Amazon EC2 :-) ?
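To put rough numbers on these proportions (illustrative assumptions, not benchmark results):

  fsync ≈ 5 ms         → single computer / lazy cluster: ~1000 ms / 5 ms ≈ 200 serial commits/s
  LAN RTT ≈ 0.2 ms     → eager cluster: ~1000 ms / 0.2 ms ≈ 5000 commit rounds/s
  EC2-style RTT ≈ 2 ms → eager cluster: ~1000 ms / 2 ms ≈ 500 commit rounds/s

The higher and less predictable the network round-trip time, the lower the commit rate of an eager cluster – which is the point of the EC2 question above.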

Page 10: DIY: A distributed database cluster, or: MySQL Cluster

Lazy Primary Copy we have...

Master (Primary) – takes all writes
Slave (Copy), Slave (Copy), Slave (Copy) – serve reads

Lazy synchronization: eventual consistency

Primary Copy: the primary is where any transaction may run

Page 11: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

MySQL Replication falls into the category of lazy Primary Copy clusters. It is a rather inflexible solution as all updates must be sent to the primary. However, this simplifies concurrency control of conflicting, concurrent update transactions. Concurrency control is no different from a single database.

Lazy replication can be fast. Transactions don't have to wait for synchronization of replicas. The price of the fast execution is the risk of stale reads and eventual consistency. Transactions can be lost when the primary crashes after commit and before any copy has been updated. (This is something you can avoid by using MySQL semi-synchronous replication, which delays the commit acknowledgement until delivery to at least one copy.)
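As a side note, a minimal sketch of enabling semi-synchronous replication on a classic MySQL 5.5+ setup (the plugin file name assumes a Unix build, and the timeout value is an arbitrary example):

  -- On the primary (master):
  INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
  SET GLOBAL rpl_semi_sync_master_enabled = 1;
  SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- in ms

  -- On each copy (slave):
  INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
  SET GLOBAL rpl_semi_sync_slave_enabled = 1;
  STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;

Note the fallback: after the timeout the primary silently degrades to fully asynchronous replication, so semi-sync narrows the window for lost transactions rather than closing it entirely.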

Page 12: DIY: A distributed database cluster, or: MySQL Cluster

BTW, confusing: Multi-Master

Left cluster: Master (Primary) with Slave (Copy), runs SET A = 1
Right cluster: Master (Primary) with Slave (Copy), runs SET B = 1

Both masters replicate from each other; both end up holding A and B

Page 13: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Be aware of the term Multi-Master. The MySQL community sometimes uses it to describe a set of Primary Copy clusters whose primaries (masters) replicate from each other. This is one of the many possible topologies that you can build with MySQL Replication. In the example, the PC cluster on the left manages table A and the PC cluster on the right manages table B. The primaries copy table A and table B from each other, respectively. There is no concurrency control and conflicts can arise. There is no distribution transparency. With regards to our where and when criteria, this is not a distinct kind of cluster. And, it is rarely what you want...

Not a good goal for DIY – let's move on.

Page 14: DIY: A distributed database cluster, or: MySQL Cluster

Let's do Eager Update Anywhere

Replica, Replica, Replica, Replica – writes and reads can go to any of them

Eager synchronization: strong consistency

Update Anywhere: any transaction can run on any replica

Page 15: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

An eager update anywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. Synchronization happens as part of the commit, thus strong consistency is achieved. Remember: transaction rate ~ network RTT. Failure tolerance is better than with Primary Copy. There is no single point of failure – the primary – whose loss can cause a total outage of the cluster. Nodes may fail without bringing the cluster down immediately. Concurrency control (synchronization) is complex as concurrent transactions from different replicas may conflict.

Page 16: DIY: A distributed database cluster, or: MySQL Cluster

Concurrency Control: 1SR

Replica 1 runs t0: SET a = 1 while Replica 2 runs t0: SET a = 2

One-Copy-Serializability (1SR) for correctness
• All replicas must decide on the same transaction order

Without agreement, one replica may apply a = 1 then a = 2 while the other applies a = 2 then a = 1 – the copies diverge.

Page 17: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Concurrent ACID transactions must be isolated from each other to ensure correctness. The database system needs a mechanism to detect conflicts. If there are any, transactions need to be serialized. The challenge is to have all replicas commit transactions in the same serial order. One-Copy-Serializability (1SR) demands that the concurrent execution of transactions in a replicated database be equivalent to a serial execution of these transactions over a single logical copy of the database. 1SR is the highest level of consistency; lower levels exist, for example, snapshot isolation. Given that, the questions are:
• How to detect conflicting transactions? (a textbook definition follows below)
• How to enforce a global total order?
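For reference, the standard definition from serializability theory (not specific to any MySQL product): two transactions ti and tj conflict iff they access the same data item and at least one of them writes it,

  conflict(ti, tj) ⟺ (RS(ti) ∪ WS(ti)) ∩ WS(tj) ≠ ∅  ∨  WS(ti) ∩ (RS(tj) ∪ WS(tj)) ≠ ∅

where RS is the read set and WS the write set of a transaction.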

Page 18: DIY: A distributed database cluster, or: MySQL Cluster

Certification: detect conflict

One replica executes an update transaction while the other replicas serve read queries

Example sets recorded during execution:
Read set: a = 1
Write set: b = 12

Transactions get executed and certified before commit
• Conflict detection is based on read and write sets
• Multi-Primary deferred update
• Certification takes place on every replica

Page 19: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

(For brevity we discuss multi-primary deferred update only.) In a multi-primary deferred update system a read query can be served by a replica without consulting any of the other replicas. A write transaction must be certified by all other replicas before it can commit. During the execution of the transaction, the replica records all data items read and written. The read/write sets are then forwarded by the replica to all other replicas to certify the remote transaction. The other replicas check whether the remote transaction includes data items modified by an active local transaction. The outcome of the certification decides on commit or abort. Either symmetric (statement based) or asymmetric (row based) replication can be used.
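In pseudo-notation, the test each replica applies when certifying a remote transaction tr against its active local transactions (a sketch of the general scheme, not of any concrete product):

  certify(tr) = commit  iff  for every active local transaction tl:
                (RS(tr) ∪ WS(tr)) ∩ WS(tl) = ∅  and  WS(tr) ∩ RS(tl) = ∅

If any overlap is found, one of the two transactions has to abort.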

Page 20: DIY: A distributed database cluster, or: MySQL Cluster

Concurrency Control

Replica 1 runs t0: SET a = 1 while Replica 2 runs t0: SET a = 2

Various synchronization mechanisms
• Atomic commit
• Atomic broadcast
• Strict two-phase locking (2PL)
• Optimistic, Physical clock, Lamport's clock, vector clock...

With agreement on one global order, all replicas end up with the same value.

Page 21: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

One challenge remains: replicas must agree on a global total order for committing transactions, no matter in which order they receive messages.

We will discuss atomic commit (two-phase commit) and atomic broadcast. The other approaches are out of scope.

Page 22: DIY: A distributed database cluster, or: MySQL Cluster

Atomic commit for CC

Atomic commit as a state machine: Execute → Committing → PreCommit → Committed, with aborts branching to the Aborted state

Formula (background): serial execution, unnecessary aborts

Page 23: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Atomic commit can be expressed as a state machine with the final states abort and commit. Once a transaction has been executed, it enters the committing state in which certification/voting takes place. Given the absence of conflicting concurrent transactions, a replica sets the transaction's status to precommit. If all replicas precommit, the transaction is committed, otherwise it is aborted.

Don't worry about the formula. It checks for concurrent transactions – as we did before – and ensures, in case of conflicts, that only one transaction can commit at a time. Problem: it may also cause unnecessary aborts depending on message delivery order, as it requires all servers to move from precommit to commit in the same order.

Page 24: DIY: A distributed database cluster, or: MySQL Cluster

Atomic broadcast for CC

Atomic broadcast guarantees

• Agreement: if one server delivers a message, all will

• Total order: all servers deliver messages in the same order

Greatly simplified concurrency check

• Deterministic: no extra communication after local decision

Page 25: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Atomic broadcast ensures that transactions are delivered in the same order to all replicas. Thus, certification of transactions is deterministic: all replicas will make the same decision about commit or abort because they all base their decision on the same facts. This in turn means that there is no need to coordinate the decisions of all replicas.

A transaction does not conflict, and thus will commit, if it is executed after the commit of any other transaction, or if its read set does not overlap with the write set of any other transaction. The formula is greatly simplified! Great for DIY!

Page 26: DIY: A distributed database cluster, or: MySQL Cluster

Voting quorum: ROWA, or...?

Read-One Write-All is a special quorum

• Quorum constraints: NR + NW > N, NW > N/2

Example: N = 12, read quorum NR = 3, write quorum NW = 10

Example: N = 3, read quorum NR = 2, write quorum NW = 2

Page 27: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

So far we have silently assumed a Read-One Write-All (ROWA) quorum for voting. Reads could be served locally because updates have been applied to all replicas.

Alternatively, we could make a rule that an update has to be agreed by and applied to half of the replicas plus one. This may be faster than achieving agreement among all replicas. However, for a correct read we now have to contact half of the replicas plus one and check whether they all give the same reply. If so, we must have read the latest version as the remaining, unchecked replicas form a minority that cannot be updated. The read quorum overlaps the write quorum by at least one element.
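Plugging the example numbers from the slide into the constraints (simple arithmetic, just to verify):

  N = 12: NR + NW = 3 + 10 = 13 > 12  and  NW = 10 > 12/2 = 6   → valid quorum
  N = 3:  NR + NW = 2 + 2  = 4  > 3   and  NW = 2  > 3/2  = 1.5 → valid quorum

The first constraint forces every read quorum to intersect every write quorum; the second forces any two write quorums to intersect, so two conflicting writes can never both obtain a quorum.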

Page 28: DIY: A distributed database cluster, or: MySQL Cluster

Voting quorum: ROWA!

ROWA almost always performs better

• Are Quorums an Alternative for Data Replication? (Jiménez-Peris et al.)

• „The obvious conclusion from these results is that ROWAA is the best choice for a wide range of application scenarios. It offers good scalability (within the limitations of replication protocols), very good availability, and an acceptable communication overhead. It also has the significant advantage of being very simple to implement and very easy to adapt to configuration changes. For very peculiar loads and configurations, it is possible that some variation of quorum does better than ROWAA.“

• Background: scale out results from study

Page 29: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Judging from the paper, ROWA respectively Read-One Write-All-Available (ROWAA) is a promising approach. For example, it offers linear scalability for read only workloads but still remains competitive for mixed update and read loads. It requires a high write-to-read ratio before the various quorum algorithms outperform ROWA on scalability. In sum: ROWA beats quorums by a magnitude for reads but does not drop by a magnitude for writes, and the web is read dominated. Scalability is one aspect. Quorums also help with availability – the study's finding is similar: ROWA is fine. DIY decision on concurrency control: ROWA, atomic broadcast. Quiz: name a system using quorums? Riak! Next: Availability and Fault Tolerance.

Page 30: DIY: A distributed database cluster, or: MySQL Cluster

Fault Tolerance: 2PC

Complex failure handling required
• Later evolution: Three-Phase Commit (3PC)

Message flow: the Coordinator sends Vote Request to all Participants; each Participant answers PreCommit (or an abort vote); the Coordinator then sends Global Commit, and the Participants commit

Page 31: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

When discussing atomic commit we have effectively shown the Two-Phase Commit (2PC) protocol. 2PC starts with a vote request multicasted from a coordinator to all participants. The participants either vote to commit (precommit) or to abort. Then, the coordinator checks the voting result. If all voted to commit, it sends a global commit message and the participants commit. Otherwise the coordinator sends a global abort command. Various issues may arise in case of network or process failures. Some cannot be cured using timeouts. For example, consider the situation when a participant precommits but gets no global commit or global abort. The participant cannot unilaterally leave the state. At best, it can ask another participant what to do.

Page 32: DIY: A distributed database cluster, or: MySQL Cluster

Fault Tolerance: 2PC

Two-Phase Commit is a blocking protocol

Message flow in the failure case: the Coordinator sends Vote Request to all Participants and each Participant answers PreCommit – then the Coordinator crashes before sending a global decision

Page 33: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

The worst case scenario is a crash of the coordinator after all participants have voted to precommit. The participants cannot leave the precommit state before the coordinator has recovered. They do not know whether all of them have voted to commit or not. Thus, they do not know whether a global commit or global abort has to be performed.

As none of them has received a message about the outcome of the voting, asking one another for the outcome does not help either – the participants are stuck.

Two-Phase Commit is also known as a blocking protocol.

Page 34: DIY: A distributed database cluster, or: MySQL Cluster

Virtual Synchrony

Reliable multicast/broadcast
• Built on the idea of group views and view changes

Figure: messages M1, M2, M3 are multicast within group view G1 = {P1, P2, P3}; P4 joins via a view change (VC) while M3 is still in flight; M4 is delivered in the new view G2 = {P1, P2, P3, P4}

Page 35: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Virtual Synchrony is a mechanism that does not block. It is built around the idea of associating multicast messages with the notion of a group. A message is delivered to all members of a group but to no other processes. Either the message is delivered to all members of a group or to none of them. All members of the group agree that they are part of the group before the message is multicast (group view). In the example, M1..M3 are associated with the group G1 = {P1, P2, P3}. If a process wants to join or leave a group, a view change message is multicast. In the example, P4 wants to join the group and a VC message is sent while M3 is still being delivered. Virtual Synchrony requires that M3 is either delivered to all of G1 before the view change takes place or to none.

Page 36: DIY: A distributed database cluster, or: MySQL Cluster

Virtual Synchrony

View changes act as a message barrier
• Remember the issues with 2PC …?

Figure: M5 and M6 are delivered within G2 = {P1, P2, P3, P4}; P4 crashes while sending M7, which only reaches P3; a view change (VC) installs G3 = {P1, P2, P3} and M8 is delivered in the new view

Page 37: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

There is only one condition under which a multicast message is allowed not to be delivered: if the sender crashed. Assume the processes continue working and multicast messages M5, M6, M7 to group G2 = {P1, P2, P3, P4}. While P4 sends M7 it crashes. P4 has managed to deliver its message to {P3}. The crash of P4 is noticed and a view change is triggered. Because Virtual Synchrony requires a message to be delivered to all members of the group associated with it but the sender crashed, P3 is free to drop M7 and the view change can take place.

A new group view G3 is established and messages can be exchanged again.

Page 38: DIY: A distributed database cluster, or: MySQL Cluster

Wire: message ordering and fault tolerance

• Common choices: UDP or TCP over IP

Reliable, delivered vs. received

Figure: two replicas each multicast an update. One replica receives t1: Update 1, t2: Update 2; the other receives t1: Update 2, t2: Update 1 (lost) – the same messages arrive in different orders, and one is lost

Page 39: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Virtual Synchrony offers reliable multicast. Reliability can best be achieved using a protocol higher up the OSI model. Isis, an early framework implementing Virtual Synchrony, used TCP point-to-point connections if reliable service was requested. TCP is a connection oriented protocol (endpoint failures can be detected easily) with error handling and message delivery in the order sent. However, using TCP alone there are no ordering constraints between messages from any two senders. Those ordering constraints have to be implemented at the application layer. In the model discussed, we say a message can be received on the network layer in a different order than it is delivered to the application. Vector clocks can be used for global total ordering.

Page 40: DIY: A distributed database cluster, or: MySQL Cluster

AB = Virtual Synchrony offering total-order delivery

• „Synchrony“ does not refer to temporal aspects

Atomic broadcast definition

Figure: the same messages M1 and M2 reach processes P1..P4 – on the left with unordered delivery (different processes deliver M1 and M2 in different orders), on the right with ordered delivery (all processes deliver M1 before M2)

Page 41: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Atomic broadcast means Virtual Synchrony used with total-order message delivery. When Virtual Synchrony was introduced back in the mid 80s, it was explicitly designed to allow other message orderings as well. For example, it should be able to support distributed applications that have a notion of finding messages that commute, and thus may be applied in an order different from the order sent, to improve performance. If events are applied in different orders on different processes, the system cannot be called synchronous any more – the inventors called it virtually synchronous.

However, recall we are only after total ordering for 1SR.

However, recall we are only after total-ordering for 1SR.

Page 42: DIY: A distributed database cluster, or: MySQL Cluster

How to cook brains

Wash the brain, no marketing fluff, split brain, done!
• System dependent... E.g. the Isis failure detector was very basic

Figure: group view n1({P1, P2, P3, P4}) = 4 delivers M1, M2; then the connection is lost – split brain. Each half attempts a view change (VC), but n2({P1, P2}) = 2 < n1/2 + 1 = 3, so neither half may install a new view

Page 43: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

The failure of individual processes – or database replicas – has been discussed. The model has measures to handle them using a fail stop approach. To conclude the discussion of fault tolerance we look at a situation called split brain: one half of the cluster lost connection to the other half. Which shall survive? The answer is often implementation dependent. For example, the early Virtual Synchrony framework Isis had a rule that a new group view can only be installed if it contains n / 2 + 1 members, with n being the number of members in the current group. In the example both halves would shut down. Brain splitting question: how many replicas would you project for a cluster if you don't know the split brain implementation details?
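A hint at the answer, worked through with the n / 2 + 1 rule (the rule itself is system dependent, as stated above; n / 2 uses integer division):

  n = 4, split 2/2: both halves have 2 < 4/2 + 1 = 3 members → the whole cluster goes down
  n = 5, split 3/2: the larger half has 3 ≥ 5/2 + 1 = 3 members → it survives

With an even node count a clean half/half split can take down everything; an odd count (or an external arbitrator, see MySQL Cluster later) always leaves one side with a majority.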

Page 44: DIY: A distributed database cluster, or: MySQL Cluster

In-core architecture

DIY: Hack MySQL (oh, oh), or...?

Figure, top to bottom: PHP clients → Load Balancer (e.g. PECL/mysqlnd_ms, MySQL Proxy) → MySQL DBMS processes, each containing a Reflector and a Replicator, connected through a GCS

Page 45: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Here's a generic architecture made of the following components:

• Clients (PHP, Java, …) using well known interfaces
• Load Balancer (for example PECL/mysqlnd_ms)
• The actual database system
• The reflector, which allows inspection and modification of on-going transactions
• The (distributed) replicator handling concurrency control
• The Group Communication System (GCS), which provides communication primitives such as multicast (GCS examples: Appia, JGroups – Java, Spread – C/C++)

Page 46: DIY: A distributed database cluster, or: MySQL Cluster

Middleware architecture

DIY: Hack MySQL (oh, oh), or...?

Figure: Clients → Load Balancer → Virtual DBMS processes containing Reflector and Replicator, connected through a GCS, each sitting in front of an actual DBMS

Page 47: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

An in-core design requires support for a reflector by the database. Strictly speaking there is no API inside MySQL one could use. The APIs used for MySQL Replication are not sufficient. Nonetheless, MySQL Replication can be classified as in-core in our model. Due to the lack of a reflector API, the only third party product following an in-core design (Galera by Codership) has to patch the MySQL core.

Tungsten Replicator by Continuent is a Middleware design. Clients contact a virtual database. Requests are intercepted, parsed and replicated. The challenge is in the interception: statements using non-deterministic functions such as NOW() and TIME() must be taken care of.

Page 48: DIY: A distributed database cluster, or: MySQL Cluster

Hybrid architecture

DIY: Hack MySQL (oh, oh), or...?

Figure: Clients → Load Balancer → DBMS processes with a Reflector plugin running inside them; the Replicators run as separate processes, connected through a GCS

Page 49: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

In a hybrid architecture the reflector runs within the database process but the replicator layer is using extra processes.

It is not a perfect comparison as we will see later but for the sake of our model, we can classify MySQL Cluster as a hybrid architecture. The reflector is implemented as a storage engine. The replicator layer is using extra processes.

This design has some neat MySQL NDB Cluster specific benefits. If any MySQL product has NoSQL genes, it is MySQL Cluster.

Page 50: DIY: A distributed database cluster, or: MySQL Cluster

DIY: Summary

                 Primary Copy               Update Anywhere
  Eager          Not available for MySQL    MySQL Cluster (Hybrid);
                                            Galera (In-core)
  Lazy           MySQL Replication          MySQL Cluster Replication
                 (In-core);                 (Hybrid)
                 Tungsten (Middleware)

Page 51: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Time for a summary before coding ants and compilers start their work. From a DIY perspective we can skip Lazy Primary Copy: it has simple concurrency control, it does not depend on network speed, it is great for flaky and slow WAN connections, but it offers eventual consistency only (hint: enjoy PECL/mysqlnd_ms!) and it has no means to scale writes. And, it exists already – no karma...

An eager update anywhere solution offering the highest level of correctness (1SR) gives you strong consistency. It scales writes to some degree because they can be executed on any replica, which parallelizes execution load. Commit performance is network bound.

Page 52: DIY: A distributed database cluster, or: MySQL Cluster

DIY: The Master Class

Chart: Read Scale Out vs. Write Scale Out Capability, plotted against Full Replication vs. Partial Replication

• Full replication: MySQL Replication (Lazy Primary Copy, In-core), Tungsten (Primary Copy, Middleware), Galera (Eager Update Anywhere, In-core)
• Partial replication: MySQL Cluster (Eager Update Anywhere, Hybrid)
• If 1SR – hard limit

Page 53: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

The DIY Master Class for maximum karma is a partial replication solution offering strong consistency. Partial replication is the only way to ultimately scale write requests. The explanation is simple: every write adds load to the entire cluster. Remember that writes need to be coordinated, remember that concurrency control involves all replicas (ROWA) or a good number of them (quorum). Thus, every additional replica adds load to all others. The solution is to partition the data set and keep each partition on a subset of all replicas only. NoSQL calls it sharding, MySQL Cluster calls it partitioning. Partial replication – that's the DIY masterpiece, that will give you KARMA.

Page 54: DIY: A distributed database cluster, or: MySQL Cluster

MySQL (NDB) Cluster goals

Availability
• Shared-nothing, High Availability (99.999%)
• WAN Replication to secondary data centers

Scalability
• Read and write scaling through partial replication (partitioning)
• Distributed queries (parallelize work), real-time guarantees
• Focus on in-memory, with disk storage extension
• Sophisticated thread model for multi-core CPUs
• Optimized for short transactions (hundreds of operations)

Distribution Transparency
• SQL level: 100%, low-level interfaces available

Page 55: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

I am not aware of text books discussing partial replication theory in-depth. Thus, we have to reverse engineer an existing system. As this is a talk about MySQL Cluster, how about talking about MySQL Cluster, finally?

MySQL Cluster has originally been developed to serve telecommunication systems. It aims to parallelize work as much as possible, hence it is a distributed database. It started as an in-memory solution but can meanwhile store data on disk as well. It runs best in environments offering low network latency and high network throughput, and issuing short transactions. Applications should not demand complex joins. There is no chance you throw Drupal at it and Drupal runs super-fast out of the box! Let's see why...

Page 56: DIY: A distributed database cluster, or: MySQL Cluster

SQL view: Cluster is yet another table storage engine

MySQL Cluster is a hybrid

Figure: Clients → Load Balancer → MySQL servers; Reflector plugin = NDB Storage Engine inside each MySQL server; Replicator = NDB Data Node processes, connected through a GCS

Page 57: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

MySQL Cluster has a hybrid architecture. It consists of the green elements on the slide. The Reflector is implemented as a MySQL storage engine. From a SQL user's perspective, it is just another storage engine, similar to MyISAM, InnoDB or others (Distribution Transparency). From a SQL client perspective there is no change: all MySQL APIs can be used. The Reflector (NDB Storage Engine) runs as part of the MySQL process. The Replicator is a separate process called NDB data node. Please note, node means process, not machine. MySQL Cluster does not fit perfectly into the model: an NDB data node combines Replicator and storage tasks.

BTW, what happens to the Cluster if a MySQL Server fails?

BTW, what happens to Cluster if a MySQL Server fails?

Page 58: DIY: A distributed database cluster, or: MySQL Cluster

Fast low-level access: bypassing the SQL layer

MySQL Cluster is a beast

Figure: SQL clients → Load Balancer → MySQL servers (Reflector plugin = NDB Storage Engine) → NDB Data Nodes/GCS; additional clients bypass SQL and talk to the data nodes directly

4.3 billion read tx/min, 1.2 billion write tx/min (in 2012)

Page 59: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

From the perspective of MySQL Cluster, a MySQL Server is yet another application client. MySQL Server happens to be an application that implements a SQL view on the relational data stored inside the cluster.

MySQL Cluster users often bypass the SQL layer by implementing application clients of their own. SQL is a rich query language, but parsing a SQL query can take 30...50% of the total runtime of a query. Thus, bypassing is a good idea. The top benchmark results we show for Cluster are achieved using C/C++ clients directly accessing MySQL Cluster. There are many extra APIs for this special case: NDB API (C/C++, low level), ClusterJ (ORM style), ClusterJPA (JPA), … – even for node.js (ORM style).

Page 60: DIY: A distributed database cluster, or: MySQL Cluster

Partitioning (auto-sharding)

Node Group 0: NDB Data Node 1 (Partition 0 Primary, Partition 2 Copy), NDB Data Node 2 (Partition 0 Copy, Partition 2 Primary)
Node Group 1: NDB Data Node 3 (Partition 1 Primary, Partition 3 Copy), NDB Data Node 4 (Partition 1 Copy, Partition 3 Primary)

Page 61: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

There is a lot to say about how MySQL Cluster partitions a table and spreads it over nodes. The manual has all the details, just all...

The key idea is to use an eager primary copy approach for partitions, combined with a mindful distribution of each partition's primary and its copies. NDB supports zero or one copies (replication factor). The failure of a partition's primary does not cause a failure of the Cluster. In the example, the failure of any one node has no impact. Also, nodes 1 and 4 may fail together without a stop of the Cluster (fail stop model). But the cluster shuts down if all nodes of a node group fail.

Page 62: DIY: A distributed database cluster, or: MySQL Cluster

Concurrency Control: 2PL, "2PC"

Figure: the same four data nodes and partitions as before, annotated with one write (W) going to a partition's primary and two reads (R)

Page 63: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Buuuuh? Two-Phase Locking (2PL) and Two-Phase Commit (2PC) are used for concurrency control. Cluster is using traditional row locking to isolate transactions. Read and write locks can be distributed throughout the cluster. The locks are set on the primary partitions. Transactions are serialized during execution. When a transaction commits, an optimized Two-Phase Commit is used to synchronize the partition copies.

The SQL layer recognizes the commit as soon as the copies are updated (and before logs have been written to disk). The low-level NDB C/C++ application API is asynchronous. Fire and forget is possible: your application can continue before transaction processing has even begun!

Page 64: DIY: A distributed database cluster, or: MySQL Cluster

Brain Masala

Figure: the four data nodes with their partitions as before, plus an Arbitrator process outside the data nodes

Page 65: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

The failure of a single node is detected using a heartbeat protocol: details are documented, future improvements are possible. Both MySQL Cluster and Virtual Synchrony separate message delivery from node failure detection.

The worst case scenario of a brain split is cured by the introduction of arbitrators. If the nodes split and each half would be able to keep the Cluster up, the nodes try to contact the arbitrator. It is then up to the arbitrator to decide who stays up and who shuts down. Arbitrators are extra processes, ideally run on extra machines. Management nodes can act as arbitrators too. You need at least one management node for administration, thus you always have an arbitrator readily available.

Page 66: DIY: A distributed database cluster, or: MySQL Cluster

Drupal? Sysbench? Oh, oh...

Figure: a MySQL server issuing a query that touches all four data nodes and all partitions

Page 67: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Partial replication (here: partitioning, sharding) is the only known solution to the write scale out problem. But it comes at the high price of distributed queries.

A SQL query may require reading data from many partitions. On the one hand, work is nicely parallelized over many nodes; on the other hand, records found have to be transferred within the cluster from one node to another. Although Cluster tries to batch requests together efficiently to minimize communication delays, transferring data from node to node to answer queries remains an expensive operation.

Page 68: DIY: A distributed database cluster, or: MySQL Cluster

Oh, oh... tune your partitions!

Figure: the four data nodes with their partitions as before, queried through a MySQL server

CREATE TABLE cities (
  id INT NOT NULL,
  population INT UNSIGNED,
  city_name VARCHAR(100),
  PRIMARY KEY(city_name, id)
);

SELECT id FROM cities WHERE city_name = 'Kiel';

Page 69: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

How much traffic and latency occurs depends on the actual SQL query and the partitioning scheme. By default a table is partitioned into 3840 virtual fragments (think vBuckets) using its primary key. The partitioning can and should be tuned.

Try to find partitioning keys that make your common, expensive or time-critical queries run on a single node. Assume you have a list of cities. City names are not unique, thus you have introduced a numeric primary key. It is likely that your most common query checks for the city name, not for the numeric primary key only. Therefore, your partitioning should be based on the city name as well.
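For NDB tables the partitioning key can be narrowed explicitly to a subset of the primary key columns (a sketch; whether this helps depends on your actual query mix):

CREATE TABLE cities (
  id INT NOT NULL,
  population INT UNSIGNED,
  city_name VARCHAR(100),
  PRIMARY KEY(city_name, id)
) ENGINE=NDB
PARTITION BY KEY(city_name);

With this definition all rows sharing a city name hash into the same partition, so the WHERE city_name = '...' lookup from the slide can be answered without fanning out across the whole cluster.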

Page 70: DIY: A distributed database cluster, or: MySQL Cluster

The ultimate Key-Value-Store?

Figure: the four data nodes with their partitions as before, queried through a MySQL server

CREATE TABLE cities (
  id INT NOT NULL,
  city_name VARCHAR(100),
  PRIMARY KEY(id)
);

SELECT * FROM cities WHERE id = 1;
SELECT * FROM cities WHERE id = 100;

Page 71: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

I may have stated it before: if there is any product at MySQL that can compete with NoSQL (as in key-value store) on the issue of distributed data stores, it is MySQL Cluster.

An optimal query load for MySQL Cluster is one that primarily performs lookups on partitioning keys. Each query will execute on one node only. There is little traffic within the cluster – little network overhead. The work load is perfectly parallelized.

Will your unmodified PHP application perform on Cluster?

Page 72: DIY: A distributed database cluster, or: MySQL Cluster

Joins: 24...70x faster

SELECT t1.a, t2.b FROM t1, t2 WHERE t1.pk = 1 AND t1.a = t2.pk

Then:
NDB_API> read a from table t1 where pk = 1 [round trip] (a = 15)
NDB_API> read b from table t2 where pk = 15 [round trip] (b = 30)
[return a = 15, b = 30]

Now:
NDB_API> read @a=a from table t1 where pk = 1; read b from table t2 where pk = @a [round trip]

Page 73: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

In 7.2 we claim certain joins to execute 24...70x faster with the help of AQL (Adaptive Query Localization, pushing join conditions down to the data nodes)! How come?

Partial replication does not go together well with joins. Take this simple nested join as an example. There are two tables to join. The join condition of the second table depends on the values of the first table. Thus, t1 has to be searched before t2 can be searched and the result can be returned to the user. That makes two operations and two round trips.

As of 7.2, there is a new batched way of doing it. It saves round trips. Some round trips avoided means – at the extreme – 24...70x faster: the network is your enemy #1.

Page 74: DIY: A distributed database cluster, or: MySQL Cluster

Benchmark pitfall: connections

Figure: many, many clients → Load Balancer → two MySQL servers (NDB Storage Engine), each holding connections into the four NDB data nodes

Page 75: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

If you ever come to the point of evaluating MySQL Cluster, make sure you configure the MySQL-to-Cluster connections appropriately (ndb_cluster_connection_pool).

A MySQL Server with only one connection (the default setting) from itself to the cluster may not be able to serve many concurrent clients at the rate the Cluster part itself might be able to handle them. The single connection may impose an artificial limitation on the cluster throughput.
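A minimal sketch of the relevant my.cnf setting (the pool size of 4 is an arbitrary example; each pooled connection also needs its own free [api]/[mysqld] slot in the cluster's config.ini):

  [mysqld]
  ndb_cluster_connection_pool = 4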

Page 76: DIY: A distributed database cluster, or: MySQL Cluster

Adding nodes, rebalancing

Figure: NDB Data Nodes 1-4 hold the existing partitions; new NDB Data Nodes 5 and 6 join the cluster empty

Page 77: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Adding nodes – growing the capacity of your cluster in terms of size and computing power – is an online operation. At any time you can add nodes to your cluster.

New nodes do not immediately participate in operations. You have to tell the cluster what to do with them: use them for new tables, or use them for growing the capacity available to existing tables. When growing existing tables, data needs to be redistributed to the new nodes. Rebalancing is an online operation: it does not block clients. The partitioning algorithm used by Cluster ensures that data is copied to the new nodes only; there is no traffic between the nodes currently holding fragments of the table to be rebalanced.
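The redistribution step itself is a single statement per table (a sketch for MySQL Cluster 7.x; the new node group has to be created first, e.g. via CREATE NODEGROUP in the ndb_mgm management client):

  ALTER ONLINE TABLE cities REORGANIZE PARTITION;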

Page 78: DIY: A distributed database cluster, or: MySQL Cluster

DIY - Summary

We shall...
• Code an Eager Update-Anywhere Cluster
• Prefer a hybrid design to avoid getting too deep into MySQL
• Not fear the lack of text books on partial replication
• Read CPU vendor tuning guides like comics
• Like Sweden or Finland

Send your application to the MySQL Cluster team.

Cluster is different. MySQL Cluster is perfect for web session storage. Whether your Drupal, WordPress, … runs faster is hard to tell – possibly not faster.

PS (marketing fluff): ask Sales for a show!

Page 79: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

By the end of this talk you should remember at least this:

● There are four kinds of replication solutions based on a matrix asking „where can all transactions run“ and „when are replicas synchronized“

● Clusters don't make everything faster – the network is your enemy. For read scale out there are proven solutions.

● Write scale out is only possible through partial replication (a small write quorum would impact read performance)

Page 80: DIY: A distributed database cluster, or: MySQL Cluster

THE END

Contact: [email protected]

Page 81: DIY: A distributed database cluster, or: MySQL Cluster

The speaker says...

Thank you for your attendance!

Upcoming shows:

Talk&Show! (ask... :-))

YourPlace, any time

PHP SummitMunich, December 2013