espresso database replication with kafka, tom quiggle

50
Distributed Data Systems 1 ©2016 LinkedIn Corporation. All Rights Reserved. ESPRESSO Database Replication with Kafka Tom Quiggle Principal Staff Software Engineer [email protected] www.linkedin.com/in/tquiggle @TomQuiggle

Upload: confluent

Post on 15-Apr-2017

1.462 views

Category:

Engineering


1 download

TRANSCRIPT

Page 1: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 1 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO Database Replication with KafkaTom QuigglePrincipal Staff Software [email protected]/in/tquiggle@TomQuiggle

Page 2: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 2 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO Overview– Architecture– GTIDs and SCNs– Per-instance replication (0.8)– Per-partition replication (1.0)

Kafka Per-Partition Replication– Requirements– Kafka Configuration– Message Protocol– Producer– Consumer

Q&A

Agenda

Page 3: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 3 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO OverviewESPRESSO Database Replication with Kafka

Page 4: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 4 ©2016 LinkedIn Corporation. All Rights Reserved.

Hosted, Scalable, Data as a Service (DaaS) for LinkedIn’s Online Structured Data Needs

Databases are partitioned Partitions distributed across available hardware HTTP proxy routes requests to appropriate database node Apache Helix provides centralized cluster management

ESPRESSO1

1. Elastic, Scalable, Performant, Reliable, Extensible, Stable, Speedy and Operational

Page 5: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 5 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO Architecture

Page 6: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 6 ©2016 LinkedIn Corporation. All Rights Reserved.

GTIDs and SCNs

MySQL 5.6 Global Transaction Identifier Unique, monotonically increasing, identifier for each transaction

committed GTID :== source_id:transaction_id ESPRESSO conventions

– source_id encodes database name and partition number– transaction_id is a 64 bit numeric value

High Order 32 bits is generation count Low order 32 bit are sequence within generation

– Generation increments with every change in mastership– Sequence increases with each transaction– We refer to a transaction_id component as a Sequence Commit Number (SCN)

Page 7: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 7 ©2016 LinkedIn Corporation. All Rights Reserved.

GTIDs and SCNs

Example binlog transaction:SET @@SESSION.GTID_NEXT= 'hash(db_part):(gen<<32 + seq)';SET TIMESTAMP=<seconds_since_Unix_epoch>BEGINTable_map: `db_part`.`table1` mapped to number 1234Update_rows: table id 1234BINLOG '...'BINLOG '...'Table_map: `db_part`.`table2` mapped to number 5678Update_rows: table id 5678BINLOG '...'COMMIT

Page 8: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 8 ©2016 LinkedIn Corporation. All Rights Reserved.

Node 1

P1 P2 P3

Node 2

P1 P2 P3

Node 3

P1 P2 P3

Node 1

P4 P5 P6

Node 2

P4 P5 P6

Node 3

P4 P5 P6

ESPRESSO: 0.8 Per-Instance Replication

Master

Slave

Offline

Page 9: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 9 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO: 0.8 Per-Instance Replication

Node 1

P1 P2 P3

Node 2

P1 P2 P3

Node 3

P1 P2 P3

Node 1

P4 P5 P6

Node 2

P4 P5 P6

Node 3

P4 P5 P6

Master

Slave

Offline

Page 10: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 10 ©2016 LinkedIn Corporation. All Rights Reserved.

Node 1

P1 P2 P3

Node 2

P1 P2 P3

Node 3

P1 P2 P3

Node 1

P4 P5 P6

Node 2

P4 P5 P6

Node 3

P4 P5 P6

Master

Slave

Offline

ESPRESSO: 0.8 Per-Instance Replication

Page 11: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 11 ©2016 LinkedIn Corporation. All Rights Reserved.

Issues with Per-Instance Replication

Poor resource utilization – only 1/3 of nodes service application requests

Partitions unnecessarily share fate Cluster expansion is an arduous process Upon node failure, 100% of the traffic is redirected to one node

Page 12: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 12 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO: 1.0 Per-Partition ReplicationPer-Instance MySQL replication replaced with Per-Partition Kafka

HELIX

P4:Master: 1Slave: 3…

EXTERNALVIEW

Node 1Node 2Node 3

LIVEINSTANCESNode 1

P1 P2

P4

P3

P5 P6

P9 P10

Node 2

P5 P6

P8

P7

P1 P2

P11 P12

Node 3

P9 P10

P12

P11

P3 P4

P7 P8

Kafka

Page 13: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 13 ©2016 LinkedIn Corporation. All Rights Reserved.

Cluster ExpansionInitial State with 12 partitions, 3 storage nodes, r=2

HELIX

EXTERNALVIEW

Node 1Node 2Node 3

LIVEINSTANCESNode 1

P1 P2

P4

P3

P5 P6

P9 P10

Node 2

P5 P6

P8

P7

P1 P2

P11 P12

Node 3

P9 P10

P12

P11

P3 P4

P7 P8

Master

Slave

Offline

P4:Master: 1Slave: 3…

Page 14: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 14 ©2016 LinkedIn Corporation. All Rights Reserved.

Cluster ExpansionAdding Node: Helix Sends OfflineToSlave for new partitions

HELIX

EXTERNALVIEW

Node 1Node 2Node 3Node 4

LIVEINSTANCESNode 1

P1 P2

P4

P3

P5 P6

P9 P10

Node 2

P5 P6

P8

P7

P1 P2

P11 P12

Node 3

P9 P10

P12

P11

P3 P4

P7 P8

Node 4

P4 P8

P1

P12

P7 P9

P4:Master: 1Slave: 3Offline: 4…

Page 15: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 15 ©2016 LinkedIn Corporation. All Rights Reserved.

Cluster ExpansionOnce a new partition is ready, transfer ownership and drop old

HELIX

EXTERNALVIEW

Node 1Node 2Node 3Node 4

LIVEINSTANCESNode 1

P1 P2 P3

P5 P6

P9 P10

Node 2

P5 P6

P8

P7

P1 P2

P11 P12

Node 3

P9 P10

P12

P11

P3 P4

P7 P8

Node 4

P4 P8

P1

P12

P7 P9

P4:Master: 4Slave: 3…

Page 16: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 16 ©2016 LinkedIn Corporation. All Rights Reserved.

Cluster ExpansionContinue migration of master and slave partitions

HELIX

EXTERNALVIEW

Node 1Node 2Node 3Node 4

LIVEINSTANCESNode 1

P1 P2 P3

P5 P6

P9 P10

Node 2

P5 P6 P7

P2

P11 P12

Node 3

P9 P10

P12

P11

P3 P4

P7 P8

Node 4

P4 P8

P1

P12

P7 P9

P9:Master: 3Slave: 1Offline: 4…

Page 17: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 17 ©2016 LinkedIn Corporation. All Rights Reserved.

Cluster ExpansionRebalancing is complete after last partition migration

HELIX

EXTERNALVIEW

Node 1Node 2Node 3Node 4

LIVEINSTANCESNode 1 Node 2

Node 3 Node 4

P4 P8

P1

P12

P7 P9

P5 P6

P2

P7

P11 P12

P9 P10

P3

P11

P4 P8

P1 P2

P5

P3

P6 P10

P9:Master: 4Slave: 3…

Page 18: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 18 ©2016 LinkedIn Corporation. All Rights Reserved.

Node FailoverDuring failure or planned maintenance, promote slaves to master

HELIX

EXTERNALVIEW

Node 1Node 2Node 3Node 4

LIVEINSTANCESNode 1 Node 2

Node 3 Node 4

P4 P8

P1

P12

P7 P9

P5 P6

P2

P7

P11 P12

P9 P10

P3

P11

P4 P8

P1 P2

P5

P3

P6 P10

P9:Master: 4Slave: 3…

Page 19: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 19 ©2016 LinkedIn Corporation. All Rights Reserved.

Node FailoverDuring failure or planned maintenance, promote slaves to master

HELIX

EXTERNALVIEW

Node 1Node 2Node 4

LIVEINSTANCESNode 1 Node 2

Node 3 Node 4

P4 P8

P1

P12

P7 P9

P5 P6

P2

P7

P11 P12

P9 P10

P3

P11

P4 P8

P1 P2

P5

P3

P6 P10

P9:Master: 4Offline: 3…

Page 20: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 20 ©2016 LinkedIn Corporation. All Rights Reserved.

Advantages of Per-Partition Replication

Better hardware utilization– All nodes service application requests

Mastership hand-off done in parallel After node failure, can restore full replication factor in parallel Cluster expansion is as easy as:

– Add node(s) to cluster– Rebalance

Single platform for all Change Data Capture– Internal replication– Cross-colo replication– Application CDC consumers

Page 21: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 21 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Per-Partition Replication

ESPRESSO Database Replication with Kafka

Page 22: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 22 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka for Internal Replication

Page 23: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 23 ©2016 LinkedIn Corporation. All Rights Reserved.

Requirements

Delivery Must Be: Guaranteed In-Order Exactly Once (sort of)

Page 24: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 24 ©2016 LinkedIn Corporation. All Rights Reserved.

Broker Configuration

Replication factor = 3(most LinkedIn clusters use 2)

min.isr=2 Disable unclean leader elections

Page 25: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 25 ©2016 LinkedIn Corporation. All Rights Reserved.

B – Begin txnE – End txnC – Control

Message ProtocolMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104DB_0: 3:104E

Page 26: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 26 ©2016 LinkedIn Corporation. All Rights Reserved.

Message Protocol – Mastership HandoffOld Master

MySQL

ProducerConsumer

Promoted Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104DB_0: 3:104E

4:0C

Page 27: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 27 ©2016 LinkedIn Corporation. All Rights Reserved.

Message Protocol – Mastership HandoffMaster

MySQL

ProducerConsumer

Promoted Slave

MySQL

ProducerConsumer

Consumed own control message

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104DB_0: 3:104E

4:0C

Page 28: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 28 ©2016 LinkedIn Corporation. All Rights Reserved.

Message Protocol – Mastership HandoffOld Master

MySQL

ProducerConsumer

Master

MySQL

ProducerConsumer

Enable writes with new gen

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104DB_0: 3:104E

4:0C

4:0B

Page 29: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 29 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer Configuration

acks = “all” retries = Integer.MAX_VALUE block.on.buffer.full=true max.in.flight.requests.per.connection=1 linger=0 On non-retryable exception:

– destroy producer– create new producer– resume from last checkpoint

Page 30: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 30 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer CheckpointingMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104

Can’tCheckpoint

Here

Periodically writes (SCN, Kafka Offset) to MySQL tableMay only checkpoint offset at end of valid transaction!

Page 31: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 31 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer CheckpointingMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104

Producer checkpoint will lag current producer Kafka OffsetKafka Offset obtained from callback

LastCheckpoint

Here

Page 32: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 32 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer CheckpointingMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Last

CheckpointHere

send()FAILS

X

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:104

Page 33: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 33 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer CheckpointingMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Recreate producer and resume from last checkpoint

Resume From

Checkpoint

Messages will be replayed

3:102B

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104

Page 34: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 34 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka Producer CheckpointingMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Kafka stream now contains replayed transactions(possibly including partial transactions)

Can Checkpoint

Here

Replayed Messages

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:102B

3:102 3:102E

3:103B,E

3:104B

3:104

Page 35: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 35 ©2016 LinkedIn Corporation. All Rights Reserved.

Partition 3

Kafka Consumer

Uses Low Level Consumer Consume Kafka partitions slaved on node

Partition 1

Partition 2

Kafka Broker A

Kafka Broker B

Kafka Consumer

poll()

Consumer Thread

EspressoKafkaConsumer

EspressoReplicationApplier

MySQL

P1

P2

P3

ApplierThreads

Page 36: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 36 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka ConsumerMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104

Slave updates (SCN, Kafka Offset) row for every committed txn

3:101@2

Page 37: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 37 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka ConsumerMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Client only applies messages with SCN greater than last committed

Replayed Messages

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:102B

3:102 3:102E

3:103B,E

3:104B

3:104

BEGINTransaction

3:104

3:103@6

Page 38: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 38 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka ConsumerMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Incomplete transaction is rolled back

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:102B

3:102 3:102E

3:103B,E

3:104B

3:104

ROLLBACK3:104

Replayed Messages

3:103@6

Page 39: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 39 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka ConsumerMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

Client only applies messages with SCN greater than last committed

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:102B

3:102 3:102E

3:104B

3:104

SKIP3:102..3:10

3

Replayed Messages

3:103@6

3:103B,E

Page 40: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 40 ©2016 LinkedIn Corporation. All Rights Reserved.

Kafka ConsumerMaster

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:101B,E

3:102B

3:102 3:102E

3:100B,E

3:103B,E

3:104B

3:104 3:102B

3:102 3:102E

3:104B

3:104

Replayed Messages

BEGIN3:104

(again)

3:104E

3:103@6

3:103B,E

Page 41: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 41 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write Filtering

What if stalled master continues writing after transition?

Page 42: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 42 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write FilteringMASTER

MySQL

ProducerConsumer

Slave

MySQL

ProducerConsumer

3:102B

3:102 3:102E

3:103B,E

3:104B

3:104

Master Stalled

Page 43: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 43 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write FilteringMaster

MySQL

ProducerConsumer

Promoted Slave

MySQL

ProducerConsumer

3:102B

3:102 3:102E

3:103B,E

3:104B

3:104

Master Stalled4:0C

Helix sends SlaveToMaster transition to one of the slaves

Page 44: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 44 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write FilteringMaster

MySQL

ProducerConsumer

New Master

MySQL

ProducerConsumer

Master Stalled3:102

B3:102 3:102

E3:103B,E

3:104B

3:104 4:0C

4:1B,E

4:2B

4:2E

Slave becomes master and starts taking writes

Page 45: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 45 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write FilteringMaster

MySQL

ProducerConsumer

New Master

MySQL

ProducerConsumer

3:102B

3:102 3:102E

3:103B,E

3:104B

3:104 4:0C

4:1B,E

4:2B

4:2E

3:104E

3:105B,E

Stalled Master resumes and sends binlog entries to Kafka

Page 46: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 46 ©2016 LinkedIn Corporation. All Rights Reserved.

Zombie Write FilteringERROR

MySQL

ProducerConsumer

New Master

MySQL

ProducerConsumer

3:102B

3:102 3:102E

3:103B,E

3:104B

3:104 4:0C

4:1B,E

4:2B

4:2E

3:104E

3:105B,E

4:3B,E

Former master goes into ERROR stateZombie writes filtered by all consumers based on increasing SCN rule

Page 47: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 47 ©2016 LinkedIn Corporation. All Rights Reserved.

Current Status

ESPRESSO Database Replication with Kafka

Page 48: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 48 ©2016 LinkedIn Corporation. All Rights Reserved.

ESPRESSO Kafka Replication: Current Status

Pre-Production integration environment migrated to Kafka replication 8 production clusters migrated (as of 4/11) Migration will continue through Q3 of 2016 Average replication latency < 90ms

Page 49: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems 49 ©2016 LinkedIn Corporation. All Rights Reserved.

Conclusions

Configure Kafka for reliable, at least once, delivery. See:

http://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844

Carefully control producer and consumer checkpoints along txn boundaries

Embed sequence information in message stream to implement exactly-once application of messages

Page 50: Espresso Database Replication with Kafka, Tom Quiggle

Distributed Data Systems

Even our workspace isHorizontally Scalable!