Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17


TRANSCRIPT

Page 1

One Data Center is Not Enough

Scale and Availability of Apache Kafka in Multiple Data Centers

@gwenshap

Page 2

Page 3

Bad Things

• Kafka cluster failure

• Major storage / network outage

• Entire DC is demolished

• Floods and Earthquakes

Page 4

Disaster Recovery Plan:

“When in trouble

or in doubt

run in circles,

scream and shout”

Page 5

Disaster Recovery Plan:

When This Happens → Do That

• Kafka cluster failure → Fail over to a second cluster in the same data center
• Major storage / network outage → Fail over to a second cluster in another “zone” in the same building
• Entire data center is demolished → Single Kafka cluster running in multiple nearby data centers / buildings
• Floods and earthquakes → Fail over to a second cluster in another region

Page 6

There is no such thing

as a free lunch

Anyone who tells you differently

is selling something

Page 7

Reality:

The same event will not

appear in two DCs at the

exact same time.

Page 8

Things to ask:

• What are the guarantees in the event of an unplanned failover?

• What are the guarantees in the event of a planned failover?

• What is the process for failing back?

• How many data-centers are required?

• How does the solution impact my production performance?

• What are the bandwidth requirements between the data-centers?

Page 9

Every solution needs to balance these trade-offs.

Kafka takes a DIY approach.

Page 10

Stretch Cluster: The Easy Way

• Take 3 nearby data centers.

• Single-digit ms latency between them is good.

• Install at least 1 ZooKeeper node in each.

• Install at least one Kafka broker in each.

• Configure each DC as a “rack”.

• Configure acks=all, min.insync.replicas=2 (see the sketch below).

• Enjoy.
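
A minimal sketch of what this list translates to in practice. The topic name, host names, and partition count are illustrative assumptions, not from the talk:

# server.properties on each broker: tag the broker with its data center
broker.rack=dc-1

# create the topic with one replica per DC and a durability floor of two
bin/kafka-topics --zookeeper localhost:2181 --create --topic orders \
  --partitions 12 --replication-factor 3 \
  --config min.insync.replicas=2

# producer config: an acknowledged write is on all in-sync replicas
acks=all

With broker.rack set per DC, Kafka spreads each partition’s replicas across the three “racks”, so acks=all plus min.insync.replicas=2 means every acknowledged event exists in at least two data centers before the producer moves on.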

Page 11

Diagram!

Page 12

Pros

• Easy to set up

• Failover is “business as usual”

• Sync replication – the only method that guarantees no loss of data.

Cons

• Need 3 data centers nearby

• Cluster failure is still a disaster

• Higher latency, lower throughput compared to “normal” cluster

• Traffic between DCs can be a bottleneck

• Costly infrastructure

Page 13

Want sync replication but only

two data centers?

Page 14

Solutions I can’t recommend:

Solution → Why I hesitate

• 2 ZK nodes in each DC and an “observer” somewhere else → Did anyone do this before?
• 3 ZK nodes in each DC, manually reconfiguring the quorum for failover → You may lose ZK updates during failover; requires manual intervention
• 2 separate ZK clusters + replication

Page 15

Most companies don’t do stretch.

Because:

• Only 2 data centers

• Data centers are far

• One cluster isn’t safe enough

• Not into “high latency”

Page 16

So you want to run 2 Kafka clusters and replicate events between them?

Page 17

Basic async replication
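
The slide is a diagram, but the moving part is simple: an ordinary consumer in one DC feeding an ordinary producer in the other. As a minimal stand-in, the stock MirrorMaker tool that ships with Kafka does exactly this (Confluent Replicator, used later in the talk, plays the same role). The two config files hold plain consumer/producer settings pointing at the source and destination clusters, and the topic pattern is an assumption:

bin/kafka-mirror-maker \
  --consumer.config source-cluster.properties \
  --producer.config destination-cluster.properties \
  --whitelist "orders.*"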

Page 18

Replication Lag

Page 19

Demo #1

Monitoring Replication Lag
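
One way to eyeball the lag from the command line, assuming the replication process consumes with a known consumer group id (MirrorMaker does; the group name here is a placeholder):

bin/kafka-consumer-groups \
  --bootstrap-server localhost:9092 \
  --describe --group mirror-maker-group

# The LAG column is the log-end offset minus the group's committed offset,
# per partition. A steadily growing LAG means the remote cluster is falling behind.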

Page 20

Active-Active or Active-Passive?

• Active-Active is efficient: you use both DCs

• Active-Active is easier because both clusters are equivalent

• Active-Passive has lower network traffic

• Active-Passive requires less monitoring

Page 21

Active-Active Setup
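
A sketch of the wiring behind this diagram, using Confluent Replicator’s documented connector config (cluster addresses are placeholders, and converter and other required settings are omitted for brevity). Each cluster pulls the other’s topics under a data-center prefix, which prevents replication loops and is where the NYC.orders topic seen later comes from:

# runs in ATL, pulling NYC's topics
name=replicator-nyc-to-atl
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
src.kafka.bootstrap.servers=kafka-nyc:9092
topic.whitelist=orders
topic.rename.format=NYC.${topic}

# the mirror image runs in NYC with topic.rename.format=ATL.${topic}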

Page 22

Disaster Strikes

Page 23

Desired Post-Disaster State

Page 24

Only one question left:

What does it consume next?

Page 25

Kafka consumers normally use offsets

Page 26

In an ideal world…

Page 27

Unfortunately, this is not that simple:

1. There is no guarantee that offsets are identical in the two data centers. An event with offset 26 in NYC can be offset 6 or offset 30 in ATL.

2. Replication of each topic and partition is independent, so:
• Offset metadata may arrive ahead of the events themselves
• Offset metadata may arrive late

Nothing prevents you from replicating the offsets topic and using it. Just be realistic about the guarantees.

Page 28

If accuracy is no big deal…

1. If duplicates are cool, start from the beginning (see the first command below). Use cases:
• Writing to a DB
• Anything idempotent
• Sending emails or alerts to people inside the company

2. If lost events are cool, jump to the latest event (see the second command below). Use cases:
• Clickstream analytics
• Log analytics
• “Big data” and analytics use-cases
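
Both options map directly onto the consumer-groups tool shown later in this talk; the group and topic names are the ones from the failover example:

# duplicates are cool: reprocess from the beginning
bin/kafka-consumer-groups --bootstrap-server localhost:29092 \
  --group following-orders --topic NYC.orders \
  --reset-offsets --to-earliest --execute

# lost events are cool: skip to the latest event
bin/kafka-consumer-groups --bootstrap-server localhost:29092 \
  --group following-orders --topic NYC.orders \
  --reset-offsets --to-latest --execute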

Page 29

Personal Favorite: Time-based Failover

• Offsets are not identical, but… 3pm is 3pm (within clock drift)

• Relies on new features:
• Timestamps in events! (0.10.0.0)
• Time-based indexes! (0.10.1.0)
• A tool to force consumers to a timestamp! (0.11.0.0)

Page 30

How do we do it?

1. Detect that Kafka in NYC is down. Check the time of the incident.
• Even better: use an interceptor to track timestamps of events as they are consumed. Now you know the “last consumed timestamp”.

2. Run the Consumer Groups tool in ATL and set the offsets for the “following-orders” consumer to the time of the incident (or the “last consumed time”).

3. Start the “following-orders” consumer in ATL.

4. Have a beer. You just aced your annual failover drill.

Page 31

bin/kafka-consumer-groups \
  --bootstrap-server localhost:29092 \
  --reset-offsets \
  --topic NYC.orders \
  --group following-orders \
  --execute \
  --to-datetime 2017-08-22T06:00:33.236
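
One behavior of the tool worth knowing (not on the slide): if you leave off --execute, kafka-consumer-groups only prints the offsets it would assign, which makes for a safe dry run before the real failover.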

Page 32

A Few Practicalities

• Above all: practice.

• Constantly monitor replication lag. If the lag gets high enough, all the failover guarantees become useless.

• Also monitor the replicator itself for liveness, errors, etc.

• Chances are the line to the remote DC is both high latency and low throughput. Prepare to do some work tuning the replicator’s producers and consumers (see the sketch after this list).

• RTFM: http://docs.confluent.io/3.3.0/multi-dc/replicator-tuning.html

• Replicator plays nice with containers and auto-scaling. Give it a try.

• Call your legal department. You may be required to encrypt everything you replicate.

• Watch different versions of this talk. We discuss more architectures and more ops concerns.
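
As a starting point for that tuning, a hedged sketch of producer and consumer overrides for a high-latency, low-throughput link. The values are illustrative, not from the talk; the tuning guide linked above is the authority:

# replicator producer overrides: batch aggressively and compress
compression.type=lz4
linger.ms=100
batch.size=1000000
# raise the TCP send buffer toward the link's bandwidth-delay product
send.buffer.bytes=10000000

# replicator consumer overrides: fetch bigger chunks, larger TCP window
fetch.max.bytes=52428800
receive.buffer.bytes=10000000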

Page 33

Thank You!