multi-datacenter kafka - strata san jose 2017

When One Data Center Is Not Enough Building Large-scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka Gwen Shapira


Post on 11-Apr-2017


TRANSCRIPT

Page 1: Multi-Datacenter Kafka - Strata San Jose 2017

When One Data Center Is Not Enough
Building Large-scale Stream Infrastructures Across Multiple Data Centers with Apache Kafka
Gwen Shapira

Page 2: Multi-Datacenter Kafka - Strata San Jose 2017

There’s a book on that!

Actually… a chapter

Page 3: Multi-Datacenter Kafka - Strata San Jose 2017

Outline

Kafka overview
Common multi data center patterns
Future stuff

Page 4: Multi-Datacenter Kafka - Strata San Jose 2017

What is Kafka?
▪ It’s like a message queue, right?
- Actually, it’s a “distributed commit log”
- Or “streaming data platform”

[Diagram: a data source appends to a log with offsets 0-8; Data Consumer A and Data Consumer B each read from the log]

Page 5: Multi-Datacenter Kafka - Strata San Jose 2017

Topics and Partitions
▪ Messages are organized into topics, and each topic is split into partitions.
- Each partition is an immutable, time-sequenced log of messages on disk.
- Note that time ordering is guaranteed within, but not across, partitions.

[Diagram: a data source writes to a topic with three partitions (Partition 0, Partition 1, Partition 2), each a log with offsets 0-8]

Page 6: Multi-Datacenter Kafka - Strata San Jose 2017

Scalable consumption model

[Diagram: topic T1 (Partitions 0-3) consumed by Consumer Group 1, shown first with a single consumer (Consumer 1) and then with four consumers (Consumer 1 to Consumer 4)]
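The scaling idea on this slide can be sketched in a few lines of Python. This is a simplified round-robin illustration under stated assumptions, not Kafka's actual assignor; `assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.

    Simplified illustration only: real Kafka assignors run inside the
    consumer group rebalance protocol.
    """
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]  # topic T1

# One consumer in the group: it reads every partition.
one = assign_partitions(partitions, ["consumer-1"])

# Four consumers in the group: each reads exactly one partition.
four = assign_partitions(
    partitions, ["consumer-1", "consumer-2", "consumer-3", "consumer-4"]
)
print(one)
print(four)
```

Adding consumers to the group spreads the same partitions over more readers, up to one consumer per partition.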

Page 7: Multi-Datacenter Kafka - Strata San Jose 2017

Kafka usage

Page 8: Multi-Datacenter Kafka - Strata San Jose 2017

Common use case

Large scale real time data integration

Page 9: Multi-Datacenter Kafka - Strata San Jose 2017

Other use cases

Scaling databases
Messaging
Stream processing
…

Page 10: Multi-Datacenter Kafka - Strata San Jose 2017

Important things to remember:

1. Consumers commit offsets
2. Within a cluster – each partition has replicas
3. Inter-cluster replication, producer and consumer defaults – all tuned for LAN

Page 11: Multi-Datacenter Kafka - Strata San Jose 2017

Why multiple data centers (DC)

Offload work from main cluster
Disaster recovery
Geo-localization
• Saving cross-DC bandwidth
• Better performance by being closer to users
• Some activity is just local
• Security / regulations
Cloud
Special case: Producers with network issues

Page 12: Multi-Datacenter Kafka - Strata San Jose 2017

Why is this difficult?

1. It isn’t, really – you consume data from one cluster and produce to another
2. The network between two data centers can get tricky
3. Consumers have state (offsets) – syncing this between clusters gets tough
• And leads to some counterintuitive results

Page 13: Multi-Datacenter Kafka - Strata San Jose 2017

Pattern #1: stretched cluster

Typically done on AWS in a single region
• Deploy ZooKeeper and brokers across 3 availability zones
Rely on intra-cluster replication to replicate data across DCs

[Diagram: a single Kafka cluster stretched across DC 1, DC 2, and DC 3, with producers and consumers in each DC]

Page 14: Multi-Datacenter Kafka - Strata San Jose 2017

On DC failure

Producers/consumers fail over to new DCs
• Existing data preserved by intra-cluster replication
• Consumer resumes from last committed offsets and will see same data

[Diagram: the stretched Kafka cluster across DC 1, DC 2, and DC 3 after a DC failure, with producers and consumers remaining in two DCs]

Page 15: Multi-Datacenter Kafka - Strata San Jose 2017

When DC comes back

Intra-cluster replication automatically re-replicates all missing data
When re-replication completes, switch producers/consumers back

[Diagram: the Kafka cluster stretched across DC 1, DC 2, and DC 3 again, with producers and consumers in each DC]

Page 16: Multi-Datacenter Kafka - Strata San Jose 2017

Be careful with replica assignment

Don’t want all replicas in same AZ
Rack-aware support in 0.10.0
• Configure brokers in same AZ with same broker.rack
Manual assignment pre 0.10.0
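To illustrate what rack awareness buys, here is a minimal sketch that places each partition's replicas in distinct racks (AZs). It is not Kafka's real assignment algorithm; the function name, broker IDs, and AZ names are made up for the example, and it assumes the replication factor does not exceed the number of racks.

```python
from itertools import cycle


def rack_aware_assignment(brokers_by_rack, num_partitions, replication_factor):
    """Place each partition's replicas on brokers in distinct racks (AZs).

    Simplified sketch, not Kafka's real algorithm. Assumes
    replication_factor <= number of racks.
    """
    racks = sorted(brokers_by_rack)
    if replication_factor > len(racks):
        raise ValueError("need at least one rack per replica")
    # One round-robin iterator per rack so load spreads over its brokers.
    next_broker = {rack: cycle(brokers_by_rack[rack]) for rack in racks}
    assignment = {}
    for partition in range(num_partitions):
        # Rotate the starting rack so first replicas are not all in one rack.
        chosen = [
            racks[(partition + r) % len(racks)] for r in range(replication_factor)
        ]
        assignment[partition] = [next(next_broker[rack]) for rack in chosen]
    return assignment


# Hypothetical layout: the key is what each broker would set as broker.rack.
brokers_by_rack = {"az-a": [1, 2], "az-b": [3, 4], "az-c": [5, 6]}
assignment = rack_aware_assignment(
    brokers_by_rack, num_partitions=4, replication_factor=3
)
rack_of = {b: rack for rack, bs in brokers_by_rack.items() for b in bs}
for partition, replicas in assignment.items():
    print(partition, replicas, [rack_of[b] for b in replicas])
```

With one replica per AZ, losing a whole AZ still leaves two copies of every partition.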

Page 17: Multi-Datacenter Kafka - Strata San Jose 2017

Stretched cluster NOT recommended across regions

Asymmetric network partitioning
Longer network latency => longer produce/consume time
Cross region bandwidth: no read affinity in Kafka

[Diagram: one cluster stretched across region 1, region 2, and region 3, each running Kafka and ZK]

Page 18: Multi-Datacenter Kafka - Strata San Jose 2017

Pattern #2: active/passive

Producers in active DC
Consumers in either active or passive DC

[Diagram: Kafka in DC 1 with producers and consumers, replicating to Kafka in DC 2 with consumers only; labels: Critical Apps, Nice Reports]

Page 19: Multi-Datacenter Kafka - Strata San Jose 2017

Cross Datacenter Replication

Consumer & Producer: read from a source cluster and write to a target cluster
Per-key ordering preserved
Asynchronous: target always slightly behind
Offsets not preserved
• Source and target may not have same # partitions
• Retries for failed writes
Options:
• Confluent Multi-Datacenter Replication
• MirrorMaker
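A toy model makes the "per-key ordering preserved, offsets not preserved" point concrete. The sketch below is an in-memory stand-in, not MirrorMaker or any real client: `Cluster`, `replicate`, and the byte-sum partitioner are hypothetical, chosen only so the result is easy to follow by hand.

```python
class Cluster:
    """Toy in-memory stand-in for a Kafka cluster holding one topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Toy partitioner (sum of key bytes); real clients use a proper hash.
        return sum(key.encode()) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


def replicate(source, target):
    """Read each source partition in order and re-produce to the target.

    Returns {(source partition, offset): (target partition, offset)}.
    """
    mapping = {}
    for p, log in enumerate(source.partitions):
        for offset, (key, value) in enumerate(log):
            mapping[(p, offset)] = target.produce(key, value)
    return mapping


def values_for(cluster, key):
    log = cluster.partitions[cluster.partition_for(key)]
    return [v for k, v in log if k == key]


source = Cluster(num_partitions=2)
target = Cluster(num_partitions=3)  # different partition count
for key, value in [("a", "a1"), ("b", "b1"), ("c", "c1"), ("a", "a2"), ("b", "b2")]:
    source.produce(key, value)

offset_map = replicate(source, target)
print(values_for(target, "a"))  # same order as in the source
print(offset_map)               # source and target offsets differ
```

Each key still arrives in order on the target, but a consumer's committed offset in the source says nothing reliable about where to resume in the target.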

Page 20: Multi-Datacenter Kafka - Strata San Jose 2017

On active DC failure

Fail over producers/consumers to passive cluster
Challenge: which offset to resume consumption from
• Offsets not identical across clusters

[Diagram: Kafka in DC 1 with producers and consumers, replicating to Kafka in DC 2]

Page 21: Multi-Datacenter Kafka - Strata San Jose 2017

Solutions for switching consumers

Resume from smallest offset
• Duplicates

Resume from largest offset
• May miss some messages (likely acceptable for real time consumers)

Replicate offsets topic
• May miss some messages, may get duplicates

Set offset based on timestamp
• Old API hard to use and not precise
• Better and more precise API in Apache Kafka 0.10.1 (Confluent 3.1)
• Nice tool coming up!

Preserve offsets during replication
• Harder to do
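The timestamp-based option can be sketched as a lookup of the earliest offset at or after a given time, which is roughly what the Java consumer's `offsetsForTimes` call (added in 0.10.1) returns per partition. The code below is a standalone illustration, not that API: `offset_for_time`, the sample timestamps, and the safety margin are assumptions, and it assumes timestamps never decrease within a partition.

```python
from bisect import bisect_left


def offset_for_time(timestamps, target_ts):
    """Return the earliest offset whose timestamp is >= target_ts, or None.

    timestamps[i] is the timestamp (ms) of the message at offset i.
    Assumes timestamps are non-decreasing within the partition, which
    real logs do not strictly guarantee.
    """
    offset = bisect_left(timestamps, target_ts)
    return offset if offset < len(timestamps) else None


# Message timestamps of one partition in the passive cluster (made-up data).
passive_partition = [1000, 1500, 1500, 2100, 2600, 3000]

# The consumer remembers the timestamp of the last message it processed in
# the failed cluster and rewinds a little to be safe.
last_processed_ts = 2100
safety_margin_ms = 500
resume_at = offset_for_time(passive_partition, last_processed_ts - safety_margin_ms)
print(resume_at)
```

Rewinding by a margin trades a few duplicates for a lower chance of missing messages; the right margin depends on replication lag and clock behavior.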

Page 22: Multi-Datacenter Kafka - Strata San Jose 2017

When DC comes back

Need to reverse replication
• Same challenge: determining the offsets

[Diagram: Kafka in DC 1 and Kafka in DC 2 connected by replication, with producers and consumers]

Page 23: Multi-Datacenter Kafka - Strata San Jose 2017

Limitations

Reconfiguration of replication after failover
Resources in passive DC underutilized

Page 24: Multi-Datacenter Kafka - Strata San Jose 2017

Pattern #3: active/active

Local-to-aggregate replication to avoid cycles
Producers/consumers in both DCs
• Producers only write to local clusters

[Diagram: DC 1 and DC 2 each run a local Kafka cluster and an aggregate Kafka cluster, linked by replication, with producers and consumers in both DCs]

Page 25: Multi-Datacenter Kafka - Strata San Jose 2017

On DC failure

Same challenge on moving consumers on aggregate cluster
• Offsets in the 2 aggregate clusters not identical
• Unless the consumers are continuously running in both clusters

[Diagram: DC 1 and DC 2 each run a local Kafka cluster and an aggregate Kafka cluster, linked by replication, with producers and consumers in both DCs]

Page 26: Multi-Datacenter Kafka - Strata San Jose 2017

[Diagram: an SF Kafka cluster and a Houston Kafka cluster, each running all apps; West coast users and South Central users]

Page 27: Multi-Datacenter Kafka - Strata San Jose 2017

When DC comes back

No need to reconfigure replication

[Diagram: DC 1 and DC 2 each run a local Kafka cluster and an aggregate Kafka cluster, linked by replication, with producers and consumers in both DCs]

Page 28: Multi-Datacenter Kafka - Strata San Jose 2017

Alternative: avoid aggregate clusters

Prefix topic names with DC tag
Configure replication to replicate remote topics only
Consumers need to subscribe to topics with both DC tags

[Diagram: Kafka in DC 1 and Kafka in DC 2, each with its own producers and consumers, connected by replication]
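The naming scheme above can be sketched with plain regular expressions. The `dc1`/`dc2` tags, the `.` separator, and the helper names are assumptions for illustration; the slide does not prescribe a specific format.

```python
import re

LOCAL_DC = "dc1"
ALL_DCS = ["dc1", "dc2"]


def local_topic(base_name, dc=LOCAL_DC):
    """Name that producers in `dc` write to, e.g. 'dc1.clicks'."""
    return f"{dc}.{base_name}"


def replication_whitelist(local_dc, all_dcs):
    """Regex matching topics that originate in remote DCs only.

    Pulling only remote-tagged topics into the local cluster keeps a
    message from being copied back and forth in a cycle.
    """
    remote = [re.escape(dc) for dc in all_dcs if dc != local_dc]
    return re.compile(r"^(?:%s)\..+$" % "|".join(remote))


def consumer_pattern(base_name, all_dcs):
    """Regex a consumer subscribes with to see the topic from every DC."""
    prefixes = "|".join(re.escape(dc) for dc in all_dcs)
    return re.compile(r"^(?:%s)\.%s$" % (prefixes, re.escape(base_name)))


whitelist = replication_whitelist(LOCAL_DC, ALL_DCS)
pattern = consumer_pattern("clicks", ALL_DCS)

topics = ["dc1.clicks", "dc2.clicks", "dc1.orders"]
replicated_into_dc1 = [t for t in ["dc1.clicks", "dc2.clicks"] if whitelist.match(t)]
subscribed = [t for t in topics if pattern.match(t)]
print(replicated_into_dc1)
print(subscribed)
```

The replication filter for DC 1 only ever matches topics tagged with another DC, while consumers match the same base name under every tag.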

Page 29: Multi-Datacenter Kafka - Strata San Jose 2017
Page 30: Multi-Datacenter Kafka - Strata San Jose 2017

Beyond 2 DCs

More DCs: better resource utilization
• With 2 DCs, each DC needs to provision for 100% of traffic
• With 3 DCs, each DC only needs to provision for 50% of traffic

Setting up replication with many DCs can be daunting
• Only set up aggregate clusters in 2-3
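The 100% and 50% figures follow from simple arithmetic: if one DC fails, the survivors must carry all traffic between them. A small sketch, assuming traffic is spread evenly and a single DC failure is the case being planned for (the function name is hypothetical):

```python
def capacity_per_dc(num_dcs, tolerated_failures=1):
    """Fraction of total traffic each DC must be able to handle.

    Assumes traffic is spread evenly and that the surviving DCs absorb
    the load of the failed ones.
    """
    survivors = num_dcs - tolerated_failures
    if survivors < 1:
        raise ValueError("at least one DC must survive")
    return 1.0 / survivors


for n in (2, 3, 4):
    print(f"{n} DCs: provision {capacity_per_dc(n):.0%} of total traffic per DC")
```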

Page 31: Multi-Datacenter Kafka - Strata San Jose 2017

Comparison

Stretched
Pros:
• Better utilization of resources
• Easy failover for consumers
Cons:
• Still need cross region story

Active/passive
Pros:
• Needed for global ordering
Cons:
• Harder failover for consumers
• Reconfiguration during failover
• Resource under-utilization

Active/active
Pros:
• Better utilization of resources
• Can be used to avoid consumer failover
Cons:
• Can be challenging to manage
• More replication bandwidth

Page 32: Multi-Datacenter Kafka - Strata San Jose 2017

Multi-DC beyond Kafka

Kafka often used together with other data stores
Need to make sure multi-DC strategy is consistent

Page 33: Multi-Datacenter Kafka - Strata San Jose 2017

Example application

Consumer reads from Kafka and computes 1-min count
Counts need to be stored in DB and available in every DC
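As a rough sketch of the counting step only (no Kafka client or database involved), the consumer could bucket events into 1-minute windows by event timestamp. The event shape `(timestamp_ms, key)` and the function name are assumptions for illustration.

```python
from collections import Counter

WINDOW_MS = 60_000  # 1-minute windows


def one_minute_counts(events):
    """Count events per (key, window start) from (timestamp_ms, key) pairs."""
    counts = Counter()
    for timestamp_ms, key in events:
        window_start = timestamp_ms - (timestamp_ms % WINDOW_MS)
        counts[(key, window_start)] += 1
    return counts


events = [
    (5_000, "page-a"),
    (59_999, "page-a"),
    (60_000, "page-a"),
    (61_000, "page-b"),
]
counts = one_minute_counts(events)
print(dict(counts))
```

Because each row is identified by key and window start, two consumers that process the same messages, one per DC, would arrive at the same rows regardless of arrival order.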

Page 34: Multi-Datacenter Kafka - Strata San Jose 2017

Independent database per DC

Run same consumer concurrently in both DCs
• No consumer failover needed

[Diagram: DC 1 and DC 2, each with a local and an aggregate Kafka cluster linked by replication, producers, one consumer, and one DB]

Page 35: Multi-Datacenter Kafka - Strata San Jose 2017

Stretched database across DCs

Only run one consumer per DC at any given point of time

[Diagram: DC 1 and DC 2, each with a local and an aggregate Kafka cluster linked by replication, producers, and a DB; labels: consumer, consumer, "on failover"]

Page 36: Multi-Datacenter Kafka - Strata San Jose 2017

Practical tips

• Consume remote, produce local
- Unless you need encrypted data on the wire
• Monitor!
- Burrow for replication lag
- Confluent Control Center for end-to-end
- JMX metrics for rates and “busy-ness”
• Tune!
- Producer / Consumer tuning
- Number of consumers, producers
- TCP tuning for WAN
• Don’t forget to replicate configuration
• Separate critical topics from nice-to-have topics
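The quantity behind replication-lag monitoring is simple: how far the replicator's committed offsets trail the end of the source log. The sketch below computes it from two plain dictionaries; the function names, offsets, and threshold are made up, and in practice a tool such as Burrow gathers these numbers from the cluster.

```python
def replication_lag(source_end_offsets, committed_offsets):
    """Per-partition lag: source messages the replicator has not yet consumed.

    source_end_offsets: {partition: log end offset in the source cluster}
    committed_offsets:  {partition: offset committed by the replicator's group}
    """
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in source_end_offsets.items()
    }


def partitions_over(lag, threshold):
    """Partitions whose lag exceeds the alerting threshold."""
    return sorted(p for p, n in lag.items() if n > threshold)


end_offsets = {0: 1200, 1: 980, 2: 1500}
committed = {0: 1195, 1: 980, 2: 900}

lag = replication_lag(end_offsets, committed)
alerts = partitions_over(lag, threshold=100)
print(lag)
print(alerts)
```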

Page 37: Multi-Datacenter Kafka - Strata San Jose 2017

Future work

Offset reset tool
Offset preservation
“Remote Replicas”
2-DC stretch cluster

Other cool Kafka future:
• Exactly Once
• Transactions
• Headers

Page 38: Multi-Datacenter Kafka - Strata San Jose 2017

THANK YOU!
Gwen Shapira | [email protected] | @gwenshap

Kafka Training with Confluent University
• Kafka Developer and Operations Courses
• Visit www.confluent.io/training

Want more Kafka?
• Download Confluent Platform Enterprise at http://www.confluent.io/product
• Apache Kafka 0.10.2 upgrade documentation at http://docs.confluent.io/3.2.0/upgrade.html
• Kafka Summit recordings now available at http://kafka-summit.org/schedule/

Page 39: Multi-Datacenter Kafka - Strata San Jose 2017

Discount code: kafstrata
Special Strata Attendee discount code = 25% off www.kafka-summit.org
Kafka Summit New York: May 8
Kafka Summit San Francisco: August 28
