apache kafka reliability guarantees stratahadoop nyc 2015

42
When it absolutely, positively, has to be there Reliability Guarantees in Apache Kafka @jeffholoman @gwenshap Gwen Shapira Confluent Jeff Holoman Cloudera

Upload: jeff-holoman

Post on 21-Apr-2017

2.026 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

When it absolutely, positively,

has to be thereReliability Guarantees

in Apache Kafka

@jeffholoman @gwenshap

Gwen ShapiraConfluent

Jeff HolomanCloudera

Page 2: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Kafka High Throughput Low Latency Scalable Centralized Real-time

Page 3: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

“If data is the lifeblood of high technology, Apache Kafka is the

circulatory system”

--Todd PalinoKafka SRE @ LinkedIn

Page 4: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

If Kafka is a critical piece of our pipeline Can we be 100% sure that our data will get there? Can we lose messages? How do we verify? Who’s fault is it?

Page 5: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Distributed Systems Things Fail Systems are designed to

tolerate failure

We must expect failures and design our code and configure our systems to handle them

Page 6: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Network

Broker MachineClient Machine

Data Flow

Kafka ClientBroker

O/S Socket Buffer

NIC

NIC

Page Cache

Disk

Application Thread

O/S Socket Buffer

async

callback

✗✗

✗✗✗ data

ack / exception

Page 7: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Client Machine

Kafka Client

O/S Socket Buffer

NIC

Application Thread

✗✗Broker Machine

Broker

NIC

Page Cache

Disk

O/S Socket Buffer

miss

✗Network

Data Flow

data

offsets

ZKKafka✗

Page 8: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication is your friend Kafka protects against failures by replicating data The unit of replication is the partition One replica is designated as the Leader Follower replicas fetch data from the leader The leader holds the list of “in-sync” replicas

Page 9: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication and ISRs

0

1

2

0

1

2

0

1

2

Producer

Broker 100

Broker 101

Broker 102

Topic:Partitions

:Replicas:

my_topic33

Partition:

Leader:ISR:

1101

100,102

Partition:

Leader:ISR:

2102

101,100

Partition:

Leader:ISR:

0100

101,102

Page 10: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

ISR 2 things make a replica in-sync

- Lag behind leader- replica.lag.time.max.ms – replica that didn’t fetch or is behind - replica.lag.max.messages – will go away in 0.9

- Connection to Zookeeper

Page 11: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Terminology Acked

- Producers will not retry sending. - Depends on producer setting

Committed- Consumers can read. - Only when message got to all ISR.

replica.lag.time.max.ms - how long can a dead replica prevent

consumers from reading?

Page 12: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication Acks = all

- only waits for in-sync replicas to reply.

Replica 3100

Replica 2100

Replica 1100

Time

Page 13: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 2100

101

Replica 1100

101

Time

Replica 3 stopped replicating for some reason

Acked in acks = all

“committed”

Acked in acks = 1

but not “committed”

Page 14: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 2100

101

Replica 1100

101

Time

One replica drops out of ISR, or goes offline All messages are now acked and committed

Page 15: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 1100

101102103104Time

2nd Replica drops out, or is offline

Page 16: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Time

Now we’re in trouble

Page 17: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 3100

Replica 2100

101

Time

All those are “acked” and “committed”

Page 18: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

So what to do Disable Unclean Leader Election

- unclean.leader.election.enable = false Set replication factor

- default.replication.factor = 3 Set minimum ISRs

- min.insync.replicas = 2

Page 19: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Warning min.insync.replicas is applied at the topic-level. Must alter the topic configuration manually if created before the server level change Must manually alter the topic < 0.9.0 (KAFKA-2114)

Page 20: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication Replication = 3 Min ISR = 2

Replica 3100

Replica 2100

Replica 1100

Time

Page 21: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 2100

101

Replica 1100

101

Time

One replica drops out of ISR, or goes offline

Page 22: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Replication

Replica 1100

101102103104

Time

2nd Replica fails out, or is out of sync

Buffers in

Producer

Page 23: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015
Page 24: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Producer Internals Producer sends batches of messages to a buffer

M3Application

ThreadApplication

ThreadApplication

Threadsend()

M2 M1 M0Batch 3Batch 2Batch 1

Fail? response

retry

Update Future

callback

drain

Metadata orException

Page 25: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Basics Durability can be configured with the producer configuration request.required.acks- 0 The message is written to the network (buffer)- 1 The message is written to the leader- all The producer gets an ack after all ISRs receive the data; the message is

committed

Make sure producer doesn’t just throws messages away!- block.on.buffer.full = true

Page 26: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

All calls are non-blocking async 2 Options for checking for failures:

- Immediately block for response: send().get()- Do followup work in Callback, close producer after error threshold

- Be careful about buffering these failures. Future work? KAFKA-1955- Don’t forget to close the producer! producer.close() will block until in-flight txns

complete retries (producer config) defaults to 0 message.send.max.retries (server config) defaults to 3 In flight requests could lead to message re-ordering

Page 27: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015
Page 28: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer Two choices for Consumer API

- Simple Consumer- High Level Consumer

Page 29: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Page 30: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Commit?

Page 31: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Commit?

Page 32: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Auto-commit

enabled

✗Commit

Page 33: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Auto-commit

enabled

Page 34: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Auto-commit

enabled Consumer

Picks up here

Page 35: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Commit

Page 36: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer OffsetsP0 P2 P3 P4 P5 P6

Consumer

Thread 1 Thread 2 Thread 3 Thread 4

Commit

Offset commits

for all threads

Page 37: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

P0 P2 P3 P4 P5 P6

Consumer 1

Consumer 2

Consumer 3

Consumer 4

Consumer Offsets

Auto-commit

DISABLED

Commit

Page 38: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Consumer Recommendations Set autocommit.enable = false Manually commit offsets after the message data is processed / persisted

consumer.commitOffsets(); Run each consumer in it’s own thread

Page 39: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

New Consumer! No Zookeeper! At all! Rebalance listener Commit:

- Commit- Commit async- Commit( offset)

Seek(offset)

Page 40: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Exactly Once Semantics At most once is easy At least once is not bad either – commit after 100% sure data is safe Exactly once is tricky

- Commit data and offsets in one transaction- Idempotent producer

Page 41: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Monitoring for Data Loss Monitor for producer errors – watch the retry numbers Monitor consumer lag – MaxLag or via offsets Standard schema:

- Each message should contain timestamp and originating service and host Each producer can report message counts and offsets to a special topic “Monitoring consumer” reports message counts to another special topic “Important consumers” also report message counts Reconcile the results

Page 42: Apache Kafka Reliability Guarantees StrataHadoop NYC 2015

Be Safe, Not Sorry Acks = all Block.on.buffer.full = true Retries = MAX_INT ( Max.inflight.requests.per.connect = 1 ) Producer.close() Replication-factor >= 3 Min.insync.replicas = 2 Unclean.leader.election = false Auto.offset.commit = false Commit after processing Monitor!