Kafka blr-meetup-presentation - Kafka internals


Posted on 26-Jan-2017






Kafka - LinkedIn's messaging backbone

Kafka Internals
Ayyappadas Ravindran, LinkedIn Bangalore SRE Team

Introduction
Who am I?
Ayyappadas Ravindran, Staff SRE at LinkedIn, responsible for the Data Infra Streaming team
What is this talk about?
Kafka building blocks in detail
Operating Kafka
Data assurance with Kafka
Kafka 0.9

Agenda
Kafka Reminder!
Zookeeper
Kafka Cluster - Brokers
Kafka Message
Producers
Schema Registry
Consumers
Data Assurance
What is new in Kafka (Kafka 0.9)
Q & A


Kafka Pub/Sub Basics - Reminder!

(Diagram: a producer and a consumer connected to a broker holding topic A, partitions 0 and 1, coordinated by Zookeeper)

Kafka is a publish-subscribe messaging system with four components:
Broker (what we call the Kafka server)
Zookeeper (which serves as a data store for information about the cluster and consumers)
Producer (sends data into the system)
Consumer (reads data out of the system)

Data is organized into topics (here we show a topic named A) and topics are split into partitions (we have partitions 0 and 1 here). A message is a discrete unit of data within Kafka. Producers create messages and send them into the system. The broker stores them, and any number of consumers can then read those messages.

In order to provide scalability, we have multiple brokers. By spreading out the partitions, we can handle more messages in any topic. This also provides redundancy: we can now replicate partitions on separate brokers. When we do this, one broker is the designated leader for each partition. This is the only broker that producers and consumers connect to for that partition. The brokers that hold the replicas are designated followers, and all they do with the partition is keep it in sync with the leader.

When a broker fails, one of the brokers holding an in-sync replica takes over as the leader for the partition. The producer and consumer clients have logic built in to automatically rebalance and find the new leader when the cluster changes like this. When the original broker comes back online, it gets its replicas back in sync, and then it functions as a follower. It does not become the leader again until something else happens to the cluster (such as a manual change of leaders, or another broker going offline).


Zookeeper
Distributed coordination service
Also used for maintaining configuration
Guarantees: order, atomicity, reliability
Simple API
Hierarchical namespace
Ephemeral nodes
Watches

In the previous slide we saw that Zookeeper is an integral part of the Kafka ecosystem, so let's see what Zookeeper is. Zookeeper is a distributed coordination service for distributed applications; it is also used for configuration maintenance. Zookeeper exposes simple APIs, with which an application can build higher-level coordination services, and it guarantees ordering, atomicity and reliability. Zookeeper is implemented as a shared hierarchical namespace, in the model of a shared Linux file system. Every node in Zookeeper is called a znode. A znode is similar to a file: it stores data, and it has ACL and stat information. Two important concepts in the Zookeeper ecosystem are ephemeral nodes and watches. An ephemeral node exists only as long as the session that created it exists. Clients can set watches on a znode, and a client is informed when there is a change in the znode.

Now let's quickly see the coordination service in action: a leader election. Consider multiple clients competing to become the leader; the challenge is how to elect one. Zookeeper can be used for leader election. Znodes are created with the SEQUENCE and EPHEMERAL flags set. With the SEQUENCE flag, each node is created with a monotonically increasing number appended to the end of the path. The client which manages to create the znode with the lowest sequence number is elected leader. Because the znode is ephemeral, it exists only as long as the leader exists.
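The sequential-znode election described above can be sketched as a small in-memory simulation (the `FakeZookeeper` class is purely illustrative; a real client would talk to a Zookeeper ensemble, e.g. via the Kazoo library):

```python
# Minimal sketch of Zookeeper-style leader election with SEQUENCE znodes:
# each candidate creates a node with a monotonically increasing suffix,
# and the candidate holding the lowest sequence number is the leader.

import itertools

class FakeZookeeper:
    """Hands out monotonically increasing sequence numbers, simulating
    znode creation with the SEQUENCE flag under /election/."""
    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}          # znode path -> owning client name

    def create_sequential(self, client):
        path = "/election/n_%010d" % next(self._seq)
        self.nodes[path] = client
        return path

    def leader(self):
        # The client owning the lowest sequence path is the leader.
        return self.nodes[min(self.nodes)]

zk = FakeZookeeper()
for client in ["broker-1", "broker-2", "broker-3"]:
    zk.create_sequential(client)

print(zk.leader())   # broker-1 registered first, so it leads
```

If the leader's ephemeral znode disappears (session lost), the next-lowest sequence automatically becomes the leader.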


Zookeeper in the Kafka ecosystem
Used to store metadata information:
About brokers
About topics & partitions
About consumers / consumer groups
Service coordination:
Controller election
For administrative tasks

Ref : https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper

Now let's see how Zookeeper is used in a Kafka environment. Kafka uses Zookeeper both for storing configuration information and for coordination (leader election and executing administrative tasks). Zookeeper stores metadata about brokers, topics and consumers. When a broker comes up, it registers itself with ZK: it creates a znode and stores the broker ID, hostname and endpoint details in it. Two types of topic-related information are stored in ZK. The first is broker-related topic information: which broker hosts which topic/partition, replication information, and which replicas are leaders and which are followers. The second is topic configuration: per-topic settings such as retention and clean-up policies. Zookeeper also stores consumer information, such as which consumers are consuming from which partition and up to what point in the log a consumer has consumed, i.e. offset information. Coming to the coordination service: one of the brokers in a Kafka cluster assumes the role of controller. The controller is responsible for managing the state of brokers, partitions and replicas, and also performs the administrative tasks. Controller election is done using Zookeeper.

Zookeeper at LinkedIn
We are running Zookeeper 3.4
Cluster of 5 (participants) + 1 (observer)
Network and power redundancy
Transaction logs on SSD
Lesson learned: do not over build your cluster

We run Zookeeper 3.4. It is used not just for Kafka but also for other critical applications at LinkedIn. We have a cluster size of 5 + 1, where 5 are voting members and 1 is a non-voting member called an observer. The primary role of the observer is disaster recovery; it also helps with read scalability. We make sure the nodes are in different racks to ensure power redundancy, and we use bond0 (balance-rr bonding), which provides load balancing and fault tolerance. If your system is write heavy it is good to have better disk performance: we keep transaction logs on SSD. At minimum, keep transaction logs on a separate drive from the one used for application logs and snapshots. Finally, do not over build your cluster: as the cluster size increases, the latency of ZK write transactions increases.
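The "do not over build" advice follows from quorum arithmetic: an ensemble of N voting members needs a majority to commit each write, so more voters means more acknowledgments per transaction. A quick sketch (observers are excluded because they do not vote):

```python
# Quorum math for a Zookeeper ensemble of N voting members: a write
# commits once a majority (N // 2 + 1) acknowledges, so the ensemble
# tolerates N - majority voter failures.

def quorum(voters):
    return voters // 2 + 1

def tolerated_failures(voters):
    return voters - quorum(voters)

for n in (3, 5, 7):
    print("voters=%d quorum=%d tolerates=%d failures"
          % (n, quorum(n), tolerated_failures(n)))
```

A 5-voter ensemble (as described above) commits with 3 acks and survives 2 voter failures; going to 7 voters only buys one extra failure while adding write latency.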

Kafka Cluster - Brokers

Brokers
Run Kafka
Store commit logs
Why cluster?
Redundancy and fault tolerance
Horizontal scalability
Improves reads and writes; better network usage & disk IO
Controller: a special broker

Alright, we have seen Zookeeper; now let's talk about brokers. Brokers are the nodes which run the Kafka process, and they store the commit logs for topics/partitions. Brokers register themselves with Zookeeper when they start, and multiple brokers form a Kafka cluster. A cluster is good because it provides redundancy (replicas) and fault tolerance; the default replication factor we use is 2, so we can afford one node failure. A cluster can be horizontally scaled: when you want to expand, you add more brokers, and with multiple machines you get better network usage and disk IO. The controller is a broker with additional responsibility. The controller is the brain of the cluster; it is a state machine. We keep the state in Zookeeper, and when there is a state change the controller acts on it. The controller manages the brokers, takes care of partitions and replication, and performs administrative tasks.

Kafka Message
Distributed, partitioned, replicated commit log
Messages:
Fixed-size header
Variable-length payload (byte array)
Payload can hold any serialized data; LinkedIn uses Avro
Commit logs:
Stored in sequence files under folders named with the topic name
Contain a sequence of log entries

As said earlier, in Kafka a message is a discrete unit of data. Messages are stored in commit logs, which are distributed, partitioned and replicated. A message contains a header and a payload. The header contains information such as the size of the payload, a CRC32 checksum and the compression used (snappy or gzip). Leaving the payload as a byte array gives a lot of flexibility; at LinkedIn our messages are Avro-formatted. Commit logs are stored in sequence files, which live under folders named after the topic-partition, and each sequence file contains log entries.

Kafka Message - continued
Logs:
A log entry (message) has a 4-byte header followed by an N-byte message
The offset is a 64-bit integer
The offset gives the position of a message from the start of the stream
On disk, log files are saved as segment files
Segment files are named with the first offset of the messages in that file, e.g. 00000000000.kafka

The header size is 4 bytes; the payload can be of variable size (at LinkedIn we cap it at 1 MB). Messages in the commit log are identified by an offset. An offset is a 64-bit integer that represents the position of the message from the start of the stream, i.e. the start of that topic partition. Segments are named after the first offset in the segment.
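Because each segment file is named after its first offset, locating a message is a binary search over the sorted base offsets. A small sketch (the base offsets and the zero-padded file name are illustrative, following the naming convention on the slide):

```python
# Sketch: find which segment file holds a given offset. Segments are
# named after the first offset they contain, so a binary search over
# the sorted base offsets finds the right file.

import bisect

segment_bases = [0, 368769, 737337]          # first offset in each segment

def segment_for(offset):
    """Return the base offset of the segment holding `offset`."""
    i = bisect.bisect_right(segment_bases, offset) - 1
    return segment_bases[i]

def segment_name(base):
    # Zero-padded file name, as in the slide's 00000000000.kafka example.
    return "%020d.kafka" % base

print(segment_name(segment_for(500000)))     # falls in the second segment
```

Offset 500000 lands in the segment starting at 368769, so only that one file needs to be opened.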

Kafka Message - continued
Writes to logs:
Append to the latest segment file
The OS flushes messages to disk based either on the number of messages or on time
Reads from logs:
The consumer provides an offset & a chunk size
Kafka returns an iterator to iterate over the message set
On failure, consumers can start consuming either from the start of the stream or from the latest offset

Writes happen at the tail end of the latest segment. Messages are written to the OS page cache and flushed to disk based either on the number of messages or on a period of time. When reading from the log, consumers provide an offset and a chunk size, and Kafka returns an iterator over a message set. Ideally the chunk will contain multiple messages, but there is a corner case in which a message is larger than the chunk size provided; in that case the consumer doubles its chunk size and retries. On consumer failure (here failure means the consumer trying to fetch an offset which doesn't exist), the consumer can fail, or it has the option to reset its offset to the start or to the latest offset.
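The chunk-doubling corner case above can be sketched as a fetch loop over an in-memory log (the `fetch` helper and the (size, payload) log entries are illustrative, not a real client API):

```python
# Sketch: a consumer asks for a chunk of bytes; if the next message is
# larger than the chunk, it doubles the chunk size and retries until
# the message fits.

def fetch(log, offset, chunk_size):
    """Return (messages, next_offset): as many whole messages starting
    at `offset` as fit in `chunk_size` bytes. Entries are (size, payload)."""
    while True:
        out, used, pos = [], 0, offset
        while pos < len(log) and used + log[pos][0] <= chunk_size:
            used += log[pos][0]
            out.append(log[pos][1])
            pos += 1
        if out or pos >= len(log):
            return out, pos
        chunk_size *= 2              # message too big for chunk: double, retry

log = [(100, "a"), (100, "b"), (4000, "big"), (100, "c")]
msgs, nxt = fetch(log, 0, 1024)
print(msgs, nxt)                     # ['a', 'b'] — stops before the big one
msgs, nxt = fetch(log, nxt, 1024)
print(msgs, nxt)                     # ['big'] — after doubling 1024 -> 4096
```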

Message Retention
Kafka retains and expires messages via three options:
Time-based (the default, which keeps messages for at least 168 hours)
Size-based (configurable amount of messages per partition)
Key-based (one message is retained for each discrete key)

Time and size retention can work together, but not with key-based retention. With both time and size configured, messages are retained until either the size limit or the time limit is reached, whichever comes first.

Retention can be overridden per-topic Use the kafka-topics.sh CLI to set these configs

Now, the commit logs keep getting data, and at some point this is going to fill your disk, so you need a retention policy to rotate and purge the log. Kafka provides two clean-up policies: you can either rotate the logs or compact them. Rotation can happen based on time or size. Log compaction is interesting: instead of purging an entire segment of the log, we remove entries having the same key and retain only the latest entry. Compaction can only happen with semantic partitioning. You can have a per-topic retention policy; Kafka ships a CLI which can be used to set this value.
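The compaction behavior described above — keep only the latest entry per key — can be sketched in a few lines (an illustrative simulation, not Kafka's actual log cleaner):

```python
# Sketch of log compaction: the log converges to a snapshot of the
# latest value per key, rather than dropping whole segments.

def compact(log):
    """`log` is a list of (key, value) pairs in append order; keep only
    the last value written for each key (keys stay in first-seen order)."""
    latest = {}
    for key, value in log:
        latest[key] = value          # later writes overwrite earlier ones
    return list(latest.items())

log = [("user1", "v1"), ("user2", "v1"), ("user1", "v2"), ("user1", "v3")]
print(compact(log))                  # [('user1', 'v3'), ('user2', 'v1')]
```

This is why compaction needs semantic (keyed) partitioning: all writes for a key must land in the same partition's log for the latest value to be meaningful.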

Kafka Producer
A producer publishes messages to a topic
metadata.broker.list
serializer.class
partitioner.class
request.required.acks (0, 1, -1)
topics
Partition strategy:
DefaultPartitioner - round robin
DefaultPartitioner with keyed messages - hashing

Ref : https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example

An application which writes data into a Kafka topic is called a producer. The producer needs the details of a broker from which it can fetch metadata; the metadata contains information about the brokers and the broker ID where the leader partition for the topic resides. The serialization class is pluggable: you can specify an encoder class, and at LinkedIn we use Avro serialization. The partitioner class specifies how messages should be partitioned, i.e. to which partition a message should be written. request.required.acks specifies whether the producer needs to wait for an ack from the broker. It has 3 values: 0 - don't wait; 1 - get an ack from at least the leader; -1 (or all) - get acks from all the followers as well.
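The two partitioning strategies on the slide can be sketched as follows (a simulation only: the round-robin cycle and the CRC32 hash here stand in for the real partitioner, which may use a different hash function):

```python
# Sketch of producer partitioning: keyed messages hash to a fixed
# partition (semantic partitioning), keyless messages round-robin.

import itertools
import zlib

NUM_PARTITIONS = 4
_rr = itertools.cycle(range(NUM_PARTITIONS))

def partition_for(key=None):
    if key is None:
        return next(_rr)                         # keyless: round-robin
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# The same key always lands on the same partition:
print(partition_for("user-42"), partition_for("user-42"))
# Keyless messages spread across all partitions:
print([partition_for() for _ in range(4)])
```

The keyed case is what makes per-key ordering (and log compaction) possible: every message for "user-42" is appended to the same partition's log.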

Kafka Producer - Continued
Message batching
Compression (gzip, snappy & lz4)
Sticky partition
CLI
Create a topic:
bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic newtopic --replication-factor 1 --partitions 1

Produce messages:
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic newtopic

Ref : https://cwiki.apache.org/confluence/display/KAFKA/Clients

When the producer sends messages to a broker, you can ask the producer to batch multiple messages and send them in one go. This way you can compress the messages, and there is less overhead in terms of creating connections to the broker. Different types of compression are supported: gzip, snappy and lz4. Sticky partitioning is specific to LinkedIn; it makes sure we send messages to only one partition for a given period of time, which reduces the connection count.

Schema Registry

So, we talked about messages. At LinkedIn we send messages in Avro format. An Avro message contains the schema of the data plus the data itself, stored in binary (serialized) form. LinkedIn's custom producer adds extra information to each message for tracking and auditing purposes. To save on storage and on the network, the schema is stripped off the message and stored in a centralized location; the message carries a schema ID used to retrieve the schema. When a consumer wants to read the data, it retrieves the schema from the schema registry and decodes the message. The schema is cached locally to reduce load on the schema registry. We don't want to break existing consumers, so old backward-compatible schemas are also kept in the schema registry.

Kafka Consumer
Consumers are the processes subscribed to a topic that process the feed
High-level consumer:
Multi-threaded
Manages offsets for you
Simple consumer:
Greater control over consumption
Need to manage offsets
Need to find the broker for the leader partition

Consumers are the processes responsible for consuming from topics: they subscribe to a topic, consume the messages and process them. Kafka offers a consumer abstraction called the consumer group. A consumer group has one or more consumer instances; multiple instances label themselves with the same group name. Traditionally, consumers work either in queue mode or in pub-sub mode: in queue mode each message is sent to one consumer instance, while in pub-sub mode a message is sent to all instances. Messaging systems guarantee ordering, but when messages are delivered to multiple consumers asynchronously they may not be received in order. The workaround is to use one single consumer, but then there is no parallel consumption. Kafka solves this via partitions: for an N-partition topic you can have N consumers, so each consumer consumes from one partition. This guarantees ordering per partition, and with multiple partitions you get parallelism. High-level consumers are multi-threaded, manage offsets for you, support consumer groups, and rebalance when a consumer instance joins or leaves the group. The simple consumer gives you greater control: you can read a subset of partitions, and messages can be read repeatedly. The drawback is that you need to manage offsets yourself and find the leader partitions.
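The partition-to-consumer mapping described above can be sketched with a simple modulo assignment (illustrative only; real group rebalancing uses range or round-robin assignors, but the invariant is the same: each partition has exactly one owner within the group):

```python
# Sketch: N partitions spread across a consumer group of M instances.
# Each partition is owned by exactly one group member; if M > N, some
# consumers sit idle.

def assign(partitions, consumers):
    """Return {consumer: [owned partitions]}."""
    owners = {c: [] for c in consumers}
    for p in partitions:
        owners[consumers[p % len(consumers)]].append(p)
    return owners

print(assign(range(6), ["c0", "c1", "c2"]))
# Each of the 6 partitions has one owner; each consumer reads 2.
print(assign(range(2), ["c0", "c1", "c2"]))
# Only 2 partitions: 'c2' owns nothing and stays idle.
```

This is why the partition count bounds the parallelism of a consumer group.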

Kafka Consumer - continued
Important options to provide while consuming:
Zookeeper details
Topic name
Where to start consuming (from the beginning or from the tail)
auto.offset.reset
group.id
auto.commit.enable (true)
Console consumer:
Helps in debugging issues & can be used inside an application
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic mytopic --from-beginning

Consumers keep their offset information in Zookeeper; this is the point up to which a particular consumer has consumed, so if that consumer thread dies, it knows exactly where to resume. Obviously you need to tell the consumer which topic to consume from, and you have the option to start at a particular offset, or at the start or end of the stream. auto.offset.reset: remember that consumers store offsets in Zookeeper. If the consumer cannot fetch the offset from Zookeeper, or provides an offset which doesn't exist, this setting controls the default behavior: whether the consumer should consume from the tail end or from the beginning. The consumer group is an abstraction that lets consumer instances consume messages either in queuing fashion or in pub-sub mode; group.id sets the consumer group name, and takes a string that should be unique per group. auto.commit.enable: consumers store the offset up to which they have consumed; enabling this makes the consumer commit offsets automatically.

Basic Kafka operations
Add a topic:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --create --topic newtopic --partitions 10 --replication-factor 3 --config x=y
Modify a topic:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic newtopic --partitions 20
(beware: this may impact semantically partitioned topics)
Modify configuration:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic newtopic --config x=y
Delete configuration:
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic newtopic --deleteConfig x

Kafka ships command line tools to manage Kafka clusters. These tools are used for maintenance and debugging; we will quickly go through them.

Basic Kafka operations - continued
DO NOT DELETE TOPICS! (though you have an option to do that)
What happens when a broker dies?
Leader fail-over
Corrupted index / log files
URP (under-replicated partitions)
Uneven leader distribution
Preferred replica election:
bin/kafka-preferred-replica-election.sh --zookeeper zk_host:port/chroot
or auto.leader.rebalance.enable=true

Now let's see the operational challenges when a broker dies. Kafka does the leader fail-over, so one of the followers in the ISR becomes the new leader. You will end up with corrupt index/log files, and with under-replicated partitions (obviously, since you lost one of the replicas). Kafka takes care of the corrupt index/log files by discarding incomplete log entries. URPs are fixed when the broker comes back up, but remember that the replicas come back as followers. This creates a challenge: you now have uneven leader distribution across your cluster. Kafka ships a CLI with which you can rebalance the leader distribution. There is also an option to do this automatically, but it is not very clean.

Adding a broker

(Diagram: producers and consumers connected to a cluster of brokers, with a new broker being added)

Partition reassignment
Broker leveling script moves data to even out data volume per broker

As I mentioned, when you add a broker to a cluster it won't be used by existing partitions. With the 0.8.1 release of Kafka there is a new feature: partition reassignment! Now when you add a broker to the cluster, it can be used by your existing topics and partitions. Existing partitions can be moved around live, completely transparently to all consumers and producers. We have developed a tool that sits on top of the partition reassignment tool and balances a cluster after you add new brokers, or when your cluster is simply unbalanced (there are many ways you can wind up in this state). It goes out to each broker and figures out how big each partition is (on disk) and the total amount of storage used on each broker. Then it starts calling the partition reassignment tool to make the larger brokers smaller and the smaller brokers larger, stopping once the overall data size is within 1 GB between the smallest and largest brokers. This is just one example of the many possible ways to optimize a cluster with the partition reassignment tool.

Kafka operations - continued
Expanding a Kafka cluster:
Create brokers with new broker IDs
Kafka will not automatically move topics to the new brokers; an admin needs to initiate the move
Generate the plan:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --topics-to-move-json-file topics-to-move.json --broker-list "5,6" --generate
Execute the plan:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --execute
Verify the execution:
bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file expand-cluster-reassignment.json --verify

Expanding a Kafka cluster is very simple: you create brokers with unique broker IDs and start the Kafka server, and the servers are automatically added to the cluster. But adding new brokers won't trigger automatic balancing in the cluster; an admin needs to move topics to the new brokers. Only the initiation is manual; the process itself is automated.

Data Assurance
No data loss and no reordering
Critical for applications like DB replication
Can Kafka do this? Yes!
Causes of data loss on the producer side:
Setting block.on.buffer.full=false
Retries exhausted
Sending messages without acks=all
How can you fix it?
Set block.on.buffer.full=true
Set retries to Long.MAX_VALUE
Set acks to all
Have a resend in your callback function (producer.send(record, callback))

Data loss is not a big deal in applications like pageview event tracking: the loss of one or two messages in a million is fine. It becomes critical for applications like DB replication and transactions which involve money. Where can the loss happen? On the producer side, the consumer side and the broker side; let's see the issues on each. On the producer side, setting block.on.buffer.full to false throws an error and discards messages when the buffer fills. Another cause is the number of retries being exhausted. Also consider whether you are running in async or sync mode, and in sync mode whether you are waiting for a commit acknowledgment from all replicas or not. The fixes: with block.on.buffer.full set to true, the producer won't accept any more messages when its buffer is full; if you set retries to Long.MAX_VALUE, it will retry up to 2^63 - 1 times; and set acks to all.
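The producer settings above, collected in one place (a plain dict of the property names used in the slide, not a runnable client; pass the equivalent options to whichever client library you use):

```python
# No-data-loss producer settings from the slide, as a reference dict.

NO_DATA_LOSS_PRODUCER = {
    "block.on.buffer.full": "true",      # back-pressure instead of dropping
    "retries": str(2**63 - 1),           # Long.MAX_VALUE: retry indefinitely
    "acks": "all",                       # wait for all in-sync replicas
}

for key, value in NO_DATA_LOSS_PRODUCER.items():
    print(key, "=", value)
```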

Data Assurance - Continued
Causes of data loss on the consumer side:
Offsets are carelessly committed
Data loss can happen if the consumer committed the offset but died while processing the message
Fixing data loss on the consumer side:
Commit the offset only after processing of the message is completed
Disable auto.offset.commit
Fixing on the broker side:
Have replication factor >= 3
Have min.isr = 2
Disable unclean leader election

The cause of data loss on the consumer side is that you are careless! Just kidding. It can happen if you consume messages and commit the offset before really processing them: you can have a failure during processing. How do you fix this? First, commit the offset only after processing the message; second, disable auto.offset.commit.
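The commit-after-processing fix can be sketched with a toy consume loop (the `consume` helper and the crash flag are illustrative, simulating a consumer that dies mid-batch):

```python
# Sketch: commit the offset only AFTER processing. A crash between
# processing and commit re-delivers the message (at-least-once)
# instead of losing it.

def consume(log, committed, process, crash_before_commit_at=None):
    """Process messages from `committed` on; return last committed offset."""
    for offset in range(committed, len(log)):
        process(log[offset])
        if offset == crash_before_commit_at:
            return committed             # died before committing this offset
        committed = offset + 1           # commit AFTER processing succeeds
    return committed

seen = []
log = ["m0", "m1", "m2"]
# First run crashes after processing offset 1 but before committing it:
committed = consume(log, 0, seen.append, crash_before_commit_at=1)
print(committed)                         # 1: only m0 was committed
# On restart, consumption resumes from the committed offset:
committed = consume(log, committed, seen.append)
print(committed, seen)                   # m1 processed twice, nothing lost
```

Committing before processing would instead have skipped m1 entirely after the crash: that is exactly the data loss described above.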

Data Assurance - Continued
Message reordering:
Can happen if more than one message is in transit and retries are enabled
Fixing message reordering:
Set max.in.flight.requests.per.connection=1

Kafka 0.9 (Beta release)
Security:
Kerberos or TLS based authentication
Unix-like permissions to restrict who can access data
Encryption on the wire via SSL
Kafka Connect:
Supports large-scale real-time import and export for Kafka
Takes care of fault tolerance, offset management and delivery management
Will support connectors for Hadoop and databases
User-defined quotas:
To manage abusive clients
Rate limit traffic on the producer side and consumer side

Data needs to be moved in and out of Kafka and other systems, and people use multiple solutions for this.

Kafka 0.9 (Beta release) - continued
Quotas:
For example, allow only 10 MBps for reads and 5 MBps for writes
If clients violate the quota, they are slowed down
Can be overridden
New Consumer:
Removes the distinction between the high-level consumer and the simple consumer
Unified consumer API
No longer Zookeeper dependent
Offers pluggable offset management

How Can You Get Involved?
http://kafka.apache.org

Join the mailing lists
users@kafka.apache.org

irc.freenode.net - #apache-kafka


So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can join the mailing lists, either on the development or the user side. You can also dive into the source repository, and work on and contribute your own tools back.

Kafka may be young, but it's a critical piece of data infrastructure for many of us.

Q & A
Want to contact us?

Akash Vacher (avacher@linkedin.com)
Ayyappadas Ravindran (appu@linkedin.com)
Talent Partner: Syed Hussain (sshussain@linkedin.com)
Mob: +91 953 581 8876