apache kafka best practices

35
Apache Kafka Best Practices Manikumar Reddy @omkreddy

Upload: dataworks-summithadoop-summit

Post on 21-Jan-2018

16.955 views

Category:

Technology


5 download

TRANSCRIPT

Page 1: Apache Kafka Best Practices

Apache Kafka Best PracticesManikumar Reddy

@omkreddy

Page 2: Apache Kafka Best Practices

2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Apache Kafka

Core APIs– The Producer API

– The Consumer API

– The Connector API

– The Streams API

Broad classes of applications– Building real-time streaming data pipelines

– Building real-time streaming applications

– core building block in other data systems

Page 3: Apache Kafka Best Practices

3 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Key Concepts and Terminology

Page 4: Apache Kafka Best Practices

4 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Component Layout

Page 5: Apache Kafka Best Practices

5 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Hardware Guidance

Cluster Size Memory CPU Storage

Kafka Brokers 3+24G+ (for small)64GB+ (for large)

Multi- core processors( 12 CPU+ core), Hyper threading enabled

6+ x 1TB dedicated disks( RAID or JBOD)

Zookeeper3 (for small)5 (for large)

8GB+ (for small)24GB+ (for large)

2 core +

SSD for Transaction logs

Page 6: Apache Kafka Best Practices

6 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

OS Tuning

OS Page Cache– Ex: Allocate to hold all the active segments of the log.

File descriptor limits : >100k

less swapping

Tcp tuning

JVM Configs– Java 8 with G1 Collector

– 6-8 GB heap

Page 7: Apache Kafka Best Practices

7 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Disk Storage

Use multiple disk spindles, dedicated to kafka

JBOD vs RAID10

JBOD– Gives all the disk I/O

JBOD Limitations– any disk failure causes an unclean shutdown and requires lengthy recovery

– data is not distributed consistently across disks

– Multiple directories

KIP-112/113– necessary tools for users to manage JBOD

– Intelligent partition assignment

– On disk failure, broker can serve replicas on the good disks

– re-assign replicas between disks of the same broker

Page 8: Apache Kafka Best Practices

8 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

RAID

RAID10– Can survive single disk failure

– Performance and protection

– balance load across disks

– Single mount point

– Performance hit and reduces the space

File System– EXT or XFS

– SSD

– Issues on NFS.

– SAN, NAS

Page 9: Apache Kafka Best Practices

9 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Basic Monitoring

CPU Load

Network Metrics

File Handle Usage

Disk Space

Disk I/O Performance

Garbage Collection

ZooKeeper Monitoring

Page 10: Apache Kafka Best Practices

10 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Replication

Partition has replicas – Leader replica, Follower replicas

Leader maintains in-sync-replicas (ISR)– replica.lag.time.max.ms, num.replica.fetchers

– min.insync.replica – used by producer to ensure greater durability

https://www.slideshare.net/junrao/kafka-replication-apachecon2013

Page 11: Apache Kafka Best Practices

11 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Under Replicated Partitions

Number of partitions which are not fully replicated within the cluster

Mbean - kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

ISR Shrink/Expand Rate

Under Replicated Partitions– Lost Broker?

– Controller Issues

– Zookeeper Issues

– Network Issues

Solutions– Tune the ISR settings

– Expand brokers

Page 12: Apache Kafka Best Practices

12 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Controller

Manages Partitions Life cycle

Avoid controller's ZK session expires– Soft failures – ISR Churn/Under replicated partitions

– ZK Server performance

– Long GC pauses on Broker

– Bad network configuration

Monitoring– Mbean : kafka.controller:type=KafkaController,name=ActiveControllerCount

– only one broker in the cluster should have 1

– LeaderElectionRate

Page 13: Apache Kafka Best Practices

13 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Unclean leader election

Enable replicas not in the ISR set to be elected as leader

Availability vs correctness – By-default kafka chooses availability

Monitoring– Mbean : kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec

Default will be changed in next release

Page 14: Apache Kafka Best Practices

14 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Broker Configs

log.retention.{ms, minutes, hours} , log.retention.bytes

message.max.bytes, replica.fetch.max.bytes

delete.topic.enable

unclean.leader.election.enable = false

min.insync.replicas = 2

replica.lag.time.max.ms, num.replica.fetchers

replica.fetch.response.max.bytes

zookeeper.session.timeout.ms = 30s

num.io.threads

Page 15: Apache Kafka Best Practices

15 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Cluster Sizing

Broker Sizing– Partition count on each broker (<2K)

– Keep partition size on disk manageable (under 25GB per partition )

Cluster Size (no. of brokers)– how much retention we need

– how much traffic cluster is getting

Cluster Expansion– Disk usage on the log segments partition should stay under 60%

– Network usage on each broker should stay under 75%

Cluster Monitoring– Keep cluster balanced

– Ensure that partitions of a topic are fairly distributed across brokers

– Ensure that nodes in a cluster are not running out of disk and network

Page 16: Apache Kafka Best Practices

16 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Broker Monitoring

Partition Counts– Mbean: kafka.server:type=ReplicaManager,name=PartitionCount

Leader replica counts– Mbean: kafka.server:type=ReplicaManager,name=LeaderCount

ISR Shrink Rate/ISR expansion rate– kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

Message in rate/Byte in rate/Byte out rate

NetworkProcessorAvgIdlePercent

RequestHandlerAvgIdlePercent

Page 17: Apache Kafka Best Practices

17 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Topic Sizing

No. of partitions – Have at least as many partitions as there are consumers in the largest group

– topic is very busy – more partitions

– Keep partition size on disk manageable (under 25GB per partition )

– Take into account any other application requirements

– Special use cases – single partition

Keyed messages– enough partitions to deal with future growth

expanding partitions – whenever the size of the partition on disk is larger than threshold

Page 18: Apache Kafka Best Practices

18 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Choosing Partitions

Based on throughput requirements one can pick a rough number of partitions.– Lets call the throughput from producer to a single partition is P

– Throughput from a single partition to a consumer is C

– Target throughput is T

– At least max (T/P, T/C)

More Partitions– More open file handles

– May increase unavailability

– May increase end-to-end latency

– More memory for clients

Page 19: Apache Kafka Best Practices

19 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Quotas

Protect from bad clients and maintain SLAs

byte-rate thresholds on produce and fetch requests

can be applied to (user, client-id), user or client-id groups.

Server delays the responses

Broker Metrics for monitoring – throttle-rate, byte-rate

replica.fetch.response.max.bytes– Limit memory usage of replica fetch response

Limiting bandwidth usage during data migration– kafka-reassign-partitions.sh -- -throttle option

Page 20: Apache Kafka Best Practices

20 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Producer

User new java based clients

Test in your Environment– kafka-producer-perf-test.sh

Memory

CPU

Batch Compression

Avoid large messages– creates more memory pressure

– slows down the brokers

Page 21: Apache Kafka Best Practices

21 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Critical Configs

batch.size– size based batching

– larger size -> high throughput, higher latency

linger.ms– time based batching

– larger size -> high throughput, higher latency

max.in.flight.requests.per.connection– Better throughput, affects ordering

compression.type– adding more user threads can help throughput

acks– Affects message durability

Page 22: Apache Kafka Best Practices

22 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Performance tuning

If throughput < network capacity– Add more user threads

– Increase batch size

– Add more producers instances

– Add more partitions

Latency when acks = -1– Increase num.replica.fetchers

Cross datacenter data transfer – Tune socket buffer settings, OS tcp buffer settings

Page 23: Apache Kafka Best Practices

23 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Producer Monitoring

batch-size-avg

compression-rate-avg

waiting-threads

buffer-available-bytes

record-queue-time-max

record-send-rate

records-per-request-avg

Page 24: Apache Kafka Best Practices

24 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Consumer

Test in your Environment– kafka-consumer-perf-test.sh

Throughput Issues– not enough partitions

– OS Page Cache - allocate enough to hold all the messages for your consumers for say, 30s

– Application/Processing logic

Offsets topic– __consumer_offsets

– offsets.topic.replication.factor

– offsets.retention.minutes

– Monitor ISR, topic size

Slow offset commits– commit async, manual commits

Page 25: Apache Kafka Best Practices

25 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Consumer Configs

fetch.min.bytes and fetch.max.wait.ms

max.poll.interval.ms

max.poll.records

session.timeout.ms

Consumer Rebalance– check timeouts

– check processing times/logic

– GC Issues

Tune network settings

Page 26: Apache Kafka Best Practices

26 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Consumer Monitoring

Whether or not the consumer is keeping up with the messages that are being produced

Consumer Lag: Difference between the end of the log and the consumer offset

Monitoring– Metrics Monitoring - records-lag-max

– bin/kafka-consumer-groups.sh

– LinkedIn’s Burrow for consumer monitoring

Decreasing Lag– Analyze consumer - GC Issues, hung instance

– Add more consumer Instances

– increase the number of partitions and consumers

Page 27: Apache Kafka Best Practices

27 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

No data loss settings

Producer– block.on.buffer.full=true

– retries=Long.MAX_VALUE

– acks=all

– max.in.flight.requests.per.connection=1

– close producer

Broker– replication factor >= 3

– min.insync.replicas=2

– disable unclean leader election

Consumer– disable auto.offset.commit

– Commit offsets only after the messages are processed

Page 28: Apache Kafka Best Practices

28 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Authorizer - Ranger Auditing

Page 29: Apache Kafka Best Practices

29 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Mirror Maker

Tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster

Page 30: Apache Kafka Best Practices

30 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Mirror Maker Run multiple mirroring processes

– high fault-tolerance

– high throughput

--num.streams option to specify the number of consumer threads

– no.of threads in num.streams

Page 31: Apache Kafka Best Practices

31 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Kafka Mirror Maker Consumer and source cluster socket buffer sizes

– high value for the socket buffer size

– consumer's fetch size

– OS networking Tuning

Source and Target Clusters are independent entities– Can be different numbers of partitions

– offsets will not be the same.

– partitioning order is preserved on a per-key basis.

Create topics in target cluster

Monitor whether a mirror is keeping up– Consumer Lag

Running In Secure Clusters– We recommend to use SSL

– We can run MM on source cluster

Page 32: Apache Kafka Best Practices

32 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Open source Operational Tools

Ambari Metrics– https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-

guide/content/grafana_kafka_dashboards.html

Removing brokers and rebalancing partitions in a cluster– https://github.com/linkedin/kafka-tools

Consumer Lag Monitoring– Burrow (https://github.com/linkedin/Burrow)

Kafka Manager - https://github.com/yahoo/kafka-manager

Page 33: Apache Kafka Best Practices

33 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Apache Kafka 0.10.2 release

Includes 15 KIPs, over 200 bug fixes and improvements

The newest Java Clients now support older brokers (0.10.0 and higher)

Separation of Internal and External traffic

Create Topic Policy

Security Improvements– Support for SASL/SCRAM mechanisms

– Dynamic JAAS configuration for Kafka clients

– Support for authentication of multiple Kafka clients in single JVM

Producer and Consumer Improvements

Connect API & Streams API improvements

Page 34: Apache Kafka Best Practices

34 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

Thank You

Page 35: Apache Kafka Best Practices

35 © Hortonworks Inc. 2011 – 2017. All Rights Reserved

References

http://kafka.apache.org/documentation.html

https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html

https://www.slideshare.net/JiangjieQin/producer-performance-tuning-for-apache-kafka-63147600

https://www.slideshare.net/ToddPalino/tuning-kafka-for-fun-and-profit

https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844

https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive

https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/