Introduction to Kafka
TRANSCRIPT
Introduction to Kafka
Akash Vacher
2015/12/5
▪ Akash Vacher, SRE, Data Infrastructure Streaming (Bengaluru), LinkedIn
SRE?
▪Site Reliability Engineers
–Administrators
–Architects
–Developers
▪Keep the site running, always
Agenda
▪ Kafka Overview
▪ Some facts and figures
▪ Basic Kafka concepts
▪ Some use cases
▪ Q and A
Kafka Overview
▪ High-throughput distributed messaging system
▪ Kafka guarantees:
– At least once delivery
– Strong ordering
▪ Developed at LinkedIn and open sourced in early 2011
▪ Implemented in Scala and Java
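The "at least once" guarantee above has a practical consequence worth spelling out: a producer retries until it sees an acknowledgement, so a lost ack (rather than a lost message) results in a duplicate in the log. The sketch below is an illustration of that failure mode, not Kafka's actual implementation; the `Broker` class and `ack_lost` flag are hypothetical.

```python
# Illustrative sketch (not Kafka's code) of at-least-once delivery:
# the producer retries until acked, so a lost ack yields a duplicate.

class Broker:
    def __init__(self):
        self.log = []          # append-only message log

    def receive(self, msg, ack_lost=False):
        self.log.append(msg)   # message is durably appended
        return not ack_lost    # the ack may be lost on the way back

def produce(broker, msg, fail_first_ack=False):
    """Retry until an ack arrives: at-least-once semantics."""
    acked = broker.receive(msg, ack_lost=fail_first_ack)
    while not acked:
        acked = broker.receive(msg)  # retry appends a duplicate

broker = Broker()
produce(broker, "event-1")
produce(broker, "event-2", fail_first_ack=True)  # ack lost once
print(broker.log)  # ['event-1', 'event-2', 'event-2']
```

This is why consumers that need effectively-once processing must be idempotent or deduplicate on a key.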
Kafka users
Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Attributes of a Kafka Cluster
• Disk Based
• Durable
• Scalable
• Low Latency
• Finite Retention
Motivation
▪ Unified platform to handle all real time data feeds
▪ High throughput
▪ Stream Processing
▪ Horizontally scalable
Before
After
How is Kafka used at LinkedIn?
▪ Monitoring (inGraphs)
▪ User tracking
▪ Email and SMS notifications
▪ Stream processing (Samza)
▪ Database Replication
Facts and figures
▪ Over 1,300,000,000,000 messages are produced to Kafka every day at LinkedIn
▪ 300 Terabytes of inbound and 900 Terabytes of outbound traffic
▪ 4.5 million messages per second on a single cluster
▪ Kafka runs on ~1300 servers at LinkedIn
Building blocks
The humble log
Anatomy of a topic
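The original slide showed this as a diagram; the same anatomy can be modeled in a few lines. A topic is split into partitions, each an append-only log where every message receives a monotonically increasing offset, and a stable hash of the message key selects the partition. This is an illustrative model, not Kafka's code, and `crc32` stands in for the real (pluggable) partitioner.

```python
# Minimal model of a topic: partitions are append-only logs with
# offsets; same key -> same partition -> strong per-key ordering.
import zlib

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # crc32 is an assumption here; Kafka's partitioner is pluggable
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

topic = Topic(num_partitions=3)
p1, o1 = topic.produce("user-42", "page_view")
p2, o2 = topic.produce("user-42", "click")
assert p1 == p2      # same key lands on the same partition
assert o2 == o1 + 1  # offsets grow monotonically within a partition
```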
Consumer groups
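The consumer-group diagram can be summarized as: each partition is owned by exactly one consumer in the group, so the group collectively reads every partition while sharing the load. Below is a hedged sketch of a round-robin assignment; it is not Kafka's assignor code, and the consumer names are made up.

```python
# Illustrative round-robin assignment of partitions to the members
# of one consumer group: each partition gets exactly one owner.

def assign(partitions, consumers):
    """Deal partitions out to consumers like cards."""
    owners = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        owners[consumers[i % len(consumers)]].append(p)
    return owners

owners = assign(partitions=[0, 1, 2, 3, 4, 5], consumers=["c1", "c2"])
print(owners)  # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
```

Adding a third consumer to the group would trigger a rebalance that redistributes the six partitions two apiece.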
Bird’s eye view
Kafka in action
Broker AP0
AP1
AP1
AP0 AP0
Consumer
Producer
Zookeeper
Performance recipe
▪ OS page cache
▪ Linear I/O: never fear the file system!
▪ sendfile() system call
▪ Message batching
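The batching item deserves a number. With a fixed per-request overhead, sending N messages in batches of B costs roughly N/B times that overhead instead of N times. The figures below are hypothetical, purely to make the arithmetic concrete.

```python
# Back-of-the-envelope sketch of why message batching helps.
# The 50µs per-request overhead is an assumed, illustrative number.

def batches(messages, batch_size):
    """Group messages into fixed-size batches (last may be short)."""
    return [messages[i:i + batch_size]
            for i in range(0, len(messages), batch_size)]

msgs = [f"m{i}" for i in range(1000)]
per_request_overhead_us = 50           # assumed fixed cost per send

unbatched_cost = len(msgs) * per_request_overhead_us
batched = batches(msgs, batch_size=100)
batched_cost = len(batched) * per_request_overhead_us

print(len(batched), unbatched_cost, batched_cost)  # 10 50000 500
```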
Operating Kafka
▪ Broker Hardware
– Cisco C240, Intel Xeon quad core, 64GB RAM, 14-disk RAID-10
▪ Zookeeper Hardware
– 5 + 1 ensemble, 64GB RAM, 500GB SSD
Operating Kafka
▪ Monitoring
– Under-replicated partitions
– Unclean leader election
– Lag monitoring
– Burrow
▪ Cluster rebalance
– Size-wise rebalance
– Partition-wise rebalance
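The lag monitoring mentioned above measures, per partition, how far a consumer group's committed offset trails the broker's log-end offset; tools like Burrow evaluate this over time. The sketch below shows the core arithmetic only, with made-up offsets for illustration.

```python
# Illustrative consumer-lag calculation: lag per partition is the
# log-end offset minus the committed offset. Offsets are invented.

def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag; a healthy consumer keeps this near zero."""
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

lag = consumer_lag(
    log_end_offsets={0: 1500, 1: 980, 2: 2100},
    committed_offsets={0: 1500, 1: 950, 2: 1800},
)
print(lag)  # {0: 0, 1: 30, 2: 300}
```

A lag that grows monotonically on one partition usually points at a stuck or slow consumer rather than a producer spike.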
Kafka at LinkedIn
▪ Multiple data centers
▪ Mirror data
▪ Cluster Types
– Tracking
– Metrics
– Queuing
▪ Data transport from applications to Hadoop, and back
Metrics collection
▪ Building Blocks
– Sensors
– RRD
– Front end
▪ Facts & Figures
– 320,000,000 metrics collected per minute
– 530 TB of disk space
– Over 210,000 metrics collected per service
inGraphs
Kafka for database replication: master-slave
Kafka for database replication: multi-master
How Can You Get Involved?
▪ http://kafka.apache.org
▪ Join the mailing lists
– [email protected]
▪ irc.freenode.net - #apache-kafka
▪ Contribute
Questions?