current and future of apache kafka

Apache Kafka

The current and future

http://kafka.apache.org/

http://kafka.apache.org/

Joe Stein

• Developer, Architect & Technologist

• Founder & Principal Consultant => Big Data Open Source Security LLC - http://stealth.ly

Big Data Open Source Security LLC provides professional services and product solutions for the collection,

storage, transfer, real-time analytics, batch processing and reporting for complex data streams, data sets and

distributed systems. BDOSS is all about the "glue" and helping companies to not only figure out what Big Data

Infrastructure Components to use but also how to change their existing (or build new) systems to work with

them.

• Apache Kafka Committer & PMC member

• Blog & Podcast - http://allthingshadoop.com

• Twitter @allthingshadoop

http://stealth.ly

http://allthingshadoop.com

https://twitter.com/allthingshadoop

Apache Kafka

• Apache Kafka

o http://kafka.apache.org

• Apache Kafka Source Code

o https://github.com/apache/kafka

• Documentation

o http://kafka.apache.org/documentation.html

• FAQ

o https://cwiki.apache.org/confluence/display/KAFKA/FAQ

• Wiki

o https://cwiki.apache.org/confluence/display/KAFKA/Index

http://kafka.apache.org

https://github.com/apache/kafka

http://kafka.apache.org/documentation.html

https://cwiki.apache.org/confluence/display/KAFKA/FAQ

https://cwiki.apache.org/confluence/display/KAFKA/Index

It often starts with just one data pipeline

Reuse of data pipelines for new providers

Reuse of existing providers for new consumers

Eventually the solution becomes the problem

And then it gets worse!

Kafka decouples data-pipelines

Topics & Partitions

A high-throughput distributed messaging system

rethought as a distributed commit log.

LinkedIn’s Kafka Clusters

Where we were when things started

Replication response times

Consumer Throughput

New (0.8.2-beta) JVM Producer

1 producer, replication x 3 async 786,980 records/sec

(75.1 MB/sec)

1 producer, replication x 3 sync 421,823 records/sec

(40.2 MB/sec)

3 producer, replication x 3 async 2,024,032 records/sec

(193.0 MB/sec)

End-to-end latency

2 ms (median)

3 ms (99th percentile)

14 ms (99.9th percentile)

Message Size vs Throughput (count)

Message Size vs Throughput (MB)

Recap

• Producers - ** push **

o Batching

o Compression

o Sync (Ack), Async (auto batch)

o Replication for durability and fault tolerance

o Sequential writes, guaranteed ordering within each partition

• Consumers - ** pull **

o No state held by broker

o Consumers control reading from the stream

• Zero Copy for producers and consumers to and from the broker

http://kafka.apache.org/documentation.html#maximizingefficiency

• Message stay on disk when consumed, deletes on TTL or compaction

https://kafka.apache.org/documentation.html#compaction

http://kafka.apache.org/documentation.html#maximizingefficiency

https://kafka.apache.org/documentation.html#compaction

Client Libraries

● JVM Client supported by the Apache Project https://kafka.apache.org/documentation.html#api

● Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients

• Python - Pure Python implementation with full protocol support. Consumer and Producer

implementations included, GZIP and Snappy compression supported.

• C - High performance C library with full protocol support

• C++ - Native C++ library with protocol support for Metadata, Produce, Fetch, and Offset.

• Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer

implementations included, GZIP and Snappy compression supported.

• Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy

compression supported. Ruby 1.9.3 and up (CI runs MRI 2.

• Clojure - Clojure DSL for the Kafka API

• JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation

• stdin & stdout

Wire Protocol Developers Guide

https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

https://kafka.apache.org/documentation.html#api

https://cwiki.apache.org/confluence/display/KAFKA/Clients

https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol

● LinkedIn - Apache Kafka is used at LinkedIn for activity

stream data and operational metrics. This powers various

products like LinkedIn Newsfeed, LinkedIn Today in addition

to our offline analytics systems like Hadoop.

● Twitter - As part of their Storm stream processing

infrastructure, e.g. this.

● Netflix - Real-time monitoring and event-processing pipeline.

● Square - We use Kafka as a bus to move all systems events

through our various datacenters. This includes metrics, logs,

custom events etc. On the consumer side, we output into

Splunk, Graphite, Esper-like real-time alerting.

● Spotify - Kafka is used at Spotify as part of their log delivery

system.

● Pinterest - Kafka is used with Secor as part of their log

collection pipeline.

● Uber

● Tumblr - See this

● Box - At Box, Kafka is used for the production analytics

pipeline & real time monitoring infrastructure. We are

planning to use Kafka for some of the new products &

● Mozilla - Kafka will soon be replacing part of our current

production system to collect performance and usage data

from the end-users browser for projects like Telemetry, Test

Pilot, etc. Downstream consumers usually persist to either

HDFS or HBase.

● Tagged - Apache Kafka drives our new pub sub system

which delivers real-time events for users in our latest game -

Deckadence. It will soon be used in a host of new use cases

including group chat and back end stats and log collection.

● Foursquare - Kafka powers online to online messaging, and

online to offline messaging at Foursquare. We integrate with

monitoring, production systems, and our offline infrastructure,

including hadoop.

● StumbleUpon - Data collection platform for analytics.

● Coursera - At Coursera, Kafka powers education at scale,

serving as the data pipeline for realtime learning

analytics/dashboards.

● Shopify - Access logs, A/B testing events, domain events ("a

checkout happened", etc.), metrics, delivery to HDFS, and

customer reporting. We are now focusing on consumers:

analytics, support tools, and fraud analysis.

http://www.linkedin.com/

http://www.twitter.com/

http://engineering.twitter.com/2013/01/improving-twitter-search-with-real-time.html

http://www.netflix.com/

http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html

https://squareup.com/

http://www.spotify.com/

http://www.meetup.com/stockholm-hug/events/121628932

http://www.pinterest.com/

http://engineering.pinterest.com/post/84276775924/introducing-pinterest-secor

http://www.pinterest.com/

https://www.uber.com/

https://www.tumblr.com/

http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html

https://www.box.com/

http://mozilla.org/

http://www.tagged.com/

http://foursquare.com/

http://www.stumbleupon.com/

https://www.coursera.org/

http://www.shopify.com/

● Mate1.com Inc. - Apache kafka is used at Mate1 as our main

event bus that powers our news and activity feeds,

automated review systems, and will soon power real time

notifications and log distribution.

● Boundary - Apache Kafka aggregates high-flow message

streams into a unified distributed pubsub service, brokering

the data for other internal systems as part of Boundary's real-

time network analytics infrastructure.

● Ancestry.com - Kafka is used as the event log processing

pipeline for delivering better personalized product and

service to our customers.

● DataSift - Apache Kafka is used at DataSift as a collector of

monitoring events and to track user's consumption of data

streams in real

time.http://highscalability.com/blog/2011/11/29/datasift-

architecture-realtime-datamining-at-120000-tweets-p.html

● Spongecell - We use Kafka to run our entire analytics and

monitoring pipeline driving both real-time and ETL

applications for our customers.

● Wooga - We use Kafka to aggregate and process tracking

data from all our facebook games (which are hosted at

● AddThis - Apache Kafka is used at AddThis to collect events

generated by our data network and broker that data to our

analytics clusters and real-time web analytics platform.

● Urban Airship - At Urban Airship we use Kafka to buffer

incoming data points from mobile devices for processing by

our analytics infrastructure.

● Metamarkets - We use Kafka to ingest real-time event data,

stream it to Storm and Hadoop, and then serve it from our

Druid cluster to feed our interactive analytics dashboards.

We've also built connectors for directly ingesting events from

Kafka into Druid.

● Simple - We use Kafka at Simple for log aggregation and to

power our analytics infrastructure.

● Gnip - Kafka is used in their twitter ingestion and processing

pipeline.

● Loggly - Loggly is the world's most popular cloud-based log

management. Our cloud-based log management service

helps DevOps and technical teams make sense of the the

massive quantity of logs. Kafka is used as part of our log

collection and processing infrastructure.

http://www.mate1.com/about

https://boundary.com/

http://www.ancestry.com/

http://blogs.ancestry.com/techroots/on-track-to-data-driven

http://datasift.com/

http://highscalability.com/blog/2011/11/29/datasift-architecture-realtime-datamining-at-120000-tweets-p.html

http://www.spongecell.com/

http://wooga.com/

http://addthis.com/

http://www.urbanairship.com/

http://metamarkets.com/

http://druid.io/docs/latest/Kafka-Eight.html

https://www.simple.com/

http://gnip.com/

http://loggly.com/

http://www.loggly.com/behind-the-screens

● RichRelevance - Real-time tracking event pipeline.

● SocialTwist - We use Kafka internally as part of our reliable

email queueing system.

● Countandra - We use a hierarchical distributed counting

engine, uses Kafka as a primary speedy interface as well as

routing events for cascading counting

● FlyHajj.com - We use Kafka to collect all metrics and events

generated by the users of the website.

● uSwitch - See this blog.

● InfoChimps - Kafka is part of the InfoChimps real-time data

platform.

● Visual Revenue - We use Kafka as a distributed queue in

front of our web traffic stream processing infrastructure

(Storm).

● Oolya - Kafka is used as the primary high speed message

queue to power Storm and our real-time analytics/event

ingestion pipelines.

● Datadog - Kafka brokers data to most systems in our metrics

and events ingestion pipeline. Different modules contribute

and consume data from it, for streaming CEP (homegrown),

● VisualDNA We use Kafka 1. as an infrastructure that helps us

bring continuously the tracking events from various

datacenters into our central hadoop cluster for offline

processing, 2. as a propagation path for data integration, 3.

as a real-time platform for future inference and

recommendation engines

● Sematext - in SPM (performance monitoring + alerting),

Kafka is used for metrics collection and feeds SPM's in-

memory data aggregation (OLAP cube creation) as well as

our CEP/Alerts servers (see also: SPM for Kafka

performance monitoring). In SA (search analytics) Kafka is

used in search and click stream collection before being

aggregated and persisted. In Logsene (log analytics) Kafka is

used to pass logs and other events from front-end receivers

to the persistent backend.

● Wize Commerce - At Wize Commerce (previously, NexTag),

Kafka is used as a distributed queue in front of Storm based

processing for search index generation. We plan to also use

it for collecting user generated data on our web tier, landing

the data into various data sinks like Hadoop, HBase, etc.

● Quixey - At Quixey, The Search Engine for Apps, Kafka is an

integral part of our eventing, logging and messaging

http://www.richrelevance.com/

http://www.socialtwist.com/

http://www.countandra.org/

http://www.flyhajj.com/

http://www.uswitch.com/

http://oobaloo.co.uk/kafka-for-uswitchs-event-pipeline

http://www.infochimps.com/

http://blog.infochimps.com/2012/10/30/next-gen-real-time-streaming-storm-kafka-integration

http://www.visualrevenue.com/

http://www.ooyala.com/

http://datadog.com/

http://www.visualdna.com/

http://sematext.com/

http://sematext.com/spm

http://blog.sematext.com/2013/10/16/announcement-spm-performance-monitoring-for-kafka/

http://sematext.com/search-analytics

http://sematext.com/logsene

http://www.wizecommerce.com/

http://quixey.com/

● LinkSmart - Kafka is used at LinkSmart as an event stream

feeding Hadoop and custom real time systems.

● LucidWorks Big Data - We use Kafka for syncing LucidWorks

Search (Solr) with incoming data from Hadoop and also for

sending LucidWorks Search logs back to Hadoop for analysis.

● Cloud Physics - Kafka is powering our high-flow event

pipeline that aggregates over 1.2 billion metric series from

1000+ data centers for near-to-real time data center

operational analytics and modeling

● Graylog2 - Graylog2 is a free and open source log

management and data analysis system. It's using Kafka as

default transport for Graylog2 Radio. The use case is

described here.

● Yieldbot - Yieldbot uses kafka for real-time events, camus for

batch loading, and mirrormakers for x-region replication.

● LivePerson - Using Kafka as the main data bus for all real

time events.

● Retention Science - Click stream ingestion and processing.

● Strava - Powers our analytics pipeline, activity feeds denorm

and several other production services.

● Outbrain - We use Kafka in production for real time log collection and

processing, and for cross-DC cache propagation.

● SwiftKey - We use Apache Kafka for analytics event processing.

● Yeller - Yeller uses Kafka to process large streams of incoming exception

data for it's customers. Rate limiting, throttling and batching are all built on

top of Kafka.

● Emerging Threats - Emerging threats uses Kafka in our event pipeline to

process billions of malware events for search indices, alerting systems, etc.

● Hotels.com - Hotels.com uses Kafka as pipeline to collect real time events

from multiple sources and for sending data to HDFS.

● Helprace - Kafka is used as a distributed high speed message queue in our

help desk software as well as our real-time event data aggregation and

analytics.

● Exponential is using Kafka in production to power the events ingestion

pipeline for real time analytics and log feed consumption.

● Livefyre - uses Kafka for the real time notifications, analytics pipeline and as

the primary mechanism for general pub/sub.

● Exoscale - uses Kafka in production.

● Cityzen Data - uses Kafka as well, we provide a platform for collecting,

storing and analyzing machine data.

● Criteo - use Kafka in production for over a year for stream processing and

log transfer (over 2M messages/s and growing)

● The Wikimedia Foundation - uses Kafka as a log transport for analytics data

http://www.linksmart.com/

http://www.lucidworks.com/products/lucidworks-big-data

http://www.cloudphysics.com/

http://graylog2.org/

http://support.torch.sh/help/kb/graylog2-server/using-graylog2-radio-v020x

http://www.yieldbot.com/

http://www.liveperson.com/

http://www.retentionscience.com/

http://www.strava.com/

http://www.outbrain.com/

http://www.swiftkey.net/

http://yellerapp.com/

http://emergingthreats.net/

http://www.hotels.com/

http://helprace.com/help-desk

http://www.exponential.com/

http://web.livefyre.com/

https://www.exoscale.ch/

http://www.cityzendata.com/

http://www.criteo.com/

http://wikimediafoundation.org/wiki/Our_projects

Where are we going? (0.8.2)

● 0.8.2-beta is released https://kafka.apache.org/downloads.html

○ Better consistency vs. availability (min.isr per topic)

○ LZ4 Compression

○ Offset Storage away from Zookeeper

○ Better Topic Management

○ New Java Producer

○ Over 100 bug fixes

○ More!

https://archive.apache.org/dist/kafka/0.8.2-beta/RELEASE_NOTES.html

https://kafka.apache.org/downloads.html

https://archive.apache.org/dist/kafka/0.8.2-beta/RELEASE_NOTES.html


● New Consumer

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

● Continued operational improvements

o Command Line Interface / Admin API

o https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Command+Line+and+Relate

d+Improvements

● Security https://cwiki.apache.org/confluence/display/KAFKA/Security

o Authentication

TLS/SSL

Kerberos

o Authorization

Plugable

● Idempotence, Transactions

https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Command+Line+and+Related+Improvements

https://cwiki.apache.org/confluence/display/KAFKA/Security

https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka

Really Quick Start (Scala)

1) Install Vagrant http://www.vagrantup.com/

2) Install Virtual Box https://www.virtualbox.org/

3) git clone https://github.com/stealthly/scala-kafka

4) cd scala-kafka

5) vagrant up

Zookeeper will be running on 192.168.86.5

BrokerOne will be running on 192.168.86.10

All the tests in ./src/test/scala/* should pass (which is also /vagrant/src/test/scala/* in the vm)

6) ./gradlew test

http://www.vagrantup.com/

https://www.virtualbox.org/

https://github.com/stealthly/scala-kafka

Really Quick Start (Go)

1) Install Vagrant http://www.vagrantup.com/

2) Install Virtual Box https://www.virtualbox.org/

3) git clone https://github.com/stealthly/go-kafka

4) cd go-kafka

5) vagrant up

6) vagrant ssh brokerOne

7) cd /vagrant

8) sudo ./test.sh

http://www.vagrantup.com/

https://www.virtualbox.org/

https://github.com/stealthly/go-kafka

Questions?

/*******************************************

Joe Stein

Founder, Principal Consultant

Big Data Open Source Security LLC

http://www.stealth.ly

Twitter: @allthingshadoop

********************************************/

http://www.stealth.ly/

http://www.twitter.com/allthingshadoop

current and future of apache kafka

Technology