Introduction to Apache Kafka

INTRO TO KAFKA

Jim Plush, Director of Cloud Engineering, CrowdStrike.com

Twitter: @jimplush

http://CrowdStrike.com

ABOUT ME

Jim Plush, Director of Cloud Engineering @ CrowdStrike.com

Architect of distributed cloud services for catching bad guys

Previously Director of Engineering at gravity.com

personalization service, ingesting clickstream from Yahoo!, New York Times, WSJ, etc…

wrote most of the ETL workflow

http://gravity.com

ABOUT CROWDSTRIKE

“Big Data” Security Company

Near term focus on targeted, state sponsored attacks and attribution

A single customer can generate 2.2 TB of machine data per day, which we process in our cloud

Horizontally scalable, distributed infrastructure

Uses goodies like Kafka, Cassandra, Elasticsearch, Hadoop, Scala, Go

“Some people, when confronted with a problem, think ‘I know, I'll use a message queue.’ Now they have two problems.”

- Said everyone, always

APACHE KAFKA

It’s not so much a queue as an activity stream system

Gains stability and speed at the cost of consumer complexity

It’s scalable by nature

Supports data replication

You can rewind time

It’s fast!

Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
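The "rewind time" and constant-time read points can be illustrated with a toy append-only log. This is only an in-memory sketch: the class name `PartitionLog` and its methods are made up for illustration and are not Kafka's API, and real Kafka stores its log on disk rather than in a Python list.

```python
class PartitionLog:
    """Toy in-memory stand-in for an append-only message log (hypothetical, not Kafka's API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset.

        The start position is found by index, not by scanning,
        so locating it does not get slower as the log grows.
        """
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for event in ["click-a", "click-b", "click-c"]:
    log.append(event)

first_pass = log.read(0)  # consume everything from the beginning
replay = log.read(1)      # "rewind" to offset 1 and read again
```

Reading does not delete anything, so a consumer can go back to any earlier offset and replay from there.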

APACHE KAFKA - CONS

Consumer Complexity

Not “Rack Aware” replication

Lack of tooling/monitoring

Still a pre-1.0 release

Operationally, it’s more manual than desired

Requires ZooKeeper

BASIC CONCEPTS

Topics - logical namespace for data (clickstream, app logs)

Partition - physical separation of data to allow for horizontal scalability

Consumer Groups/Offsets - where your consumer group last checkpointed in the stream

Replica - allows for partitions to be replicated across nodes for availability, only one is the active leader
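A rough way to see how topics, partitions, and consumer group offsets fit together is the toy model below. Everything in it (`Topic`, `OffsetStore`, `poll`, the group names "etl" and "audit") is hypothetical and kept in memory; it is not the Kafka client API, and replication is left out entirely.

```python
from collections import defaultdict


class Topic:
    """A named stream split into a fixed number of partitions (toy model)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        self.partitions[partition].append(message)


class OffsetStore:
    """Remembers, per (group, partition), the next offset that group should read."""

    def __init__(self):
        self._offsets = defaultdict(int)

    def committed(self, group, partition):
        return self._offsets[(group, partition)]

    def commit(self, group, partition, offset):
        self._offsets[(group, partition)] = offset


def poll(topic, store, group, partition):
    """Read everything past the group's checkpoint, then move the checkpoint forward."""
    start = store.committed(group, partition)
    batch = topic.partitions[partition][start:]
    store.commit(group, partition, start + len(batch))
    return batch


clickstream = Topic("clickstream", num_partitions=2)
store = OffsetStore()
clickstream.append(0, "view:/home")
clickstream.append(0, "view:/pricing")

etl_first = poll(clickstream, store, "etl", 0)      # both messages
etl_second = poll(clickstream, store, "etl", 0)     # nothing new since the checkpoint
audit_first = poll(clickstream, store, "audit", 0)  # a different group starts from offset 0
```

Each consumer group keeps its own checkpoint, so two groups can read the same partition independently.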

USE CASES

First point of data ingestion; provides a back-pressure buffer for downstream systems

Provide a data firehose for clients (with seeks)

Friendly to Blue/Green deployment architectures

Mirroring test data easily

Data Center log aggregation

Seamless Integration with Storm

Data Center Aggregation

Serving a Firehose

[Diagram labels: Producer, Data Stream, API Server, Customer A, Customer B]

Data Affinity w/ Key Partitioning

[Diagram labels: Producer, Data Stream P0, Data Stream P1, Consumer A, Consumer B, UserIds 0-100, UserIds 101-200]
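Data affinity means every message with the same key lands in the same partition, and so reaches the same consumer. The sketch below shows the idea with a hash of the key, which is a swapped-in technique: the diagram labels suggest ranges of user IDs per partition instead. The function name `partition_for` and the sample keys are assumptions for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (hypothetical helper).

    zlib.crc32 is used because it gives the same value on every run,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("user-42", "login"), ("user-7", "click"), ("user-42", "logout")]

# Route each event to its partition; all events for one user stay together.
partitions = {p: [] for p in range(2)}
for key, value in events:
    partitions[partition_for(key, 2)].append((key, value))
```

Because the mapping depends on the partition count, changing the number of partitions later moves keys to different partitions.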

Blue/Green Deployment

[Diagram, four animation frames. Labels: Producer, ActiveTopic, InactiveTopic, Blue Consumer, Green Consumer (added in the second frame), ZooKeeper, Controller, and “User: 555” next to the Controller (added in the third frame)]

SCALING OUT

1 partition = 1 consumer (within a consumer group)

1 partition needs to fit on a single machine

Partition count sets the scalability of your system on both the producer and consumer side

For high scale apps you will probably start out with 100 partitions
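The "1 partition = 1 consumer" rule can be sketched as an assignment function. This is a minimal round-robin illustration with a made-up `assign` helper, not how Kafka's own rebalancing is implemented.

```python
def assign(partitions, consumers):
    """Round-robin partitions across consumers; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One consumer has to read all three partitions.
one = assign([0, 1, 2], ["A"])

# Three consumers take one partition each.
three = assign([0, 1, 2], ["A", "B", "C"])

# A fourth consumer has nothing to read: consumers beyond the partition count sit idle.
four = assign([0, 1, 2], ["A", "B", "C", "D"])
```

This is why the partition count is the ceiling on consumer parallelism, and why it pays to start with more partitions than you have consumers today.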

[Diagram, two frames. Frame 1 labels: Producer, P0, P1, P2, Consumer A. Frame 2 labels: Producer, P0, P1, P2, Consumer A, Consumer B, Consumer C]

MONITORING

http://quantifind.com/KafkaOffsetMonitor/

ZOOKEEPER

http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html

WE’RE HIRING

[email protected]

@jimplush

http://crowdstrike.com/about-us/careers

[Diagram labels: Producer A, Producer B, ZooKeeper, ClickStream, Partition 1, Partition 2, Partition Offsets, Commit Offset, Consumer A]