Introduction to Apache Kafka

Post on 15-Apr-2017

Category:

Engineering

TRANSCRIPT

  • INTRO TO KAFKA

    Jim Plush, Director of Cloud Engineering, CrowdStrike.com

    Twitter: @jimplush

    http://CrowdStrike.com

  • ABOUT ME

    Jim Plush, Director of Cloud Engineering @ CrowdStrike.com

    Architect of distributed cloud services for catching bad guys

    Previously Director of Engineering at gravity.com

    personalization service, ingesting clickstream from Yahoo!, New York Times, WSJ, etc

    wrote most of the ETL workflow

    http://CrowdStrike.com

    http://gravity.com

  • ABOUT CROWDSTRIKE

    Big Data Security Company

    Near-term focus on targeted, state-sponsored attacks and attribution

    A single customer can generate 2.2 TB of machine data per day, which we process in our cloud

    Horizontally scalable, distributed infrastructure

    Uses goodies like Kafka, Cassandra, Elasticsearch, Hadoop, Scala, Go

  • Said everyone, always

    "Some people, when confronted with a problem, think 'I know, I'll use a message queue.' Now they have two problems."

  • APACHE KAFKA

    It's not so much a queue as an activity stream system

    Gains stability and speed at the cost of consumer complexity

    It's scalable by nature

    Supports data replication

    You can rewind time

    It's fast!

    Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
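
    The "rewind time" point follows from messages being addressed by their position (offset) in an append-only log rather than being removed when read. A minimal in-memory sketch of that idea, not Kafka's on-disk implementation; the class name PartitionLog and its methods are hypothetical:

```python
class PartitionLog:
    """Toy in-memory stand-in for one append-only log (hypothetical, not a Kafka API)."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset (its position in the log)."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at offset. Reading never deletes."""
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for event in ["click:home", "click:search", "click:checkout"]:
    log.append(event)

first_pass = log.read(offset=0)

# "Rewinding" is simply reading again from an earlier offset.
replay = log.read(offset=1)
```

    Because the reader chooses the offset, the log itself does not need to track who has read what; that bookkeeping moves to the consumer, which is the consumer complexity the slide mentions.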

  • APACHE KAFKA - CONS

    Consumer Complexity

    Replication is not rack-aware

    Lack of tooling/monitoring

    Still a pre-1.0 release

    Operationally, it's more manual than desired

    Requires ZooKeeper

  • BASIC CONCEPTS

    Topics - logical namespace for data (clickstream, app logs)

    Partition - physical separation of data to allow for horizontal scalability

    Consumer Groups/Offsets - where your consumer group last checkpointed in the stream

    Replica - a copy of a partition on another node, kept for availability; only one replica is the active leader
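
    The consumer group/offset concept can be sketched as a small checkpoint table keyed by group, topic, and partition. This is an illustrative in-memory model under assumed names (OffsetStore, poll), not the Kafka client API; it also commits right after reading, whereas a real consumer decides when to commit:

```python
from collections import defaultdict


class OffsetStore:
    """Hypothetical checkpoint store: (group, topic, partition) -> next offset to read."""

    def __init__(self):
        self._offsets = defaultdict(int)  # unseen keys start at offset 0

    def committed(self, group, topic, partition):
        return self._offsets[(group, topic, partition)]

    def commit(self, group, topic, partition, next_offset):
        self._offsets[(group, topic, partition)] = next_offset


def poll(log, store, group, topic, partition, batch_size):
    """Read the next batch for a group from its checkpoint, then advance the checkpoint."""
    start = store.committed(group, topic, partition)
    batch = log[start:start + batch_size]
    store.commit(group, topic, partition, start + len(batch))
    return batch


# One partition of a "clickstream" topic, modeled as a plain list.
clickstream_p0 = ["e0", "e1", "e2", "e3", "e4"]
store = OffsetStore()

etl_first = poll(clickstream_p0, store, "etl", "clickstream", 0, batch_size=2)
etl_second = poll(clickstream_p0, store, "etl", "clickstream", 0, batch_size=2)

# A second group has its own checkpoint and starts from the beginning.
audit_first = poll(clickstream_p0, store, "audit", "clickstream", 0, batch_size=3)
```

    Each group advances independently through the same data, which is what lets several downstream systems share one topic.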

  • USE CASES

    First point for data ingestion; provides back pressure for downstream systems

    Provide a data firehose for clients (with seeks)

    Friendly to Blue/Green deployment architectures

    Mirroring test data easily

    Data Center log aggregation

  • Seamless Integration with Storm

  • Data Center Aggregation

  • Serving a Firehose

    [Diagram labels: Producer, Data Stream, API Server, Customer A, Customer B]

  • Data Affinity w/ Key Partitioning

    [Diagram labels: Producer, Data Stream P0, Data Stream P1, Consumer A, Consumer B, UserIds 0-100, UserIds 101-200]
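
    Data affinity comes from mapping each message key deterministically to a partition, so all messages for one user reach the same consumer. A minimal sketch assuming a hash-modulo scheme; the function name partition_for and the use of CRC32 are illustrative assumptions, not Kafka's actual partitioner, and the slide's UserId ranges are another valid mapping:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 2

# Every event for the same user id lands on the same partition.
events = [("user-7", "login"), ("user-42", "click"), ("user-7", "logout")]
routed = [(partition_for(user, NUM_PARTITIONS), user, action) for user, action in events]
```

    Note that changing the partition count changes the mapping for most keys under this scheme, so affinity only holds while the number of partitions stays fixed.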

  • Blue/Green Deployment

    [Diagram shown in four builds. Labels: Producer, ActiveTopic, InactiveTopic, Blue Consumer, ZooKeeper, Controller. The second build adds Green Consumer; the third and fourth builds add "User: 555" next to Controller.]

  • SCALING OUT

    1 partition = 1 consumer

    1 partition needs to fit on a single machine

    Partitions = the scalability of your system from the producer and consumer side

    For high scale apps you will probably start out with 100 partitions

  • [Diagram labels: Producer, Consumer A, P0, P1, P2]

  • [Diagram labels: Producer, P0, P1, P2, Consumer A, Consumer B, Consumer C]
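
    The "1 partition = 1 consumer" rule means that, within a consumer group, each partition is read by exactly one consumer, so the partition count caps useful parallelism. A minimal sketch using round-robin assignment; the function name and the round-robin strategy are assumptions for illustration, and a real cluster may assign differently:

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = ["P0", "P1", "P2"]

# One consumer reads everything.
one_consumer = assign_partitions(partitions, ["A"])

# Three consumers: one partition each.
three_consumers = assign_partitions(partitions, ["A", "B", "C"])

# A fourth consumer gets nothing, because there are only three partitions.
four_consumers = assign_partitions(partitions, ["A", "B", "C", "D"])
```

    This is why the slide suggests starting with many partitions for high-scale applications: consumers beyond the partition count sit idle.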

  • MONITORING

    http://quantifind.com/KafkaOffsetMonitor/

  • ZOOKEEPER

    http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html

  • WE'RE HIRING!

    jim@crowdstrike.com

    @jimplush

    http://crowdstrike.com/about-us/careers

  • [Diagram labels: Producer A, Producer B, ClickStream, Partition 1, Partition 2, ZooKeeper, Partition Offsets, Commit Offset, Consumer A]