Developing Real-Time Data Pipelines with Apache Kafka

Post on 12-Jan-2017

  • SPRINGONE2GX WASHINGTON, DC

    Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

    Developing Real-Time Data Pipelines with Apache Kafka

    Joe Stein @allthingshadoop

  • whoami

    CEO of Elodina (http://www.elodina.net/), a big data as a service platform built on top of open source software. The Elodina platform enables customers to analyze data streams and programmatically react to the results in real time. We solve today's data analytics needs by providing the tools and support necessary to utilize open source technologies. As users, contributors and committers, Elodina also provides support for frameworks that run on Mesos, including Apache Kafka, Exhibitor (ZooKeeper), Apache Storm, Apache Cassandra and a whole lot more!

    Apache Kafka Committer & PMC Member
    LinkedIn: http://linkedin.com/in/charmalloc
    Twitter: @allthingshadoop

  • Contents

    Introduction To Kafka
      Overview
      Topics, Partitions & Segments
      Data Durability
      Replication
      Producers
      Consumers
      Performance
      Integration
      Quick Start
      Operations

    Designs
      Distributed RPC
        o Request
        o Process
        o Response
      Storage & Analytics
        o Stream
        o Transform
        o Analyze
        o Store
        o Search

  • Apache Kafka

  • Apache Kafka

    Apache Kafka was first open sourced by LinkedIn in 2011.

    Papers

    Building a Replicated Logging System with Apache Kafka
    http://www.vldb.org/pvldb/vol8/p1654-wang.pdf

    Kafka: A Distributed Messaging System for Log Processing
    http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

    Building LinkedIn's Real-time Activity Data Pipeline
    http://sites.computer.org/debull/A12june/pipeline.pdf

    The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction
    http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

    http://kafka.apache.org/

  • How Big Data Usually Starts

  • More Big Data!

  • Ah!

  • eesh

  • Kafka de-couples data pipelines

  • Distributed Replicated Log

    Read and write
      o In real time
      o As much as you want
      o As fast as your network can go
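
The read-and-write model named on this slide can be illustrated with a toy, single-process append-only log. This is only a sketch of the idea, not Kafka's implementation or client API: the `Log` class and its `append` and `read` methods are hypothetical names, and it assumes records are addressed by an integer offset that each reader tracks for itself.

```python
class Log:
    """Toy append-only log: writers append, readers pull from an offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=100):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = Log()
first = log.append("This is a message")
second = log.append("This is another message")

# Each reader supplies its own offset, so one reader never affects another.
reader_a = log.read(0)  # starts from the beginning
reader_b = log.read(1)  # starts from the second record
```

Because reading does not remove anything, a new reader can start at offset 0 at any time and see the full history that is still retained.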

  • Topics and Partitions

  • Log Segments

  • Distributed Replicated Log

  • Data Durability

  • Replication

  • Producers

  • Consumers

  • Consumer Failover
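
The diagram for this slide is not part of the transcript. As a rough sketch of the general idea behind consumer failover, the toy below reassigns a topic's partitions among the consumers of a group when one of them disappears. The `assign` function and the round-robin strategy are illustrative assumptions, not necessarily the strategy Kafka's own consumer uses.

```python
def assign(partitions, consumers):
    """Spread partitions over the sorted consumer list, round-robin."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumers in the group share the four partitions.
before = assign(partitions, ["c1", "c2"])

# c2 fails: the assignment is recomputed and c1 takes over its partitions.
after = assign(partitions, ["c1"])
```

Every partition still has exactly one owner after the failure, which is the property the failover is meant to preserve.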

  • Producer Performance

    https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

  • Consumer Performance

    http://kafka.apache.org/documentation.html#maximizingefficiency

  • Client Libraries

    Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients

      o Go (aka golang) - Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
      o Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported.
      o C - High performance C library with full protocol support.
      o Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2.
      o Clojure - Clojure DSL for the Kafka API.
      o JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation.

    Wire Protocol Developer's Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol
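
To give a feel for what a client library does underneath, here is a minimal sketch of framing a version 0 metadata request as the wire protocol guide describes it: a 4-byte size prefix, then the API key, API version, correlation id and client id, then an array of topic names, all big-endian. The helper names and the `"demo"` client id are made up for illustration, and the field layout should be checked against the guide before relying on it.

```python
import struct


def encode_string(value):
    """Protocol string: 16-bit big-endian length followed by UTF-8 bytes."""
    data = value.encode("utf-8")
    return struct.pack(">h", len(data)) + data


def encode_metadata_request(topics, correlation_id, client_id):
    """Build a size-prefixed metadata request (API key 3, version 0)."""
    api_key_metadata = 3
    api_version = 0
    body = struct.pack(">hhi", api_key_metadata, api_version, correlation_id)
    body += encode_string(client_id)
    # Arrays are a 32-bit element count followed by the elements.
    body += struct.pack(">i", len(topics))
    for topic in topics:
        body += encode_string(topic)
    # The size prefix counts the bytes that follow it.
    return struct.pack(">i", len(body)) + body


message = encode_metadata_request(["test"], correlation_id=1, client_id="demo")
```

The resulting bytes are what a client would write to a broker socket; parsing the response is the mirror image of this encoding and is omitted here.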

  • Spring Integration

    Good blog about it
    https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data-processing-pipelines-with-spring

    Kafka Integration Source
    https://github.com/spring-projects/spring-integration-kafka

    Spring XD samples
    https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source

  • Quick Start

    https://kafka.apache.org/documentation.html#quickstart

    Download the 0.8.2.2 release and un-tar it.

    > tar -xzf kafka_2.10-0.8.2.2.tgz
    > cd kafka_2.10-0.8.2.2

    (use at least four terminal windows)

    > bin/zookeeper-server-start.sh config/zookeeper.properties
    > bin/kafka-server-start.sh config/server.properties
    > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
    This is a message
    This is another message
    > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
    This is a message
    This is another message

  • Operationalizing Kafka

    https://kafka.apache.org/documentation.html#basic_ops

    Basic Kafka Operations
      o Adding and removing topics
      o Modifying topics
      o Graceful shutdown
      o Balancing leadership
      o Checking consumer position
      o Mirroring data between clusters
      o Expanding your cluster
      o Decommissioning brokers
      o Increasing replication factor

  • Running on Mesos

  • Static Partitioning

  • Scaling is manual (even if orchestrated)

  • Static failures require manual intervention

  • Application Elasticity

  • An operating system for your data center

  • Everything goes on Mesos

  • Kafka on Mesos

    https://github.com/mesos/kafka

      o Smart broker.id assignment.
      o Preservation of broker placement (through constraints and/or new features).
      o Ability to make configuration changes.
      o Rolling restarts (for things like configuration changes).
      o Scaling the cluster up and down with automatic, programmatic and manual options.
      o Smart partition assignment via constraints vis-à-vis roles, resources and attributes.

  • Kafka on Mesos

    Scheduler
      o Provides the operational automation for a Kafka cluster.
      o Manages the changes to the broker's configuration.
      o Exposes a REST API for the CLI or any other client to use.
      o Runs on Marathon for high availability.

    Executor
      o The executor interacts with the Kafka broker as an intermediary to the scheduler.

  • REST API & CLI

      o scheduler - starts the scheduler.
      o add - adds one or more brokers to the cluster.
      o update - changes resources, constraints or broker properties of one or more brokers.
      o remove - takes a broker out of the cluster.
      o start - starts a broker up.
      o stop - either performs a graceful shutdown or force kills the broker (./kafka-mesos.sh help stop).
      o rebalance - allows you to rebalance a cluster either by selecting the brokers or topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic.
      o help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}
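
What a rebalance has to produce is a replica assignment: for each partition, the list of brokers that should hold a copy. The sketch below shows one simple way to compute such a plan. It is an illustration only, not the algorithm used by kafka-mesos or by the Apache Kafka tools, and `assign_replicas` is a hypothetical helper that just staggers replicas round-robin over the chosen brokers.

```python
def assign_replicas(num_partitions, broker_ids, replication_factor):
    """Return {partition: [broker, ...]} with replicas staggered round-robin."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    brokers = sorted(broker_ids)
    plan = {}
    for partition in range(num_partitions):
        plan[partition] = [
            brokers[(partition + replica) % len(brokers)]
            for replica in range(replication_factor)
        ]
    return plan


# Four partitions over brokers 0, 1 and 2, with two copies of each partition.
plan = assign_replicas(num_partitions=4, broker_ids=[0, 1, 2], replication_factor=2)
```

Calling the function again with a different `replication_factor` or broker list yields a new plan, which mirrors how a rebalance can change the replication factor of a topic or move it onto newly added brokers.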

  • Launch 20 brokers in seconds

  • Kafka 0.9

    KIP (Kafka Improvement Proposals)
    https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

    New Consumer
    https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design

    Security
    https://cwiki.apache.org/confluence/display/KAFKA/Security
    https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888

    JIRA
    https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel

  • Distributed RPC

  • Reference Architecture

  • Questions?

    http://www.elodina.net