Apache Kafka - Martin Podval

Download Apache Kafka - Martin Podval

Post on 18-Jul-2015




1 download


<ul><li><p>ApacheKafka</p><p>@MartinPodval, hpsv.cz</p></li><li><p>What is Apache Kafka?</p><p>Messaging SystemDistributedPersistent and ReplicableVery fast - low latency - and scalableSimple but highly configurableBy Linkedin, open sourced under apache.org</p></li><li><p>Data Streaming</p><p>New kind of data ... User or application data (events) streams Monitoring - App, System App Logging High volume</p></li><li><p>Data Streaming Contd</p><p> you want to process Using various components Into a target form Map, reduce, shuffle Real time or batch</p></li><li><p>HP Service Virtualization Use Cases</p><p>Process of clients message streams</p><p>Real-time performance modeling</p><p>Logs aggregation</p></li><li><p>How To Solve It?</p><p>Producers and Consumers Distributed Decoupled Configurable Dynamic</p></li><li><p>Kafka Cluster</p><p>Brokers = Instances, Nodes Topics Partitions Replicas</p><p>ZK Coordination</p></li><li><p>Kafka Topics</p><p>Commit Log Immutable Ordered Sequential Offset</p></li><li><p>Kafka Topics Contd</p><p>PartitionedIndependently: Stored Produced Consumed</p><p> Scalable</p><p>Replicated On partition basis Different brokers</p><p> Fault Tolerant</p></li><li><p>What Can I Do?</p><p>producer.write(topic_id, message);</p><p>consumer.read(topic_id, offset);</p></li><li><p>I Want To Produce</p><p> java/scala client address of one or more brokers choose a topic where to produce highly configurable and tunable:</p><p> partitioner number of acks (async=0, master=1, replicas=1+?) batching, buffer size, timeouts, retries, ...</p></li><li><p>I Want To Consume</p><p>High Level API Groups abstraction</p><p> To All, To One To Some</p><p> Stream API Stores positions to support fault tolerance</p></li><li><p>I Want To Consume Contd</p><p>Low Level Java/scala client Find a leader for a topic Calculate an offset Fetches messages</p><p> Re-consume if needed</p></li><li><p>I Want To Consume Contd</p><p>Delivery Semantic: At most once At least once Exactly once</p></li><li><p>Kafka Internals - Disks</p><p> Avoid: GC Random disk </p><p>access</p></li><li><p>Kafka Internals - Disks Contd</p><p>Disks are fast ...</p><p> when properly used sequential access - read ahead, write behind rely on operating system</p><p> avoid heap, materialization and GC its more like file copy over network</p><p>Its easy with immutable topics</p></li><li><p>Kafka Internals - Replication</p><p>In Sync Replicas Replication factor on partition basis One leader + 0..n replicas Replicas are consumers</p><p> In Sync if they are not too far behind a leader Batch sync</p></li><li><p>Kafka Internals - Replication Contd</p><p>Tunable Trade-Offs Producers write method:</p><p> Not blocked, async Waits for master ACK Waits for all in-sync replicas</p><p> Consumer pulls only committed messages Servers minimum in-sync replicas</p></li><li><p>Performance</p><p> Incredible</p><p>Scales with: clients count, message size number of replicas, partitions or topics</p><p>Depends on network and disk throughput</p></li><li><p>Performance Contd</p><p>Our testing 3 nodes, master + 2 replicas 500 000 msg/s (100 bytes[]) 400 mbit/s - 1.2 gbit/s network throughput end2end latency 2-3 ms</p><p>@see http://bit.ly/1FsIR9a</p></li><li><p>Easy of Use</p><p> No installation, just run a java/scala program</p><p> Streams in files &amp; dirs Transparent zookeeper Ecosystem</p></li><li><p>Cons</p><p> Beta version Dependency on Zookeeper The way how it is written in Scala No easy way how to remove messages</p></li><li><p>Questions?</p></li></ul>


View more >