apache kafka

34
Apache Kafka A high-throughput distributed messaging system Johan Lundahl

Upload: zanthe

Post on 25-Feb-2016

74 views

Category:

Documents


5 download

DESCRIPTION

Apache Kafka. A high-throughput distributed messaging system. Johan Lundahl. Agenda. Kafka overview Main concepts and comparisons to other messaging systems Features, strengths and tradeoffs Message format and broker concepts Partitioning, Keyed messages, Replication - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Apache Kafka

Apache KafkaA high-throughput distributed messaging system

Johan Lundahl

Page 2: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

2

Page 3: Apache Kafka

What is Apache Kafka?• Distributed, high-throughput, pub-sub messaging system

– Fast, Scalable, Durable

• Main use cases: – log aggregation, real-time processing, monitoring, queueing

• Originally developed by LinkedIn• Implemented in Scala/Java• Top level Apache project since 2012: http://kafka.apache.org/

Page 4: Apache Kafka

4

Comparison to other messaging systems– Traditional: JMS, xxxMQ/AMQP– New gen: Kestrel, Scribe, Flume, Kafka

Kafka

Message queuesLow throughput, low latency

JMS

ActiveMQ

Qpid

RabbitMQ

Log aggregatorsHigh throughput, high latency

Kestrel

Scribe

Flume Hedwig

Batch jobs

Page 5: Apache Kafka

5

Frontend ServiceFrontend

Monitoring Stream processing

Batch processing

Data warehouse

Kafka

Producers

Broker

Consumers

Topic1Topic1

Topic2Topic3

Topic1

Topic3Topic2 Topic3

Topic3

Topic2

Topic1

Push

Pull

Kafka concepts

Page 6: Apache Kafka

Distributed model

6

Producer Producer Producer

Broker Broker Broker

Topic1 consumer group Topic2 consumer group

Partitioned Data Publication

Ordered subscription

Intra cluster replication

Producer persistence

KAFKA-156

Zookeeper

Page 7: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

7

Page 8: Apache Kafka

Performance factors• Broker doesn’t track consumer state• Everything is distributed• Zero-copy (sendfile) reads/writes• Usage of page cache backed by sequential

disk allocation• Like a distributed commit log

• Low overhead protocol• Message batching (Producer & Consumer)• Compression (End to end)• Configurable ack levels

8

From: http://queue.acm.org/detail.cfm?id=1563874

Page 9: Apache Kafka

Kafka features and strengths

• Simple model, focused on high throughput and durability• O(1) time persistence on disk• Horizontally scalable by design (broker and consumers)• Push - pull => consumer burst tolerance• Replay messages• Multiple independent subscribes per topic• Configurable batching, compression, serialization• Online upgrades

9

Page 10: Apache Kafka

Tradeoffs• Not optimized for millisecond latencies• Have not beaten CAP• Simple messaging system, no processing• Zookeeper becomes a bottleneck when using too many topics/partitions

(>>10000)• Not designed for very large payloads (full HD movie etc.)• Helps to know your data in advance

10

Page 11: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

11

Page 12: Apache Kafka

Message/Log Format

LengthVersion

ChecksumPayload

Message

Page 13: Apache Kafka

Log based queue (Simplified model)

Message1

Topic2

Message2

Message3

Message4

Message5

Message6

Message7

Producer1 Consumer2

Producer2

Consumer1Message1

Message2

Message3

Message4

Message5

Message6

Message7

Message8

Message9

Message10

Topic1

Broker

Consumer3

Consumer3

Consumer3

ConsumerGroup1 • Batching• Compression• Serialization

Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender

Page 14: Apache Kafka

Broker

Producer

Producer

Producer

Producer

Producer

Topic1

Topic2

Partitions

Partitioning

Consumer

Consumer

Consumer

ConsumerGroup2

Consumer

Consumer

Group1

Group3

ConsumerNo partition for this guy

Page 15: Apache Kafka

Keyed messages

Producer

Message1

Message5

Message9

Message13

Message17

Topic1

BrokerId=1

Message2

Message4

Message6

Message8

Message10

Message12

Message14

Message16

Message18

Topic1

BrokerId=2

Message3

Message7

Message11

Message15

Topic1

BrokerId=3

hash(key) % #partitions#partitions=3

Page 16: Apache Kafka

Intra cluster replication

Message1

Message2

Message3

Message4

Message5

Message6

Message7

Message8

Message9

Message10

Topic1 leader

Broker1

Message1

Message2

Message3

Message4

Message5

Message6

Message7

Message8

Message9

Message10

Topic1 follower

Broker2

Message1

Message2

Message3

Message4

Message5

Message6

Message7

Message8

Message9

Topic1 follower

Broker3

Producer ackackackack

Replication factor = 3

Message10

InSyncReplicas

Commit mode Latency Durability

Fire & Forget “none” Weak

Leader ack 1 roundtrip Medium

Full replication 2 roundtrips Strong

Follower fails:• Follower dropped from ISR • When follower comes online again: fetch

data from leader, then ISR gets updated

Leader fails:• Detected via Zookeeper from ISR• New leader gets elected

3 commit modes:

Page 17: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

17

Page 18: Apache Kafka

Producer API

…or for log aggregation:

Configuration parameters:ProducerType (sync/async)CompressionCodec (none/snappy/gzip)BatchSizeEnqueueSize/TimeEncoder/SerializerPartitioner#RetriesMaxMessageSize…

Page 19: Apache Kafka

Consumer API(s)• High-level (consumer group, auto-commit)• Low-level (simple consumer, manual commit)

Page 20: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

20

Page 21: Apache Kafka

Broker Protips

• Reasonable number of partitions – will affect performance• Reasonable number of topics – will affect performance• Performance decrease with larger Zookeeper ensembles• Disk flush rate settings• message.max.bytes – max accept size, should be smaller than the heap• socket.request.max.bytes – max fetch size, should be smaller than the heap• log.retention.bytes – don’t want to run out of disk space…• Keep Zookeeper logs under control for same reason as above• Kafka brokers have been tested on Linux and Solaris

Page 22: Apache Kafka

Operating Kafka

• Zookeeper usage– Producer loadbalancing– Broker ISR– Consumer tracking

• Monitoring– JMX– Audit trail/console in the making

Distribution Tools:• Controlled shutdown tool• Preferred replica leader election tool• List topic tool• Create topic tool• Add partition tool• Reassign partitions tool• MirrorMaker

Page 23: Apache Kafka

Multi-datacenter replication

23

Page 24: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

24

Page 25: Apache Kafka

Ecosystem

Producers:• Java (in standard dist)• Scala (in standard dist)• Log4j (in standard dist)• Logback: logback-kafka• Udp-kafka-bridge• Python: kafka-python• Python: pykafka• Python: samsa• Python: pykafkap• Python: brod• Go: Sarama• Go: kafka.go• C: librdkafka• C/C++: libkafka• Clojure: clj-kafka• Clojure: kafka-clj• Ruby: Poseidon• Ruby: kafka-rb• Ruby: em-kafka• PHP: kafka-php(1)• PHP: kafka-php(2)• PHP: log4php• Node.js: Prozess• Node.js: node-kafka• Node.js: franz-kafka• Erlang: erlkafka

Consumers:• Java (in standard dist)• Scala (in standard dist)• Python: kafka-python• Python: samsa• Python: brod• Go: Sarama• Go: nuance• Go: kafka.go• C/C++: libkafka• Clojure: clj-kafka• Clojure: kafka-clj• Ruby: Poseidon• Ruby: kafka-rb• Ruby: Kafkaesque• Jruby::Kafka• PHP: kafka-php(1)• PHP: kafka-php(2)• Node.js: Prozess• Node.js: node-kafka• Node.js: franz-kafka• Erlang: erlkafka• Erlang: kafka-erlang

Common integration points:Stream ProcessingStorm - A stream-processing framework.Samza - A YARN-based stream processing framework.Hadoop IntegrationCamus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.Kafka Hadoop Loader A different take on Hadoop loading functionality from what is included in the main distribution.AWS IntegrationAutomated AWS deploymentKafka->S3 MirroringLoggingklogd - A python syslog publisherklogd2 - A java syslog publisherTail2Kafka - A simple log tailing utilityFluentd plugin - Integration with FluentdFlume Kafka Plugin - Integration with FlumeRemote log viewerLogStash integration - Integration with LogStash and FluentdOfficial logstash integrationMetricsMozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging systemGanglia IntegrationPacking and DeploymentRPM packagingDebian packaginghttps://github.com/tomdz/kafka-deb-packagingPuppet integrationDropwizard packagingMisc.Kafka Mirror - An alternative to the built-in mirroring toolRuby Demo App Apache Camel IntegrationInfobright integration

Page 26: Apache Kafka

What’s in the future?• Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)• Producer side persistence (KAFKA-156/KAFKA-789)• Exact mirroring (KAFKA-658)• Quotas (KAFKA-656)• YARN integration (KAFKA-949)• RESTful proxy (KAFKA-639)• New build system? (KAFKA-855)• More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)• Client API rewrite (Proposal)• Application level security (Proposal)

Page 27: Apache Kafka

Agenda• Kafka overview

– Main concepts and comparisons to other messaging systems• Features, strengths and tradeoffs• Message format and broker concepts

– Partitioning, Keyed messages, Replication• Producer / Consumer APIs• Operation considerations• Kafka ecosystemIf time permits:• Kafka as a real-time processing backbone • Brief intro to Storm• Kafka-Storm wordcount demo

27

Page 28: Apache Kafka

Stream processingKafka as a processing pipeline backbone

Producer

Producer

Producer

Kafka topic1

Kafka topic2

Process1

Process1

Process1

Process2

Process2

Process2

System1 System2

Page 29: Apache Kafka

29

What is Storm?

• Distributed real-time computation system with design goals:– Guaranteed processing– No orphaned tasks– Horizontally scalable– Fault tolerant– Fast

• Use cases: Stream processing, DRPC, Continuous computation• 4 basic concepts: streams, spouts, bolts, topologies• In Apache incubator• Implemented in Clojure

Page 30: Apache Kafka

30

Streams

(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)

(timestamp,sessionid,exception stacktrace)

Spoutsa source of streams

(t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)

Connects to queues, logs, API calls, event data.

Some features like transactional topologies (which gives exactly-once messaging semantics) is only possible using the Kafka-TransactionalSpout-consumer

an [infinite] sequence (of tuples)

Page 31: Apache Kafka

31

Bolts

(t2,s1,h2)(t1,s1,h1)

(t3,s3)(t4,s2,e2)(t5,s4)

• Filters• Transformations• Apply functions• Aggregations• Access DB, APIs etc.• Emitting new streams• Trident = a high level abstraction on top of Storm

Page 32: Apache Kafka

32

Topologies

(t2,s1,h2)(t1,s1,h1)

(t3,s3)

(t4,s2,e2)(t5,s4)

(t6,s6)

(t7,s7)

(t8,s8)

Page 33: Apache Kafka

33

Storm cluster

Nimbus

Supervisor SupervisorSupervisorSupervisorSupervisor

Zookeeper

Topology

Deploy

(JobTracker)

Compare with Hadoop:

(TaskTrackers)

Mesos/YARN

Page 34: Apache Kafka

Links

34

Apache Kafka:Papers and presentationsMain project pageSmall Mediawiki case studyStorm:Introductory articleRealtime discussing blog postKafka+Storm for realtimeBigData Trifecta blog post: Kafka+Storm+CassandraIBM developer articleKafka+Storm@TwitterBigData Quadfecta blog post