An Introduction to Apache Kafka

By Amir Sedighi @amirsedighi Data Solutions Engineer at DatisPars Nov 2014

Post on 21-Apr-2017


TRANSCRIPT

Page 1: An Introduction to Apache Kafka

By Amir Sedighi @amirsedighi

Data Solutions Engineer at DatisPars

Nov 2014

Page 3: An Introduction to Apache Kafka

3

At first data pipelining looks easy!

● It often starts with one data pipeline from a producer to a consumer.

Page 4: An Introduction to Apache Kafka

4

Reusing things looks pretty wise, too!

● Reusing the pipeline for new producers.

Page 5: An Introduction to Apache Kafka

5

We can still handle some situations!

● Reusing added producers for new consumers.

Page 6: An Introduction to Apache Kafka

6

But we can't go far!

● Eventually the solution becomes the problem!

Page 7: An Introduction to Apache Kafka

7

Additional requirements make things complicated!

● With later developments, it gets even worse!

Page 8: An Introduction to Apache Kafka

8

How to avoid this mess?

Page 9: An Introduction to Apache Kafka

9

Decoupling Data-Pipelines

Page 10: An Introduction to Apache Kafka

10

Message Delivery Semantics

● At most once

– Messages may be lost but are never redelivered.

● At least once

– Messages are never lost but may be redelivered.

● Exactly once

– This is what people actually want.
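
The difference between these semantics comes down to when the consumer commits its offset relative to processing. A minimal sketch of the idea in plain Python (an illustration only, not the Kafka client API):

```python
class Consumer:
    """Toy consumer reading an in-memory log, tracking a committed offset."""

    def __init__(self, log):
        self.log = log
        self.committed = 0  # the offset we would resume from after a crash

    def poll_at_most_once(self, handler):
        # Commit BEFORE processing: a crash during processing loses the
        # message, but it is never redelivered.
        for offset in range(self.committed, len(self.log)):
            self.committed = offset + 1
            handler(self.log[offset])

    def poll_at_least_once(self, handler):
        # Commit AFTER processing: a crash before the commit means the
        # message is redelivered on restart (possible duplicates).
        for offset in range(self.committed, len(self.log)):
            handler(self.log[offset])
            self.committed = offset + 1
```

With at-least-once, a crash after the handler runs but before the commit leaves `committed` pointing at the already-processed message, so the next poll delivers it again. Exactly-once requires making processing and commit atomic, which is why it is the hard case.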

Page 11: An Introduction to Apache Kafka

11

Apache Kafka is publish-subscribe messaging

rethought as a distributed commit log.

Page 12: An Introduction to Apache Kafka

12

Apache Kafka

● Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

– Kafka is super fast.

– Kafka is scalable.

– Kafka is durable.

– Kafka is distributed by design.


Page 15: An Introduction to Apache Kafka

15

Apache Kafka

● A single Kafka broker (server) can handle hundreds of megabytes of reads and writes per second from thousands of clients.


Page 17: An Introduction to Apache Kafka

17

Apache Kafka

● Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime.


Page 19: An Introduction to Apache Kafka

19

Apache Kafka

● Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.


Page 21: An Introduction to Apache Kafka

21

Apache Kafka

● Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Page 22: An Introduction to Apache Kafka

22

Kafka at LinkedIn

Page 23: An Introduction to Apache Kafka

23

Page 24: An Introduction to Apache Kafka

24

Kafka is a distributed, partitioned, replicated commit log service.

Page 25: An Introduction to Apache Kafka

25

Main Components

● Topic

● Producer

● Consumer

● Broker

Page 26: An Introduction to Apache Kafka

26

Topic


● Kafka maintains feeds of messages in categories called topics.

● Topics are the highest level of abstraction that Kafka provides.

Page 27: An Introduction to Apache Kafka

27

Topic

Page 28: An Introduction to Apache Kafka

28

Topic

Page 29: An Introduction to Apache Kafka

29

Topic

Page 30: An Introduction to Apache Kafka

30

Producer


● We'll call processes that publish messages to a Kafka topic producers.

Page 31: An Introduction to Apache Kafka

31

Producer

Page 32: An Introduction to Apache Kafka

32

Producer

Page 33: An Introduction to Apache Kafka

33

Producer

Page 34: An Introduction to Apache Kafka

34

Consumer


● We'll call processes that subscribe to topics and process the feed of published messages, consumers.

– Hadoop Consumer

Page 35: An Introduction to Apache Kafka

35

Consumer

Page 36: An Introduction to Apache Kafka

36

Broker


● Kafka is run as a cluster of one or more servers, each of which is called a broker.

Page 37: An Introduction to Apache Kafka

37

Broker

Page 38: An Introduction to Apache Kafka

38

Broker

Page 39: An Introduction to Apache Kafka

39

Topics

● A topic is a category or feed name to which messages are published.

● The Kafka cluster maintains a partitioned log for each topic.

Page 40: An Introduction to Apache Kafka

40

Partition

● A partition is an ordered, immutable sequence of messages that is continually appended to: a commit log.

● The messages in the partitions are each assigned a sequential id number called the offset.
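
The two bullets above can be sketched in a few lines of plain Python (an illustration, not Kafka internals): a partition is just an append-only list, and a message's offset is simply its index.

```python
class Partition:
    """Toy model of one partition: an ordered, immutable, append-only log."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Append a message and return the sequential offset it was assigned."""
        offset = len(self._log)
        self._log.append(message)
        return offset

    def read(self, offset):
        """Messages are never modified in place; consumers read by offset."""
        return self._log[offset]
```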

Page 41: An Introduction to Apache Kafka

41

Partition

Page 42: An Introduction to Apache Kafka

42

Again Topic and Partition

Page 43: An Introduction to Apache Kafka

43

Log Compaction

Page 44: An Introduction to Apache Kafka

44

Producer

● The producer is responsible for choosing which partition within the topic to assign each message to.

– Round-Robin

– Load-Balanced

– Key-Based (Semantic-Oriented)
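
The round-robin and key-based strategies above can be sketched as pluggable partitioner functions. This is plain Python for illustration; the real Kafka producer uses murmur2 hashing for keyed messages, not Python's built-in `hash`.

```python
import itertools

def round_robin_partitioner(num_partitions):
    """Spread keyless messages evenly across partitions in turn."""
    counter = itertools.cycle(range(num_partitions))
    def assign(message):
        return next(counter)
    return assign

def key_based_partitioner(num_partitions):
    """Semantic partitioning: the same key always lands on the same
    partition, which preserves per-key ordering."""
    def assign(key):
        return hash(key) % num_partitions
    return assign
```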

Page 45: An Introduction to Apache Kafka

45

Log Compaction
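
The slides show log compaction as a diagram; in words, compaction retains at least the last value written for each key and discards the earlier ones. A toy sketch of the idea (plain Python, not Kafka's actual log cleaner):

```python
def compact(log):
    """log: a list of (key, value) records in offset order.

    Keep only the most recent record per key, preserving the relative
    (offset) order of the surviving records. Real Kafka compaction also
    removes keys whose latest value is null (a "tombstone").
    """
    last_offset = {}
    for offset, (key, _value) in enumerate(log):
        last_offset[key] = offset  # later writes win
    return [log[o] for o in sorted(last_offset.values())]
```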

Page 46: An Introduction to Apache Kafka

46

What does a Kafka cluster look like?

Page 47: An Introduction to Apache Kafka

47

How does Kafka replicate a topic's partitions across the cluster?

Page 48: An Introduction to Apache Kafka

48

Logical Consumers

Page 49: An Introduction to Apache Kafka

49

What if we put jobs (processors) across the flow?

Page 50: An Introduction to Apache Kafka

50

Where to Start?

● http://kafka.apache.org/downloads.html

Page 51: An Introduction to Apache Kafka

51

Run Zookeeper

● bin/zookeeper-server-start.sh config/zookeeper.properties

Page 52: An Introduction to Apache Kafka

52

Run kafka-server

● bin/kafka-server-start.sh config/server.properties

Page 53: An Introduction to Apache Kafka

53

Create Topic

● bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

> Created topic "test".

Page 54: An Introduction to Apache Kafka

54

List all Topics

● bin/kafka-topics.sh --list --zookeeper localhost:2181

Page 55: An Introduction to Apache Kafka

55

Send some Messages with the Producer

● bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

Hello DatisPars Guys!

How is it going with you?

Page 56: An Introduction to Apache Kafka

56

Start a Consumer

● bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

Page 57: An Introduction to Apache Kafka

57

Producing ...

Page 58: An Introduction to Apache Kafka

58

Consuming

Page 59: An Introduction to Apache Kafka

59

Use Cases

● Messaging

– Kafka is comparable to traditional messaging systems such as ActiveMQ and RabbitMQ.

● Kafka provides customizable latency

● Kafka has better throughput

● Kafka is highly fault-tolerant

Page 60: An Introduction to Apache Kafka

60

Use Cases

● Log Aggregation

– Many people use Kafka as a replacement for a log aggregation solution.

– Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing.

– In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.

● Lower latency

● Easier support

Page 61: An Introduction to Apache Kafka

61

Use Cases

● Stream Processing

– Storm and Samza are popular frameworks for stream processing. They both use Kafka.

● Event Sourcing

– Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka's support for very large stored log data makes it an excellent backend for an application built in this style.

● Commit Log

– Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data.

Page 62: An Introduction to Apache Kafka

62

Message Format

/**
 * A message. The format of an N byte message is the following:
 * If magic byte is 0
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 4 byte CRC32 of the payload
 *   3. N - 5 byte payload
 * If magic byte is 1
 *   1. 1 byte "magic" identifier to allow format changes
 *   2. 1 byte "attributes" identifier to allow annotations on the message
 *      independent of the version (e.g. compression enabled, type of codec used)
 *   3. 4 byte CRC32 of the payload
 *   4. N - 6 byte payload
 */
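
The magic-0 layout quoted above is simple enough to encode by hand. A sketch using only the Python standard library (for illustration; real clients also handle the magic-1 attributes byte):

```python
import struct
import zlib

def encode_v0(payload: bytes) -> bytes:
    """Magic byte 0: 1-byte magic + 4-byte CRC32 of the payload + payload."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">BI", 0, crc) + payload

def decode_v0(message: bytes) -> bytes:
    """Verify the checksum and return the payload."""
    magic, crc = struct.unpack(">BI", message[:5])
    if magic != 0:
        raise ValueError("not a magic-0 message")
    payload = message[5:]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("corrupt payload")
    return payload
```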

Page 63: An Introduction to Apache Kafka

63

Questions?