Kafka
Seattle Data Science and Data Engineering Meetup
Abhishek Goswami
12/14/2016
https://www.linkedin.com/in/abgoswam
Table of Contents
● Introduction
○ Motivation
○ What is Kafka?
○ Characteristics
○ APIs
○ Demos
● Internals
○ Logs
○ Logs in Distributed Systems
○ Design Fundamentals
○ ZooKeeper Dependency
○ Replication
○ Source Code
● Summary, Q&A
● Introduction
○ Motivation
○ What is Kafka?
○ Characteristics
○ APIs
○ Demos
● Internals
● Summary, Q&A
Introduction: Motivation
Data integration.
Introduction: What is Kafka?
Distributed, partitioned, replicated commit-log service
Provides the functionality of a messaging system, but with a unique design
Competitive Landscape:
● AWS Kinesis, Azure Event Hubs
Use Cases:
● Messaging
● Website Activity Tracking
● Logging
● Stream Processing
Introduction: Characteristics
● Scalability of a filesystem
○ High throughput
○ Many TB per server
● Guarantees of a database
○ Messages strictly ordered (within a partition)
○ All data persistent
● Distributed by default
○ Replication
○ Partitioning
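As a rough illustration of the partitioning and ordering characteristics listed above, here is a toy in-memory sketch in Python. It is not Kafka's implementation: the `PartitionedLog` class and its method names are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash a real partitioner uses. The point it tries to show is that records with the same key land in the same append-only partition, so their relative order is preserved there.

```python
import zlib


class PartitionedLog:
    """Toy model of one topic split into append-only partitions (hypothetical)."""

    def __init__(self, num_partitions):
        # Each partition is an independent, ordered list of records.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic hash: the same key always maps to the same partition.
        # crc32 is an assumption made for this sketch only.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Return every record in a partition from the given offset onward."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
# Three events for the same key: same partition, increasing offsets.
placements = [log.append("user-1", f"click-{i}") for i in range(3)]
```

Ordering is only tracked per partition in this sketch; nothing relates the order of records that sit in different partitions.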
Introduction: APIs
Four core APIs:
Producer API
allows applications to send streams of data to topics in the Kafka cluster.
Consumer API
allows applications to read streams of data from topics in the Kafka cluster.
Connect API
allows implementing connectors that continually pull from some source system or application into
Kafka or push from Kafka into some sink system or application.
Streams API
allows applications to transform streams of data from input topics to output topics; a generalization of batch processing to real-time environments with low-latency requirements.
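To make the producer and consumer roles concrete, here is a minimal in-memory sketch in Python. It does not use a real Kafka client: `ToyBroker`, `ToyProducer`, `ToyConsumer`, and the `page-views` topic are all hypothetical names invented for illustration. It assumes a single-partition topic and shows one idea only: producers append to a topic, and each consumer tracks its own read position, so reading does not remove messages.

```python
class ToyBroker:
    """Hypothetical stand-in for a cluster: one append-only list per topic."""

    def __init__(self):
        self.topics = {}

    def append(self, topic, value):
        log = self.topics.setdefault(topic, [])
        log.append(value)
        return len(log) - 1  # offset of the record just written

    def fetch(self, topic, offset):
        return self.topics.get(topic, [])[offset:]


class ToyProducer:
    """Sends records to a topic."""

    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, value):
        return self.broker.append(topic, value)


class ToyConsumer:
    """Reads a topic from its own offset, independent of other consumers."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        records = self.broker.fetch(self.topic, self.offset)
        self.offset += len(records)  # remember how far this consumer has read
        return records


broker = ToyBroker()
producer = ToyProducer(broker)
producer.send("page-views", "home")
producer.send("page-views", "pricing")

first = ToyConsumer(broker, "page-views")
second = ToyConsumer(broker, "page-views")
batch_one = first.poll()      # everything written so far
producer.send("page-views", "checkout")
batch_two = first.poll()      # only the record added since the last poll
replay = second.poll()        # a new consumer still sees the full history
```

A real client adds networking, serialization, partitions, and consumer groups on top of this basic shape.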
Introduction: Demos
● Introduction
● Internals
○ Log
○ Logs in Distributed Systems
○ Design Fundamentals
○ ZooKeeper Dependency
○ Replication
○ Source Code
● Summary, Q&A
Internals: Log
Internals: Logs in Distributed Systems
Internals: Design Fundamentals
Internals: ZooKeeper Dependency
Kafka requires ZooKeeper
Kafka uses ZooKeeper for things like:
Cluster membership
Electing a controller
Topic configuration (which topics exist, who the leader is, etc.)
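The cluster-membership and controller-election roles above can be sketched with a toy coordinator in Python. This is not ZooKeeper's API or Kafka's code: `ToyCoordinator` and its methods are hypothetical, and the deterministic retry order after a failure is a simplification (in a real cluster the surviving brokers race, and whichever claims the slot first wins). The sketch assumes a single controller slot that only one live broker can hold at a time and that is freed when its holder disappears.

```python
class ToyCoordinator:
    """Hypothetical stand-in for the coordination service."""

    def __init__(self):
        self.members = set()    # brokers currently considered alive
        self.controller = None  # id of the broker holding the controller slot

    def register(self, broker_id):
        """A broker joins the cluster and tries to claim the controller slot."""
        self.members.add(broker_id)
        self.try_elect(broker_id)

    def try_elect(self, broker_id):
        # The claim only succeeds for a live broker when the slot is empty.
        if self.controller is None and broker_id in self.members:
            self.controller = broker_id
        return self.controller == broker_id

    def fail(self, broker_id):
        """A broker disappears; if it was the controller, the slot is freed."""
        self.members.discard(broker_id)
        if self.controller == broker_id:
            self.controller = None
            # Simplification: survivors retry in sorted order instead of racing.
            for candidate in sorted(self.members):
                if self.try_elect(candidate):
                    break


zk = ToyCoordinator()
for broker_id in (1, 2, 3):
    zk.register(broker_id)
first_controller = zk.controller

zk.fail(1)
second_controller = zk.controller
```

Broker 1 registers first and takes the slot; after it fails, one of the remaining brokers takes over, so the cluster always converges on a single controller.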
Internals: Replication
Internals: Source Code
GitHub repo:
https://github.com/apache/kafka
● Introduction
● Internals
● Summary, Q&A
Summary
Kafka solves data integration needs.
Distributed, partitioned, replicated commit-log service
Q&A
References:
1. Simplifying data pipelines with Apache Kafka
2. Learning Apache Kafka, 2nd Edition
3. https://www.tutorialspoint.com/apache_kafka/index.htm
4. https://www.infoq.com/articles/apache-kafka
5. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying