Kafka: Seattle Data Science and Data Engineering Meetup

Post on 23-Jan-2018

Seattle Data Science And Data Engineering Meetup

Abhishek Goswami.

12/14/2016

abgoswam@gmail.com

https://www.linkedin.com/in/abgoswam

Table of Contents

● Introduction

○ Motivation

○ What is Kafka

○ Characteristics

○ APIs

○ Demos

● Internals

○ Logs

○ Logs in Distributed Systems

○ Design Fundamentals

○ ZooKeeper Dependency

○ Replication

○ Source Code

● Summary, Q&A

Introduction: Motivation

Data integration.

Introduction: What is Kafka?

Distributed, partitioned, replicated commit-log service

Provides the functionality of a messaging system, but with a unique design

Competitive Landscape:

● AWS Kinesis, Azure Event Hubs

Use Cases:

● Messaging

● Website Activity Tracking

● Logging

● Stream Processing
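
The "commit-log service" description can be illustrated with a minimal in-memory sketch. This is a toy model, not Kafka's implementation or client API: the class `CommitLog` and its methods `append` and `read` are hypothetical names chosen for illustration. It shows two properties that set a log apart from a traditional queue: records are only ever appended and each gets a sequential offset, and reading does not remove anything, so every consumer tracks its own position.

```python
class CommitLog:
    """Toy append-only log (illustrative only, not Kafka's API)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


log = CommitLog()
offsets = [log.append(event) for event in ["page_view", "click", "purchase"]]

# Two independent consumers, each tracking its own position in the same log.
consumer_a_position = 0
consumer_b_position = 2
batch_a = log.read(consumer_a_position)
batch_b = log.read(consumer_b_position)
```

Because reads are non-destructive, a slow consumer and a fast consumer can share the same log without affecting each other, which is what lets one log serve messaging, activity tracking, and stream processing at once.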

Introduction: Characteristics

● Scalability of a filesystem

○ High throughput

○ Many TB per server

● Guarantees of a database

○ Messages strictly ordered (within a partition)

○ All data persistent

● Distributed by default

○ Replication

○ Partitioning
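
A minimal sketch of how partitioning and ordering interact, assuming records are routed by a hash of their key. Everything here is illustrative: `NUM_PARTITIONS`, `choose_partition`, and `events_for` are hypothetical names, and CRC32 is used only as a convenient stable hash, not as the hash Kafka's own partitioner uses. The point is that all records with the same key land in the same partition, so their relative order is preserved there, while no order is implied across partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed value for illustration


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 here as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions = defaultdict(list)  # partition id -> ordered list of (key, value)

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add_to_cart"),
    ("user-2", "logout"),
    ("user-1", "checkout"),
]

# Each record is appended to exactly one partition, chosen by its key.
for key, value in events:
    partitions[choose_partition(key)].append((key, value))


def events_for(key: str) -> list:
    """Read back one key's events from the single partition that holds them."""
    return [v for k, v in partitions[choose_partition(key)] if k == key]
```

Reading back `events_for("user-1")` returns that user's events in the order they were produced, regardless of which partition the key hashed to.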

Introduction: APIs

Four core APIs:

● Producer API: allows applications to send streams of data to topics in the Kafka cluster.

● Consumer API: allows applications to read streams of data from topics in the Kafka cluster.

● Connect API: allows implementing connectors that continually pull from some source system or application into Kafka, or push from Kafka into some sink system or application.

● Streams API: allows applications to transform input streams into output streams; a generalization of batch processing to real-time data with low-latency requirements.
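
The idea of stream processing as a generalization of batch processing can be sketched with a plain Python generator. This is not the Kafka Streams API (which is a Java library); `running_word_count` is a hypothetical function for illustration. A batch job would wait for all input and emit one final result, whereas the stream version below keeps running state and emits an updated count as soon as each record arrives.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def running_word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume a stream of lines and emit (word, count) updates as they happen."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            # Emit immediately instead of waiting for the input to end.
            yield word, counts[word]


# A finite iterator stands in for an unbounded input topic.
input_stream = iter(["kafka is a log", "a log is a stream"])
updates = list(running_word_count(input_stream))
```

The output is itself a stream of updates, so it could feed another processing step in the same way.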

Introduction: Demos

Internals: Log

Internals: Logs in Distributed Systems

Internals: Design Fundamentals

Internals: ZooKeeper Dependency

Kafka requires ZooKeeper.

Kafka uses ZooKeeper to do things like:

● Cluster membership

● Electing a controller

● Topic configuration (which topics exist, which broker is the leader, etc.)
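
Cluster membership and controller election can be sketched with a toy coordinator. This is a simplified model, not the ZooKeeper API: `ToyCoordinator` and its methods are hypothetical names, and the "slot" stands in for the kind of ephemeral entry that disappears when its owner's session ends. The sketch assumes the first live broker to claim the empty slot becomes controller, and that the slot is freed when that broker fails.

```python
class ToyCoordinator:
    """Stand-in for a coordination service: tracks live brokers and one controller slot."""

    def __init__(self):
        self.live_brokers = set()
        self.controller = None  # broker id holding the controller slot, if any

    def register(self, broker_id: int) -> None:
        """A broker joins the cluster and tries to claim the controller slot."""
        self.live_brokers.add(broker_id)
        self.try_become_controller(broker_id)

    def try_become_controller(self, broker_id: int) -> bool:
        """First live broker to claim an empty slot wins; later claims fail."""
        if self.controller is None and broker_id in self.live_brokers:
            self.controller = broker_id
            return True
        return False

    def fail(self, broker_id: int) -> None:
        """A broker's session ends: drop its membership and any slot it held."""
        self.live_brokers.discard(broker_id)
        if self.controller == broker_id:
            self.controller = None
            # Remaining brokers race for the slot; lowest id first here, for determinism.
            for candidate in sorted(self.live_brokers):
                if self.try_become_controller(candidate):
                    break


coordinator = ToyCoordinator()
for broker in (1, 2, 3):
    coordinator.register(broker)
first_controller = coordinator.controller

coordinator.fail(1)
second_controller = coordinator.controller
```

In a real cluster the race among surviving brokers is not ordered by id; the sorted loop only keeps this example deterministic.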

Internals: Replication

Internals: Source Code

GitHub repo:

https://github.com/apache/kafka

Summary

Kafka solves data integration needs.

Distributed, partitioned, replicated commit-log service

Q&A

References:

1. Simplifying data pipelines with Apache Kafka

2. Learning Apache Kafka, 2nd Edition

3. https://www.tutorialspoint.com/apache_kafka/index.htm

4. https://www.infoq.com/articles/apache-kafka

5. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
