kafka. seattle data science and data engineering meetup


Upload: abhishek-goswami

Post on 23-Jan-2018


Page 1: Kafka. seattle data science and data engineering meetup

Seattle Data Science And Data Engineering Meetup

Abhishek Goswami.

12/14/2016

[email protected]

https://www.linkedin.com/in/abgoswam

Page 2: Kafka. seattle data science and data engineering meetup

Table Of Contents

Introduction

Motivation

What is Kafka

Characteristics

APIs

Demos

Internals

Logs

Logs in Distributed Systems

Design Fundamentals

ZooKeeper Dependency

Replication

Source Code

Summary, Q&A


Page 3: Kafka. seattle data science and data engineering meetup

● Introduction

○ Motivation

○ What is Kafka?

○ Characteristics

○ APIs

○ Demos

● Internals

● Summary, Q&A


Page 4: Kafka. seattle data science and data engineering meetup

Introduction: Motivation


Data integration.

Page 5: Kafka. seattle data science and data engineering meetup

Introduction: What is Kafka?

Distributed, partitioned, replicated commit-log service

Provides the functionality of a messaging system, but with a unique design

Competitive Landscape:

● AWS Kinesis, Azure Event Hubs

Use Cases:

● Messaging

● Website Activity Tracking

● Logging

● Stream Processing
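The "commit-log" idea on this slide can be illustrated with a minimal in-memory sketch. It is not Kafka's implementation: the `CommitLog` class and its method names are hypothetical, and it ignores distribution, partitioning, and replication. It only shows that records are appended in order, each gets a sequential offset, and reading does not remove anything.

```python
class CommitLog:
    """Hypothetical append-only log: each record gets a sequential offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset, without removing them."""
        return self._records[offset:offset + max_records]


log = CommitLog()
first = log.append("page_view:/home")
second = log.append("page_view:/cart")

# Reads are non-destructive, so independent readers can replay the same range.
reader_a = log.read(0)
reader_b = log.read(0)
```

Because a read never deletes a record, the same log can serve several of the use cases listed above at once, which is the main difference from a queue that hands each message to one receiver.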

Page 6: Kafka. seattle data science and data engineering meetup

Introduction: Characteristics


● Scalability of a filesystem

○ High throughput

○ Many TB per server

● Guarantees of a database

○ Messages strictly ordered within each partition

○ All data persistent

● Distributed by default

○ Replication

○ Partitioning
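How partitioning and ordering fit together can be sketched as follows. This is an illustration under stated assumptions, not Kafka's code: `Topic` and `partition_for` are hypothetical names, and CRC32 is used only as a convenient deterministic hash (Kafka's Java client hashes keys differently). The point is that records with the same key always land in the same partition, so their relative order is preserved there.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. CRC32 is a stand-in hash, not Kafka's own."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Topic:
    """Hypothetical topic: a fixed list of independent append-only partitions."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key: str, value: str):
        """Append to the partition chosen by the key; return (partition, offset)."""
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


topic = Topic(num_partitions=3)
for i in range(4):
    topic.send("user-1", f"event-{i}")
    topic.send("user-2", f"event-{i}")

# All "user-1" events sit in a single partition, in the order they were sent.
p1 = partition_for("user-1", 3)
user1_events = [value for key, value in topic.partitions[p1] if key == "user-1"]
```

Ordering is only meaningful inside one partition in this sketch: nothing relates the position of a record in one partition to a record in another.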

Page 7: Kafka. seattle data science and data engineering meetup

Introduction: APIs

Four core APIs:

● Producer API

○ Allows applications to send streams of data to topics in the Kafka cluster.

● Consumer API

○ Allows applications to read streams of data from topics in the Kafka cluster.

● Connect API

○ Allows implementing connectors that continually pull from some source system or application into Kafka, or push from Kafka into some sink system or application.

● Streams API

○ Allows applications to transform input streams into output streams. Stream processing generalizes batch processing to a real-time setting with low-latency requirements.
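The producer and consumer roles above can be sketched without a running cluster. The `Broker` class below is a hypothetical in-memory stand-in, not the Kafka client API: it models a single-partition topic and, as a simplifying assumption, advances a group's offset as part of `poll` (real consumers commit offsets as a separate step). It shows that each consumer group tracks its own position in the same stream.

```python
class Broker:
    """Hypothetical in-memory broker with single-partition topics."""

    def __init__(self):
        self.topics = {}     # topic name -> list of records
        self.committed = {}  # (group, topic) -> next offset to read

    def produce(self, topic, value):
        """Producer side: append a record to the end of the topic."""
        self.topics.setdefault(topic, []).append(value)

    def poll(self, group, topic, max_records=2):
        """Consumer side: return the next batch for this group and advance its offset."""
        start = self.committed.get((group, topic), 0)
        batch = self.topics.get(topic, [])[start:start + max_records]
        self.committed[(group, topic)] = start + len(batch)
        return batch


broker = Broker()
for order in ["order-1", "order-2", "order-3"]:
    broker.produce("orders", order)

billing_first = broker.poll("billing", "orders")   # first two records
billing_second = broker.poll("billing", "orders")  # the remaining record
audit_first = broker.poll("audit", "orders")       # a new group starts from offset 0
```

Keeping the position per group, rather than deleting records on delivery, is what lets a slow or newly added consumer read the same topic without affecting the others.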


Page 8: Kafka. seattle data science and data engineering meetup

Introduction: Demos


Page 9: Kafka. seattle data science and data engineering meetup

● Introduction

● Internals

○ Log

○ Logs in Distributed Systems

○ Design Fundamentals

○ ZooKeeper Dependency

○ Replication

○ Source Code

● Summary, Q&A


Page 10: Kafka. seattle data science and data engineering meetup

Internals: Log


Page 11: Kafka. seattle data science and data engineering meetup

Internals: Logs in Distributed Systems


Page 12: Kafka. seattle data science and data engineering meetup

Internals: Logs in Distributed Systems


Page 13: Kafka. seattle data science and data engineering meetup

Internals: Design Fundamentals


Page 14: Kafka. seattle data science and data engineering meetup

Internals: ZooKeeper Dependency

Kafka requires ZooKeeper

Kafka uses ZooKeeper to do things like:

● Cluster membership

● Electing a controller

● Topic configuration (which topics exist, who the leader is, etc.)
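Cluster membership and controller election can be sketched with "ephemeral" entries that disappear when their owner's session ends. The `FakeCoordinator` class below is a hypothetical in-memory stand-in and is not ZooKeeper's API. The sketch assumes each broker registers itself under `/brokers/ids/<id>` and that the first broker to create `/controller` becomes the controller; watches, sessions, and timeouts are left out.

```python
class FakeCoordinator:
    """Hypothetical in-memory stand-in for ephemeral nodes (not ZooKeeper's API)."""

    def __init__(self):
        self.nodes = {}  # path -> id of the broker that owns the node

    def create_ephemeral(self, path, owner):
        """Create the node if it does not exist yet; return True on success."""
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def session_expired(self, owner):
        """Drop every node owned by a broker whose session has ended."""
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}


def register_and_elect(coord, broker_ids):
    """Each broker registers itself, then tries to claim the controller node."""
    for broker_id in broker_ids:
        coord.create_ephemeral(f"/brokers/ids/{broker_id}", broker_id)
        coord.create_ephemeral("/controller", broker_id)
    return coord.nodes.get("/controller")


def live_brokers(coord):
    """Cluster membership is the set of currently registered broker nodes."""
    return sorted(o for p, o in coord.nodes.items() if p.startswith("/brokers/ids/"))


coord = FakeCoordinator()
first_controller = register_and_elect(coord, [1, 2, 3])

# Broker 1 fails: its nodes vanish and the survivors race for the controller node again.
coord.session_expired(1)
second_controller = register_and_elect(coord, [2, 3])
members_after_failure = live_brokers(coord)
```

In this sketch the winner is simply whichever broker attempts the create first, so the election needs no voting logic of its own beyond the coordinator's create-if-absent guarantee.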


Page 15: Kafka. seattle data science and data engineering meetup

Internals: Replication


Page 16: Kafka. seattle data science and data engineering meetup

Internals: Source Code

GitHub Repo

https://github.com/apache/kafka


Page 17: Kafka. seattle data science and data engineering meetup

● Introduction

● Internals

● Summary, Q&A


Page 18: Kafka. seattle data science and data engineering meetup

Summary


Kafka solves data integration needs.

Distributed, partitioned, replicated commit-log service

Page 19: Kafka. seattle data science and data engineering meetup

Q&A


References:

1. Simplifying data pipelines with Apache Kafka

2. Learning Apache Kafka, 2nd Edition

3. https://www.tutorialspoint.com/apache_kafka/index.htm

4. https://www.infoq.com/articles/apache-kafka

5. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

[email protected]

https://www.linkedin.com/in/abgoswam