Kafka Connect by Datio
TRANSCRIPT
Kafka Connect
Contents
1. Introduction to Kafka
2. Data Ingestion
3. Kafka Connect
1. INTRODUCTION TO KAFKA
Let’s start with the first set of slides
Data Pipeline Problem
[Diagram: frontend servers, database servers, and a chat server each wired directly to a metrics server and metrics UI through ad-hoc inter-process communication channels]
Data Pipeline Problem
[Diagram: the same servers (frontend, backend, database, chat, shopping cart) connected first through a single publish/subscribe system, then through separate metrics and logging pub/sub systems feeding the metrics UI and log search]
A publish/subscribe system vs. multiple publish/subscribe systems
Kafka Goals
[Diagram: frontend, database, and backend servers publishing through Kafka to the metrics UI, log search, and shopping cart]
✓ Decouple data pipelines
✓ Provide persistence for message data to allow multiple consumers
✓ Optimize for high throughput of messages
✓ Allow for horizontal scaling of the system to grow as the data stream grows
Kafka Architecture
[Diagram: producers and consumers connected to a Kafka cluster coordinated by ZooKeeper. Broker 1 holds Topic A / Partition 0, Broker 2 holds Topic A / Partition 1, Broker 3 holds Topics B and C.]
Disk-based retention · Scalable · High throughput
Use cases: Gold Data, Data Lake
2. DATA INGESTION
Kafka Ingestion
[Diagram: a Kafka producer writing to, and a Kafka consumer reading from, a Kafka cluster]
Use Case Requirements
DATA LOSS? EXACTLY ONCE? LATENCY? THROUGHPUT?
Synchronous send vs. asynchronous send
[Diagram: a producer record (topic, partition, key, value) sent to the Kafka broker, which replies with metadata or an exception]
Synchronous send: producer.send(record).get blocks until the broker returns the record metadata or throws an exception.
Asynchronous send: producer.send(record) returns immediately; the metadata or exception is delivered later, typically to a callback.
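The two styles differ only in when the result is consumed. A minimal stand-in illustrating the blocking get versus the callback path (FakeProducer is hypothetical, modeling the future/callback contract of send(), not the real client):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeProducer:
    """Hypothetical stand-in for a Kafka producer; models only the
    future/callback contract of send(), not real network delivery."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._offset = 0

    def _deliver(self, record):
        # Pretend the broker appended the record and assigned an offset.
        self._offset += 1
        return {"topic": record["topic"], "offset": self._offset}

    def send(self, record, callback=None):
        future = self._pool.submit(self._deliver, record)
        if callback is not None:
            # Asynchronous style: hand the metadata to a callback.
            future.add_done_callback(lambda f: callback(f.result()))
        return future

producer = FakeProducer()

# Synchronous send: block on the future, as in producer.send(record).get
metadata = producer.send({"topic": "A", "value": "v1"}).result()

# Asynchronous send: returns immediately; metadata arrives via callback.
acks = []
producer.send({"topic": "A", "value": "v2"}, callback=acks.append)
producer._pool.shutdown(wait=True)  # drain so the example is deterministic
```

The synchronous form trades throughput for an immediate error; the asynchronous form keeps the sender unblocked, which is why it is the high-throughput default.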
Producer internals
[Diagram: send() takes a producer record (topic, partition, key, value), which passes through the serializer and the partitioner and is appended to a batch for its topic partition (e.g. Topic A / Partition 0: Batch 0, Batch 1). If the send fails and retries are configured, the producer retries; otherwise it raises an exception. On success the broker commits the batch and returns metadata: topic, partition, offset.]
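The partitioner step can be sketched as follows; the real default partitioner hashes keys with murmur2 (crc32 stands in here), and keyless records are illustrated with simple round-robin:

```python
import zlib

def choose_partition(key, num_partitions, rr_counter):
    """Pick a partition for a record, mimicking the default partitioner:
    keyed records hash to a stable partition (crc32 standing in for
    murmur2); keyless records rotate round-robin via rr_counter."""
    if key is None:
        rr_counter[0] += 1
        return rr_counter[0] % num_partitions
    return zlib.crc32(key) % num_partitions
```

Because the partition is a pure function of the key, all records for one key land in the same partition and therefore stay ordered relative to each other.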
Consumer groups
[Diagram: Topic T1 with four partitions, each an append-only log of offsets (1, 2, 3, ...), consumed by Consumer Group 1. With two consumers in the group, each reads two partitions; with four consumers, each reads one; with six consumers, two sit idle, since each partition is assigned to at most one consumer within a group.]
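The assignment rule behind the diagram can be sketched as a round-robin distribution of partitions over group members (a simplification of the client's real assignor strategies):

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through the
    group; consumers beyond the partition count get nothing (idle)."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

This is why the partition count caps the useful parallelism of a group: adding a seventh consumer to a four-partition topic changes nothing.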
Apache Flume
[Diagram: a Flume agent: source, interceptors #1..#N, channel processor, channel, sink]
Reliable · Fault tolerant · Customizable · Manageable · Centralized sources
Flume + Kafka
[Diagram: Kafka as a reliable Flume channel between a Flume source and sink; Flume as a Kafka producer/consumer]
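As a sketch of the "Kafka as reliable Flume channel" setup, a Flume agent points its channel at Kafka with configuration along these lines; the agent name a1, channel name c1, broker address, and topic are placeholder values:

```properties
a1.channels = c1
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = localhost:9092
a1.channels.c1.kafka.topic = flume-channel
```

With this channel type, events staged between the Flume source and sink survive agent restarts because they live in a Kafka topic rather than in memory.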
3. KAFKA CONNECT
Kafka Connect
[Diagram: Data Source, Kafka Source Connector, Kafka cluster with Schema Registry, Kafka Sink Connector, Data Sink]
Ingestion integration
Streaming and batch
Scales to the application
Failover control
Accessible connector API
Worker Mode
[Diagram: a Connect worker runs tasks on threads and embeds a Kafka producer (source side) and a Kafka consumer (sink side); workers run in standalone or distributed mode]
A connector divides its inputs (Input 1 .. Input N) among tasks, and tasks are executed by workers.
[Diagram repeated from the producer internals slide: the worker's embedded producer serializes, partitions, batches, and retries records before the broker commits them and returns metadata]
Worker settings to ensure no data loss
request.timeout.ms=MAX_VALUE
retries=MAX_VALUE
max.in.flight.requests.per.connection=1
acks=all
max.block.ms=MAX_VALUE
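A sketch of how these land in a Connect worker properties file, assuming the worker's producer.* override mechanism: the producer. prefix passes each setting through to the embedded producer, and the MAX_VALUE constants expand to the concrete limits (int max for the first two, long max for max.block.ms):

```properties
producer.request.timeout.ms=2147483647
producer.retries=2147483647
producer.max.in.flight.requests.per.connection=1
producer.acks=all
producer.max.block.ms=9223372036854775807
```

Capping in-flight requests at 1 keeps retried batches from reordering; acks=all makes the broker wait for all in-sync replicas before acknowledging.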
[Diagram: three distributed workers, each started from the same worker config plus per-source configs, running worker tasks. Connector tasks and their partitions are balanced across the cluster: Worker 1 runs Conn 1 Task 1 (partitions 1, 2) and Conn 2 Task 2 (partitions 3, 4); Worker 2 runs Conn 1 Task 2 (partitions 3, 4) and Conn 2 Task 1 (partitions 1, 2); Worker 3 runs Conn 1 Task 3 (partitions 5, 6).]
Standalone worker: simple; a single worker runs N connectors/tasks.
Distributed worker: scalability, fault tolerance, and connectors and tasks shared across workers.
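In distributed mode, connectors are submitted over the worker's REST API rather than through local config files. A hedged example using the file source connector that ships with Kafka; the connector name, file path, topic, and worker address localhost:8083 are placeholder values for a POST to /connectors:

```json
{
  "name": "file-source-demo",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "connect-demo"
  }
}
```

The cluster then assigns the connector's tasks across workers, which is what makes the fault tolerance in the diagram above possible.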
Schema Registry
[Diagram: in the worker task, sendRecords() turns SourceRecords into ProducerRecords. The serializer registers the source schema with the Schema Registry, which stores multiple versions of the same schema under a subject (topic); only the schema id travels with the record, and the deserializer uses that id to fetch the schema back.]
REST API · Serializers · Formatters · Multiple versions of the same schema
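The schema-id mechanism can be sketched as message framing: a magic byte plus a 4-byte schema id precede the serialized payload. This mirrors the Confluent wire format; the registry lookup itself is omitted here:

```python
import struct

MAGIC_BYTE = 0  # marks registry-framed messages

def frame(schema_id, payload):
    """Prefix the serialized payload with the magic byte and schema id."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message):
    """Recover the schema id (used to fetch the schema) and the payload."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a registry-framed message")
    return schema_id, message[5:]
```

Shipping the 5-byte id instead of the full schema keeps messages small while still letting any consumer resolve the exact schema version that produced them.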