introduction to apache kafka and real-time · pdf fileapache kafka is publish-subscribe...

44
Introduction to Apache Kafka And Real-Time ETL for Oracle DBAs and Data Analysts 1

Upload: lythien

Post on 01-Feb-2018

238 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Introduction to Apache Kafka And Real-Time ETL

for Oracle DBAs

and Data Analysts 1

Page 2: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

About Myself

• Gwen Shapira – System Architect @Confluent

• Committer @ Apache Kafka, Apache Sqoop

• Author of “Hadoop Application Architectures”, “Kafka – The Definitive Guide”

• Previously:

• Software Engineer @ Cloudera

• Oracle ACE Director

• Senior Consultants @ Pythian

• DBA @ Mercury Interactive

• Find me: • [email protected] • @gwenshap

Page 3: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

There’s a Book on That!

Page 4: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Apache Kafka is

publish-subscribe messaging

rethought as a

distributed commit log.

turned into

a stream data platform

An Optical Illusion

Page 5: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

• Redo Logs

• Pub-sub message queues

• So What is Kafka?

• Awesome use-cases for Kafka

• Data streams and real-time ETL

• Where can you learn more

We’ll talk about:

Page 6: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Transaction Log: The most crucial structure for recovery operations … store all changes made to the database as they occur.

Page 7: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Important Point

The redo log is the only reliable source of information about current state of the database.

Page 8: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Redo Log is used for

• Recover consistent state of a database

• Replicate the database (Dataguard, Streams, GoldenGate…)

If you look far enough into archive logs – you can reconstruct the entire database

Page 9: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical
Page 10: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Publish-Subcribe Messaging

Page 11: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

The Problem

We have more user requests than database connections

Page 12: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

What do we do?

1. “Your call is important to us…”

2. Show them static content

3. Distract them with funny cat photos

4. Prioritize them

5. Acknowledge them and handle the request later

6. Jump them to the head of line

7. Tell them how long the wait is

12

Page 13: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Solution – Message Queue

13

Msg 1

Msg 2

Msg 3

.

.

.

Msg N

Page 14: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

14

Application Business Layer

Application Data Layer

DataSource

JDBC Driver Connection

Pool

DataSource Interface

Message

Queue

Page 15: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Queues decouple systems: Database is protected from user load

Page 16: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Raise your hand if this sounds familiar

“My next project was to get a working Hadoop setup…

Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms. “

--Jay Kreps, Kafka PMC

Page 17: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

17

Client Source

Data Pipelines Start like this.

Page 18: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

18

Client Source

Client

Client

Client

Then we reuse them

Page 19: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

19

Client Backend

Client

Client

Client

Then we add consumers to the existing sources

Another Backend

Page 20: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

20

Client Backend

Client

Client

Client

Then it starts to look like this

Another Backend

Another Backend

Another Backend

Page 21: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

21

Client Backend

Client

Client

Client

With maybe some of this

Another Backend

Another Backend

Another Backend

Page 22: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Queues decouple systems: Adding new systems doesn’t require changing Existing systems

Page 23: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

This is where we are trying to get

23

Source System Source System Source System Source System

Kafka decouples Data Pipelines

Hadoop Security Systems Real-time

monitoring Data Warehouse

Kafka

Producers

Brokers

Consumers

Kafka decouples Data Pipelines

Page 24: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Important notes:

• Producers and Consumers dont need to know about each other

• Performance issues on Consumers dont impact Producers

• Consumers are protected from herds of Producers

• Lots of flexibility in handling load

• Messages are available for anyone – lots of new use cases, monitoring, audit, troubleshooting

http://www.slideshare.net/gwenshap/queues-pools-caches

Page 25: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

That’s nice, but what is Kafka?

Page 26: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Kafka provides a fast, distributed, highly scalable, highly available, publish-subscribe messaging system. In turn this solves part of a much harder problem: Communication and integration between components of large software systems

Click to enter confidentiality information

Page 27: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

The Basics

•Messages are organized into topics

•Producers push messages

•Consumers pull messages

•Kafka runs in a cluster. Nodes are called brokers

Page 28: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Topics, Partitions and Logs

Page 29: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Each partition is a log

Page 30: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Each Broker has many partitions

Partition 0 Partition 0

Partition 1 Partition 1

Partition 2

Partition 1

Partition 0

Partition 2 Partion 2

Page 31: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Producers load balance between partitions

Partition 0

Partition 1

Partition 2

Partition 1

Partition 0

Partition 2

Partition 0

Partition 1

Partion 2

Client

Page 32: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Producers load balance between partitions

Partition 0

Partition 1

Partition 2

Partition 1

Partition 0

Partition 2

Partition 0

Partition 1

Partion 2

Client

Page 33: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Consumers

Consumer Group Y

Consumer Group X

Consumer

Kafka Cluster

Topic

Partition A (File)

Partition B (File)

Partition C (File)

Consumer

Consumer

Consumer

Order retained with in partition

Order retained with in partition but not over

partitions O

ff S

et X

Off

Set

X

Off

Set

X

Off

Set

Y

Off

Set

Y

Off

Set

Y

Off sets are kept per consumer group

Page 34: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Kafka “Magic” – Why is it so fast?

• 250M Events per sec on one node at 3ms latency

• Scales to any number of consumers

• Stores data for set amount of time – Without tracking who read what data

• Replicates – but no need to sync to disk

• Zero-copy writes from memory / disk to network

Page 35: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

How do people use Kafka?

• As a message bus

• As a buffer for replication systems (Like AdvancedQueue in Streams)

• As reliable feed for event processing

• As a buffer for event processing

• Decouple apps from database (both OLTP and DWH)

Page 36: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

My Favorite Use Cases

• Shops consume inventory updates

• Clicking around an online shop? Your clicks go to Kafka and recommendations come back.

• Flagging credit card transactions as fraudulant

• Flagging game interactions as abuse

• Least favorite: Surge pricing in Uber

• Huge list of users at kafka.apache.org

36

Page 37: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Got it! But what about real-time ETL?

Page 38: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Remember This?

38

Source System Source System Source System Source System

Kafka decouples Data Pipelines

Hadoop Security Systems Real-time

monitoring Data Warehouse

Kafka

Producers

Brokers

Consumers

Kafka is smack in middle of all Data Pipelines

Page 39: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

If data flies into Kafka in real time Why wait 24h before pulling it into a DWH?

39

Page 40: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

40

Page 41: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Why Kafka makes real-time ETL better?

• Can integrate with any data source • RDBMS, NoSQL, Applications, web applications, logs

• Consumers can be real-time But they do not have to

• Reading and writing to/from Kafka is cheap • So this is a great place to store intermediate state

• You can fix mistakes by rereading some of the data again • Same data in same order

• Adding more pipelines / aggregations has no impact on source systems = low risk

41

Page 42: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

It is all valuable data

Raw data

Raw data Clean data

Aggregated data Clean data Enriched data

Filtered data Dash board

Report

Data scientist

Alerts OMG

Page 43: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

• However you want: • You just consume data, modify it, and produce it back

• Built into Kafka: • Kprocessor

• Kstream

• Popular choices: • Storm

• SparkStreaming

But wait, how do we process the data?

Page 44: Introduction to Apache Kafka And Real-Time · PDF fileApache Kafka is publish-subscribe messaging rethought as a distributed commit log. turned into a stream data platform An Optical

Need More Kafka?

• https://kafka.apache.org/documentation.html

• My video tutorial: http://shop.oreilly.com/product/0636920038603.do

• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/

• Our website: http://confluent.io

• Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf