building realtim data pipelines with kafka connect and spark streaming

35
BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING Guozhang Wang Confluent

Upload: guozhang-wang

Post on 10-Feb-2017

523 views

Category:

Engineering


10 download

TRANSCRIPT

Page 1: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING

Guozhang Wang Confluent

Page 2: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

About Me: Guozhang Wang

• Engineer @ Confluent.

• Apache Kafka Committer, PMC Member.

• Before: Engineer @ LinkedIn, Kafka and Samza.

Page 3: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

What do you REALLY need for Stream Processing?

Page 4: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 5: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 6: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 7: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Data can Comes from / Goes to..

Page 8: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 9: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Real-time Data Integration:

getting data to all the right places

Page 10: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Option #1: One-off Tools

• Tools for each specific data systems

• Examples: • jdbcRDD, Cassandra-Spark connector, etc..

• Sqoop, logstash to Kafka, etc..

Page 11: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Option #2: Kitchen Sink Tools

• Generic point-to-point data copy / ETL tools

• Examples: • Enterprise application integration tools

Page 12: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Option #3: Streaming as Copying

• Use stream processing frameworks to copy data

• Examples: • Spark Streaming: MyRDDWriter (forEachPartition)

• Storm, Samza, Flink, etc..

Page 13: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Real-time Integration: E, T & L

Page 14: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 15: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Example: LinkedIn back in 2010

Page 16: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Example: LinkedIn with Kafka

Apache Kafka

Page 17: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 18: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 19: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Large-scale streaming data import/export for Kafka

Kafka Connect

Page 20: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 21: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming
Page 22: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Data Model

Page 23: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Data Model

Page 24: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Parallelism Model

Page 25: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Standalone Execution

Page 26: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 27: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 28: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 29: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Delivery Guarantees• Offsets automatically committed and restored

• On restart: task checks offsets & rewinds

• At least once delivery – flush data, then commit • Exactly once for connectors that support it (e.g. HDFS)

Page 30: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Format Converters• Abstract serialization agnostic to connectors

• Convert between Kafka Connect Data API (Connectors) and serialized bytes

• JSON and Avro currently supported

Page 31: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Connector Developer APIsclass Connector {

abstract void start(props);

abstract void stop();

abstract Class<? extends Task> taskClass();

abstract List<Map<…>> taskConfigs(maxTasks);

}

class Source/SinkTask {

abstract void start(props);

abstract void stop();

abstract List<SourceRecord> poll();

abstract void put(records);

abstract void commit();

}

Page 32: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Kafka Connect & Spark Streaming

Page 33: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Kafka Connect Today• Confluent open source: HDFS, JDBC

• Connector Hub: connectors.confluent.io • Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT,

Counchbase, Vertica, Cassandra, Elastic Search,

HBase, Kudu, Attunity, JustOne, Striim, Bloomberg ..

• Improved connector control (0.10.0)

Page 34: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

THANK YOU!Guozhang Wang | [email protected] | @guozhangwang

Confluent – Afternoon Break Sponsor for Spark Summit• Jay Kreps – I Heart Logs book signing and giveaway• 3:45pm – 4:15pm in Golden Gate

Kafka Training with Confluent University• Kafka Developer and Operations Courses• Visit www.confluent.io/training

Want more Kafka? • Download Confluent Platform Enterprise (incl. Kafka Connect) at

http://www.confluent.io/product• Apache Kafka 0.10 upgrade documentation at

http://docs.confluent.io/3.0.0/upgrade.html

Page 35: Building Realtim Data Pipelines with Kafka Connect and Spark Streaming

Separation of Concerns