Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn, Apr 2017



TRANSCRIPT

Page 1: Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017

The Data Driven Network

Kapil Surlaker, Director of Engineering

Bridging Batch and Streaming Data Integration with Gobblin

Shirshanka Das

Gobblin team, 26th Apr 2017

Big Data Meetup

github.com/linkedin/gobblin @ApacheGobblin gitter.im/gobblin

Page 2

Data Integration: key requirements

Source, Sink Diversity

Batch + Streaming

Data Quality

So, we built Gobblin.

Page 3

[Diagram: Gobblin connecting diverse sources and sinks, e.g. SFTP, JDBC, REST, Azure Storage]

Simplifying Data Integration

@LinkedIn

Hundreds of TB per day

Thousands of datasets

~30 different source systems

80%+ of data ingest

Open source @ github.com/linkedin/gobblin

Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…

Apache incubation under way


Page 4

Other Open Source Systems in this Space

Sqoop, Flume, Falcon, NiFi, Kafka Connect

Flink, Spark, Samza, Apex

Similar in pieces, dissimilar in aggregate
Most are tied to a specific execution model (batch / stream)
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)

Page 5

Gobblin: Under the Hood

Page 6

Gobblin: The Logical Pipeline

Page 7

WorkUnit: A logical unit of work, typically bounded, but not necessarily.

Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200

HDFS Folder: /data/Login, File: part-0.avro

Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
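Conceptually, a WorkUnit is little more than a bag of properties that pins down exactly which slice of data a task should process. A minimal sketch in Java (hypothetical names, not the actual Gobblin WorkUnit API), using the Kafka example above:

import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a Gobblin-style WorkUnit:
// just a set of properties identifying the slice of work.
class SimpleWorkUnit {
    private final Map<String, String> props = new HashMap<>();

    void setProp(String key, String value) { props.put(key, value); }
    String getProp(String key) { return props.get(key); }

    public static void main(String[] args) {
        // The Kafka example from this slide: one topic partition, a bounded offset range.
        SimpleWorkUnit wu = new SimpleWorkUnit();
        wu.setProp("kafka.topic", "LoginEvent");
        wu.setProp("kafka.partition", "10");
        wu.setProp("kafka.offset.low", "10");
        wu.setProp("kafka.offset.high", "200"); // omit the high offset to describe an unbounded (streaming) unit
        System.out.println("Work unit covers offsets " + wu.getProp("kafka.offset.low")
            + "-" + wu.getProp("kafka.offset.high"));
    }
}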

Page 8

Source: A provider of WorkUnits

(typically a system like Kafka, HDFS etc.)
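In other words, the Source only decides how to carve the available data into WorkUnits; it does not read any records itself. A rough sketch (hypothetical class, far simpler than the real Gobblin source API) that emits one unit per Kafka partition:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a Source: given job configuration, enumerate the
// WorkUnits to process (here, one unit per partition of a Kafka topic).
class SketchKafkaSource {
    List<Map<String, String>> getWorkUnits(Map<String, String> jobConfig) {
        List<Map<String, String>> workUnits = new ArrayList<>();
        int partitions = Integer.parseInt(jobConfig.getOrDefault("kafka.partitions", "4"));
        for (int p = 0; p < partitions; p++) {
            workUnits.add(Map.of(
                "kafka.topic", jobConfig.getOrDefault("kafka.topic", "LoginEvent"),
                "kafka.partition", Integer.toString(p)));
        }
        return workUnits;
    }
}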

Page 9

Task: A unit of execution that operates on a WorkUnit

Extracts records from the source, writes to the destination

Ends when WorkUnit is exhausted of records

(assigned to a Thread in a ThreadPool, a Mapper in Map-Reduce, etc.)
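Put together, a task is essentially the loop below. This is a hedged sketch with simplified interfaces (not the actual Gobblin classes): extract, convert, quality-check, and write until the WorkUnit runs out of records.

// Hypothetical interfaces standing in for Gobblin's constructs.
interface SketchExtractor<R>  { R readRecord(); /* null when the work unit is exhausted */ }
interface SketchConverter<R>  { Iterable<R> convert(R in); }
interface SketchRowChecker<R> { boolean passes(R record); }
interface SketchWriter<R>     { void write(R record); }

class SketchTask<R> {
    void run(SketchExtractor<R> extractor, SketchConverter<R> converter,
             SketchRowChecker<R> checker, SketchWriter<R> writer) {
        R record;
        while ((record = extractor.readRecord()) != null) { // ends when the work unit is exhausted
            for (R out : converter.convert(record)) {       // 1:N conversion
                if (checker.passes(out)) {                  // row-level quality check
                    writer.write(out);
                }
            }
        }
    }
}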

Page 10

Extractor: A provider of records given a WorkUnit

Connects to Data Source

Deserializer of records
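As a concrete illustration, here is a hedged sketch of an extractor over the HDFS-folder example from the WorkUnit slide. The class and constructor are hypothetical; a real extractor speaks the source's own protocol (Kafka consumer, JDBC cursor, REST client, etc.).

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical extractor: open the file named in the work unit and
// deserialize one record per call.
class SketchFileExtractor implements AutoCloseable {
    private final BufferedReader reader;

    SketchFileExtractor(String pathFromWorkUnit) throws IOException {
        // In Gobblin this would be a connection to the real data source.
        this.reader = Files.newBufferedReader(Paths.get(pathFromWorkUnit));
    }

    String readRecord() throws IOException {
        return reader.readLine(); // null once the work unit is exhausted
    }

    @Override public void close() throws IOException { reader.close(); }
}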

Page 11

Converter: A 1:N mapper of input records to output records

Multiple converters can be chained

(e.g. Avro <-> JSON, schema projection, encryption)
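A hedged sketch of the 1:N contract (hypothetical interface, not the real Gobblin converter API). Each stage in a chain consumes the previous stage's output; a schema-projection or encryption converter would have the same shape, usually emitting exactly one output record per input.

import java.util.List;

// Hypothetical 1:N converter contract.
interface OneToNConverter<I, O> {
    List<O> convert(I record);
}

// Example: split a comma-separated input record into several output records.
class SplitOnCommaConverter implements OneToNConverter<String, String> {
    @Override
    public List<String> convert(String record) {
        return List.of(record.trim().split(","));
    }
}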

Page 12

Quality Checker: Checks whether the quality of the output is satisfactory

Row-level (e.g. time value check)

Task-level (e.g. audit check, schema compatibility)
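For illustration, a hedged sketch of a row-level policy like the time-value check above (hypothetical class, not Gobblin's policy API). Task-level policies run once per task over aggregate statistics, e.g. comparing row counts against an audit service or checking schema compatibility, before the output is allowed to publish.

import java.time.Duration;
import java.time.Instant;

// Hypothetical row-level check: reject records whose timestamp is in the
// future or older than a configured maximum age.
class RecentTimestampPolicy {
    private final Duration maxAge;

    RecentTimestampPolicy(Duration maxAge) { this.maxAge = maxAge; }

    boolean passes(long recordEpochMillis) {
        Instant ts = Instant.ofEpochMilli(recordEpochMillis);
        Instant now = Instant.now();
        return !ts.isAfter(now) && Duration.between(ts, now).compareTo(maxAge) <= 0;
    }
}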

Page 13

Writer: Writes to the destination

Connection to the destination, Serializer of records

Sync / Async

e.g. FsWriter, KafkaWriter, CouchbaseWriter
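The sync/async distinction comes down to whether a write blocks for the destination's acknowledgement or returns a future immediately, the way a Kafka producer does. A hedged sketch (hypothetical class, not the Gobblin writer API):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical writer showing both styles over the same underlying send.
class SketchAsyncWriter {
    private final ExecutorService ioPool = Executors.newSingleThreadExecutor();

    void writeSync(String record) {
        sendToDestination(record); // blocks until the destination acknowledges
    }

    CompletableFuture<Void> writeAsync(String record) {
        return CompletableFuture.runAsync(() -> sendToDestination(record), ioPool);
    }

    private void sendToDestination(String record) {
        // Serialize and hand the record to the sink (HDFS, Kafka, Couchbase, ...).
        System.out.println("wrote: " + record);
    }
}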

Page 14

Publisher: Finalizes / Commits the data

Used for destinations that support atomicity

(e.g. move the tmp staging directory to the final output directory on HDFS)
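A hedged sketch of that commit step for a filesystem destination (hypothetical class and paths; on HDFS the same idea is a directory rename through the Hadoop FileSystem API):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical publisher: tasks write into a staging directory, and the
// publisher commits the whole output with one atomic move, so readers
// never observe half-written data.
class SketchFsPublisher {
    void publish(String jobOutputName) throws IOException {
        Path staging = Paths.get("/tmp/gobblin-staging", jobOutputName);
        Path finalDir = Paths.get("/data/output", jobOutputName);
        Files.createDirectories(finalDir.getParent());
        Files.move(staging, finalDir, StandardCopyOption.ATOMIC_MOVE); // the commit point
    }
}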

Page 15

Gobblin: The Logical Pipeline

Page 16

State Store (HDFS, S3, MySQL, ZK, …)

Load config, previous watermarks

Save watermarks

Gobblin: The Logical Pipeline (Stateful)
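A hedged, in-memory sketch of that load/save cycle (hypothetical class; the real state store is pluggable across HDFS, S3, MySQL, ZooKeeper, and more):

import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the state store: load the last committed watermark
// before a run, save the new watermark afterwards so the next run can resume.
class SketchStateStore {
    private final Map<String, Long> watermarks = new HashMap<>();

    long loadWatermark(String datasetKey) {
        return watermarks.getOrDefault(datasetKey, 0L); // 0 = start from the beginning
    }

    void saveWatermark(String datasetKey, long newWatermark) {
        watermarks.put(datasetKey, newWatermark);
    }

    public static void main(String[] args) {
        SketchStateStore store = new SketchStateStore();
        String key = "kafka:LoginEvent:partition-10";
        long from = store.loadWatermark(key);   // e.g. last committed offset
        long processedUpTo = from + 190;        // ... the pipeline runs here ...
        store.saveWatermark(key, processedUpTo);
        System.out.println("next run starts at offset " + store.loadWatermark(key));
    }
}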

Page 17

Gobblin: Pipeline Specification

Page 18

Gobblin: Pipeline Specification

job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5

wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace":"example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

converter.classes=gobblin.example.wikipedia.WikipediaConverter

Pipeline Name, Description

Source + configuration

Page 19

source.revisions.cnt=5

wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace":"example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

converter.classes=gobblin.example.wikipedia.WikipediaConverter

extract.namespace=gobblin.example.wikipedia

writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

data.publisher.type=gobblin.publisher.BaseDataPublisher

Gobblin: Pipeline Specification

Converter

Writer + configuration

Page 20

converter.classes=gobblin.example.wikipedia.WikipediaConverter

extract.namespace=gobblin.example.wikipedia

writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

data.publisher.type=gobblin.publisher.BaseDataPublisher

Gobblin: Pipeline Specification

Publisher

Page 21

Gobblin: Pipeline Deployment

Bare Metal / AWS / Azure / VM

Small: One Box (Standalone: Single Instance)
Medium: Static Cluster (Standalone Cluster, Hadoop YARN / MR)
Large: Elastic Cluster (AWS EC2)

One Pipeline Specification, Multiple Environments

Page 22

Execution Model: Batch versus Streaming

Batch: Determine work, acquire slots, run, checkpoint, repeat

+ Cost-efficient, deterministic, repeatable
- Higher latency

- Setup and checkpoint costs dominate if "micro-batching"

Page 23

Execution Model: Batch versus Streaming

Streaming: Determine work streams, run continuously, checkpoint periodically

+ Low latency

- Higher cost because it is harder to provision accurately

- More sophistication needed to deal with change

Page 24

Execution Model Scorecard

[Scorecard: JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis, each marked as suited to Batch and/or Streaming execution]

Page 25

Can we run in both models using the same system?

Page 26

Gobblin: The Logical Pipeline

Page 27

Batch: Determine work

Streaming: Determine work

- unbounded WorkUnit

Pipeline Stages: Start

Page 28

Batch: Acquire slots, run

Streaming: Run continuously

Checkpoint periodically

Shutdown gracefully

Pipeline Stages: Run

[Diagram: tasks notify the Watermark Manager, which acks and persists to State Storage; shutdown signal]
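A hedged sketch of that streaming run loop (hypothetical class, simplified from the watermark-manager and state-storage interaction in the diagram): process records continuously, commit watermarks on a timer, and commit once more on graceful shutdown.

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical streaming task loop with periodic checkpointing.
class SketchStreamingTask {
    private final AtomicBoolean shutdownRequested = new AtomicBoolean(false);
    private final AtomicLong lastProcessedOffset = new AtomicLong(0);

    void run(long commitIntervalMillis) throws InterruptedException {
        long lastCommit = System.currentTimeMillis();
        while (!shutdownRequested.get()) {
            lastProcessedOffset.incrementAndGet(); // stand-in for extract/convert/write of one record
            Thread.sleep(1);                       // pacing for the sketch only
            if (System.currentTimeMillis() - lastCommit >= commitIntervalMillis) {
                commitWatermark();                 // periodic checkpoint
                lastCommit = System.currentTimeMillis();
            }
        }
        commitWatermark();                         // graceful shutdown: one final checkpoint
    }

    void requestShutdown() { shutdownRequested.set(true); }

    private void commitWatermark() {
        // In Gobblin this acknowledgement flows through the watermark manager to state storage.
        System.out.println("committed watermark " + lastProcessedOffset.get());
    }
}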

Page 29

Batch: Checkpoint, commit

Streaming: Do nothing

- NoOpPublisher

Pipeline Stages: End

Page 30

Enabling Streaming mode

task.executionMode = streaming

[Deployment environments: Standalone (Single Instance), Standalone Cluster, Hadoop (YARN / MR), AWS]

Page 31

A Streaming Pipeline Spec: Kafka 2 Kafka

# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic

job.lock.enabled=false

source.class=gobblin.source….KafkaSimpleStreamingSource

Pipeline Name, Description

Page 32

job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic

job.lock.enabled=false

source.class=gobblin.source….KafkaSimpleStreamingSource

gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter

Source, configuration

A Streaming Pipeline Spec: Kafka 2 Kafka

Page 33

gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

data.publisher.type=gobblin.publisher.NoopPublisher

A Streaming Pipeline Spec: Kafka 2 Kafka

Converter, configuration

Page 34

# Sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

data.publisher.type=gobblin.publisher.NoopPublisher

task.executionMode=STREAMING

A Streaming Pipeline Spec: Kafka 2 Kafka

Writer, configuration

Publisher

Page 35

data.publisher.type=gobblin.publisher.NoopPublisher

task.executionMode=STREAMING

# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000

A Streaming Pipeline Spec: Kafka 2 Kafka

Execution Mode, watermark storage configuration

Page 36

Gobblin Streaming: Cluster view

Cluster of processes

Apache Helix: work-unit assignment, fault-tolerance, reassignment

[Diagram: a Cluster Master coordinates Worker 1, 2, 3 via Helix; workers read from the Stream Source and write to the Sink (Kafka, HDFS, …)]

Page 37

Active Workstreams in Gobblin

Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications; logical flow specifications compile down to physical pipeline specs

Global Throttling: ensures Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode load); generic, so it can be used outside Gobblin

Metadata driven: integration with a metadata service (cf. WhereHows); policy-driven replication, permissions, encryption, etc.

Page 38

Roadmap

Final LinkedIn Gobblin 0.10.0 release

Apache Incubator code donation and release

More streaming runtimes: integration with Apache Samza and LinkedIn Brooklin

GDPR compliance: data purge for Hadoop and other systems

Security improvements: credential storage, secure specs

Page 39

Gobblin Team @ LinkedIn