Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn, Apr 2017


TRANSCRIPT

The Data Driven Network

Kapil Surlaker, Director of Engineering

Bridging Batch and Streaming Data Integration with Gobblin

Shirshanka Das

Gobblin team, 26th Apr 2017

Big Data Meetup

github.com/linkedin/gobblin @ApacheGobblin gitter.im/gobblin

Data Integration: key requirements

Source, Sink Diversity

Batch + Streaming

Data Quality

So, we built Gobblin

SFTP

JDBC, REST

Simplifying Data Integration @LinkedIn

Hundreds of TB per day

Thousands of datasets

~30 different source systems

80%+ of data ingest

Open source @ github.com/linkedin/gobblin

Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…

Apache incubation under way

SFTP

Azure Storage


Other Open Source Systems in this Space

Sqoop, Flume, Falcon, NiFi, Kafka Connect

Flink, Spark, Samza, Apex

Similar in pieces, dissimilar in aggregate
Most are tied to a specific execution model (batch / stream)
Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)

Gobblin: Under the Hood



Gobblin: The Logical Pipeline


WorkUnit: A logical unit of work, typically bounded but not necessarily.

Kafka Topic: LoginEvent, Partition: 10, Offsets: 10-200

HDFS Folder: /data/Login, File: part-0.avro

Hive Dataset: Tracking.Login, date-partition=mm-dd-yy-hh
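To make this concrete, here is a minimal sketch (a hypothetical class, not Gobblin's actual WorkUnit API) of the information the Kafka WorkUnit above carries:

// Hypothetical sketch of the information a Kafka WorkUnit carries (not Gobblin's real class).
public final class KafkaWorkUnitSketch {
    final String topic;      // e.g. "LoginEvent"
    final int partition;     // e.g. 10
    final long startOffset;  // e.g. 10  (low watermark)
    final long endOffset;    // e.g. 200 (high watermark; a streaming WorkUnit can leave this open-ended)

    KafkaWorkUnitSketch(String topic, int partition, long startOffset, long endOffset) {
        this.topic = topic;
        this.partition = partition;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}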


Source: A provider of WorkUnits

(typically a system like Kafka, HDFS etc.)
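As a hedged illustration of the Source role (hypothetical names and signatures, not Gobblin's interface), planning work for a Kafka topic amounts to diffing the previously committed watermarks against the latest offsets and emitting one WorkUnit per partition that still has data:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: how a Kafka-style Source could turn "previous watermarks + current
// offsets" into WorkUnits. Names and signatures are hypothetical.
public class KafkaSourceSketch {
    static List<String> getWorkUnits(String topic,
                                     Map<Integer, Long> committedOffsets,
                                     Map<Integer, Long> latestOffsets) {
        List<String> workUnits = new ArrayList<>();
        for (Map.Entry<Integer, Long> partition : latestOffsets.entrySet()) {
            long start = committedOffsets.getOrDefault(partition.getKey(), 0L);
            if (start < partition.getValue()) {  // only partitions with new data get a WorkUnit
                workUnits.add("Kafka Topic: " + topic + ", Partition: " + partition.getKey()
                        + ", Offsets: " + start + "-" + partition.getValue());
            }
        }
        return workUnits;
    }
}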


Task: A unit of execution that operates on a WorkUnit

Extracts records from the source, writes to the destination

Ends when WorkUnit is exhausted of records

(assigned to Thread in ThreadPool, Mapper in Map-Reduce etc.)
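In essence, a task is the loop sketched below. This is a simplified, hypothetical rendering of the control flow (error handling, watermark tracking, and forks omitted), not Gobblin's actual Task class:

import java.util.List;

// Minimal interfaces so the loop compiles; Gobblin's real types differ in details.
interface ExtractorSketch { Object readRecord() throws Exception; }      // null => WorkUnit exhausted
interface ConverterChainSketch { List<Object> convertRecord(Object in); } // 1:N mapping
interface RowCheckerSketch { boolean passes(Object record); }            // row-level quality check
interface WriterSketch { void write(Object record) throws Exception; void commit() throws Exception; }

public class TaskSketch {
    // Extract -> convert -> quality-check -> write, until the WorkUnit runs out of records.
    static void run(ExtractorSketch extractor, ConverterChainSketch converter,
                    RowCheckerSketch checker, WriterSketch writer) throws Exception {
        Object record;
        while ((record = extractor.readRecord()) != null) {
            for (Object converted : converter.convertRecord(record)) {
                if (checker.passes(converted)) {
                    writer.write(converted);
                }
            }
        }
        writer.commit();  // the publisher then finalizes / commits the output
    }
}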


Extractor: A provider of records given a WorkUnit

Connects to Data Source

Deserializer of records
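As a rough illustration of what an extractor does for the WorkUnit "Topic: LoginEvent, Partition: 10, Offsets: 10-200", here is plain kafka-clients code (not Gobblin's Kafka extractor): connect, seek to the low watermark, and read records until the high watermark:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Illustration only (not Gobblin's extractor): pull a bounded offset range from one partition.
public class KafkaExtractorSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("LoginEvent", 10);
            consumer.assign(Collections.singletonList(tp));  // connect to the data source
            consumer.seek(tp, 10L);                          // start at the WorkUnit's low watermark
            long nextOffset = 10L;
            while (nextOffset < 200L) {                      // stop at the WorkUnit's high watermark
                for (ConsumerRecord<String, byte[]> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (rec.offset() >= 200L) { nextOffset = 200L; break; }
                    // the deserialized record would be handed to the converter chain here
                    nextOffset = rec.offset() + 1;
                }
            }
        }
    }
}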


Converter: A 1:N mapper of input records to output records

Multiple converters can be chained

(e.g. Avro <-> JSON, schema projection, encryption)
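The 1:N contract means a converter can drop, keep, or fan out each input record. Here is a hedged sketch (hypothetical class, not Gobblin's API) of a sampling converter that keeps roughly 10% of records, in the spirit of the SamplingConverter used in the streaming spec later:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative 1:N converter: an empty list drops the record (1:0), a singleton passes it
// through (1:1), and multiple elements would fan it out (1:N). Not Gobblin's actual API.
public class SamplingConverterSketch {
    private final double sampleRatio;

    public SamplingConverterSketch(double sampleRatio) {
        this.sampleRatio = sampleRatio;  // e.g. 0.10 to keep ~10% of records
    }

    public List<byte[]> convertRecord(byte[] input) {
        return ThreadLocalRandom.current().nextDouble() < sampleRatio
                ? Collections.singletonList(input)
                : Collections.emptyList();
    }
}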


Quality Checker: Can check if the quality of the output is satisfactory

Row-level (e.g. time value check)

Task-level (e.g. audit check, schema compatibility)
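As a small hypothetical example of a row-level "time value check" (not an actual Gobblin policy class), a record could be rejected if its event timestamp is in the future or older than an allowed window:

import java.time.Duration;
import java.time.Instant;

// Hypothetical row-level quality check: reject records whose event time is in the future
// or older than the allowed window. Task-level checks (audits, schema compatibility)
// would instead run once over the whole task's output.
public class TimeValueCheckSketch {
    private final Duration maxAge;

    public TimeValueCheckSketch(Duration maxAge) {
        this.maxAge = maxAge;  // e.g. Duration.ofDays(7)
    }

    public boolean passes(long eventTimeMillis) {
        Instant eventTime = Instant.ofEpochMilli(eventTimeMillis);
        Instant now = Instant.now();
        return !eventTime.isAfter(now) && Duration.between(eventTime, now).compareTo(maxAge) <= 0;
    }
}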


Writer: Writes to the destination

Connection to the destination, Serializer of records

Sync / Async

e.g. FsWriter, KafkaWriter, CouchbaseWriter
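To illustrate the sync/async distinction with plain kafka-clients code (not Gobblin's KafkaWriter), an async writer hands each record to the producer and learns about success or failure later through a callback:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Illustration of an async writer in the spirit of a KafkaWriter (not Gobblin's code):
// the connection and serializers live in the writer, each write returns immediately,
// and the callback reports success or failure later.
public class KafkaWriterSketch implements AutoCloseable {
    private final KafkaProducer<String, byte[]> producer;
    private final String topic;

    public KafkaWriterSketch(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        this.producer = new KafkaProducer<>(props);
        this.topic = topic;
    }

    public void writeAsync(byte[] record) {
        producer.send(new ProducerRecord<>(topic, record), (metadata, exception) -> {
            if (exception != null) {
                // a real writer would count failures and retry or fail the task
                exception.printStackTrace();
            }
        });
    }

    @Override
    public void close() {
        producer.close();  // flushes outstanding async writes
    }
}

A synchronous writer would instead block on each write (for example by waiting on the future returned by send) before acknowledging the record.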


Publisher: Finalizes / Commits the data

Used for destinations that support atomicity

(e.g. move tmp staging directory to final output directory on HDFS)
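A hedged sketch of that staging-then-rename pattern using the Hadoop FileSystem API (illustrative only, not Gobblin's BaseDataPublisher):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustration of the publish step: tasks write to a staging directory, and the publisher
// makes the output visible with a single directory rename, which is atomic on HDFS.
public class PublisherSketch {
    static void publish(Configuration conf, String stagingDir, String finalDir) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path staging = new Path(stagingDir);
        Path output = new Path(finalDir);
        fs.mkdirs(output.getParent());          // make sure the parent of the final dir exists
        if (!fs.rename(staging, output)) {      // commit: move staging -> final in one step
            throw new RuntimeException("Failed to publish " + stagingDir + " to " + finalDir);
        }
    }
}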


Gobblin: The Logical Pipeline (Stateful)

State Store (HDFS, S3, MySQL, ZK, …)

Load config, previous watermarks; save watermarks

Gobblin: Pipeline Specification

# Pipeline name, description
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

# Source + configuration
source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5
wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace":"example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

# Converter
converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia

# Writer + configuration
writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

# Publisher
data.publisher.type=gobblin.publisher.BaseDataPublisher

Gobblin: Pipeline Deployment

One Spec, Multiple Environments

A single Pipeline Specification runs on Bare Metal / AWS / Azure / VM:

Small: One Box (Standalone, single instance)
Medium: Static Cluster (Standalone cluster, Hadoop YARN / MR)
Large: Elastic Cluster (AWS EC2)

Execution Model: Batch versus Streaming

Batch: Determine work, Acquire slots, Run, Checkpoint, Repeat

+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup and checkpoint costs dominate if "micro-batching"

Execution Model: Batch versus Streaming

Streaming: Determine work streams, Run continuously, Checkpoint periodically

+ Low latency

- Higher cost because it is harder to provision accurately

- More sophistication needed to deal with change

Execution Model Scorecard

(Scorecard: JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis, each marked as fitting either the Batch or the Streaming model.)

Can we run in both models using the same system?


Gobblin: The Logical Pipeline


Pipeline Stages: Start

Batch: Determine work

Streaming: Determine work - unbounded WorkUnit


Pipeline Stages: Run

Batch: Acquire slots, Run

Streaming: Run continuously, Checkpoint periodically, Shutdown gracefully

(Diagram: tasks notify the Watermark Manager, which acks them and persists watermarks to State Storage; shutdown is coordinated.)
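As a hedged sketch (hypothetical names, not Gobblin's watermark manager) of "checkpoint periodically, shut down gracefully": tasks ack watermarks as records are safely written, and a background thread persists the latest acked values to the state store on a fixed interval, with a final commit on graceful shutdown:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the streaming checkpoint loop; names are illustrative only.
public class WatermarkManagerSketch {
    private final Map<String, Long> ackedWatermarks = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called by tasks/writers once a record (and everything before it) is safely written.
    public void ack(String partition, long offset) {
        ackedWatermarks.merge(partition, offset, Math::max);
    }

    // Interval corresponds to streaming.watermark.commitIntervalMillis in the spec later.
    public void start(long commitIntervalMillis, StateStore store) {
        scheduler.scheduleAtFixedRate(
                () -> ackedWatermarks.forEach(store::persist),
                commitIntervalMillis, commitIntervalMillis, TimeUnit.MILLISECONDS);
    }

    public void shutdown(StateStore store) {
        scheduler.shutdown();                     // stop the periodic commits
        ackedWatermarks.forEach(store::persist);  // final checkpoint on graceful shutdown
    }

    public interface StateStore { void persist(String partition, long watermark); }
}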


Pipeline Stages: End

Batch: Checkpoint, Commit

Streaming: Do nothing - NoOpPublisher

Enabling Streaming mode

task.executionMode = streaming

Works across deployment modes: Standalone single instance, Standalone cluster, Hadoop (YARN / MR), AWS

A Streaming Pipeline Spec: Kafka 2 Kafka

# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

# Pipeline name, description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Converter, configuration: sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher

# Execution mode, watermark storage configuration
task.executionMode=STREAMING

# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000

Gobblin Streaming: Cluster view

Cluster of processes

Apache Helix: work-unit assignment, fault-tolerance, reassignment

(Diagram: a Cluster Master and Workers 1-3, coordinated via Helix, pulling from a Stream Source and writing to a Sink (Kafka, HDFS, …).)

Active Workstreams in Gobblin

Gobblin as a Service: A global orchestrator with a REST API for submitting logical flow specifications. Logical flow specifications compile down to physical pipeline specs.

Global Throttling: Throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode). Generic: can be used outside Gobblin.

Metadata driven: Integration with a Metadata Service (cf. WhereHows). Policy-driven replication, permissions, encryption, etc.

Roadmap

Final LinkedIn Gobblin 0.10.0 release

Apache Incubator code donation and release

More streaming runtimes: integration with Apache Samza, LinkedIn Brooklin

GDPR compliance: data purge for Hadoop and other systems

Security improvements: credential storage, secure specs


Gobblin Team @ LinkedIn
