Gobblin: Unifying Data Ingestion for Hadoop


TRANSCRIPT

Page 1: Gobblin: Unifying Data Ingestion for Hadoop


GOBBLIN: UNIFYING DATA INGESTION FOR HADOOP

Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker, Shirshanka Das, Chavdar Botev

Data Analytics Infrastructure @ LinkedIn

Page 2: Gobblin: Unifying Data Ingestion for Hadoop


Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Detail
• Gobblin in Production @ LinkedIn
• Future Work
• Q&A

Page 3: Gobblin: Unifying Data Ingestion for Hadoop


Data Ingestion Challenges @ LinkedIn

BIG engineering and operational COST!

[Diagram: Data Sources · Data Types · Operational Pain]

Page 4: Gobblin: Unifying Data Ingestion for Hadoop


Pre-Gobblin Era

[Diagram: every data source is served by its own dedicated pipeline (Pipeline #1 through #n)]
• OLTP: databases (Oracle/Espresso), snapshot and delta file dumps, Databus changes
• Tracking: Kafka
• External partner data: REST, JDBC, SOAP, ...

Page 5: Gobblin: Unifying Data Ingestion for Hadoop


The Gobblin Era

[Diagram: the same sources now flow into Hadoop through a single ingestion framework, Gobblin]
• OLTP: databases (Oracle/Espresso), snapshot and delta file dumps, Databus changes
• Tracking: Kafka
• External partner data: REST, JDBC, SOAP, ...

Page 6: Gobblin: Unifying Data Ingestion for Hadoop


Requirements
• Multi-platform Support and Scalability
• Rich Source Integration
• Centralized State Management
• Operability
• Extensibility
• Self Service

Page 7: Gobblin: Unifying Data Ingestion for Hadoop


Architecture Overview

Constructs for Building Ingestion Flows
• Source → Extractor → Converter → Quality Checker → Writer → Publisher
• WorkUnit / Task

Execution Runtime
• Job Scheduler, Job Launcher, Task Executor, Task State Tracker
• Services: state store, compaction, retention management, monitoring

Deployment Mode
• Standalone, Hadoop MR, Yarn
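To make the flow of these constructs concrete, here is a minimal sketch of how a task chains them together. The interfaces below are deliberately simplified stand-ins for illustration, not the actual Gobblin API.

```java
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the constructs named on this slide (not the real Gobblin interfaces).
class WorkUnit { /* describes one unit of work, e.g. a Kafka partition or a watermark range */ }
interface Source<D>         { List<WorkUnit> getWorkUnits(); Extractor<D> getExtractor(WorkUnit wu); }
interface Extractor<D>      { Iterator<D> readRecords(); }
interface Converter<D, O>   { O convert(D record); }
interface QualityChecker<O> { boolean passes(O record); }
interface RecordWriter<O>   { void write(O record); void commit(); }
interface Publisher         { void publish(List<WorkUnit> completedWorkUnits); }

// A Task runs one WorkUnit end to end: extract -> convert -> quality-check -> write.
class Task<D, O> implements Runnable {
    private final Extractor<D> extractor;
    private final Converter<D, O> converter;
    private final QualityChecker<O> checker;
    private final RecordWriter<O> writer;

    Task(Extractor<D> extractor, Converter<D, O> converter,
         QualityChecker<O> checker, RecordWriter<O> writer) {
        this.extractor = extractor;
        this.converter = converter;
        this.checker = checker;
        this.writer = writer;
    }

    @Override
    public void run() {
        Iterator<D> records = extractor.readRecords();
        while (records.hasNext()) {
            O converted = converter.convert(records.next());
            if (checker.passes(converted)) {   // record-level quality check
                writer.write(converted);
            }
        }
        writer.commit();  // data stays staged until the Publisher moves it to its final location
    }
}
```

In this picture, the Job Launcher asks the Source for WorkUnits, hands each to a Task run by the Task Executor, and the Publisher commits the output once the Task State Tracker reports the tasks complete.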

Page 8: Gobblin: Unifying Data Ingestion for Hadoop


Case Study: Kafka Ingestion

KafkaAvroSource creates WorkUnits from topic partitions:
• WorkUnit 1 (Topic 1, Partition 1)
• WorkUnit 2 (Topic 1, Partition 2)
• WorkUnit 3 (Topic 1, Partitions 1 & 2)

Each WorkUnit runs the same task chain:
KafkaAvroExtractor → KafkaConverter → AuditCountQualityChecker → TimePartitionedAvroWriter (Avro)

TimePartitionedDataPublisher publishes hourly output to
/kafka/topic/hourly/yyyy/mm/dd/hh/*.avro
which Compaction rolls up into
/kafka/topic/daily/yyyy/mm/dd/*.avro
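As a small illustration of the time-partitioned layout above, the sketch below derives the hourly output directory from a record's timestamp. The helper class and method names are hypothetical, not part of Gobblin.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a record's event time to the hourly directory layout shown above.
class TimePartitionedPaths {
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    static String hourlyPath(String topic, long eventTimestampMillis) {
        return String.format("/kafka/%s/hourly/%s",
                topic, HOURLY.format(Instant.ofEpochMilli(eventTimestampMillis)));
    }
}
```

For example, hourlyPath("topic", ts) yields something like /kafka/topic/hourly/2015/09/05/12, which the daily compaction job later rolls up under /kafka/topic/daily/2015/09/05.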

Page 9: Gobblin: Unifying Data Ingestion for Hadoop


Case Study: Database Ingestion

JdbcSource creates WorkUnits, each covering a watermark range, e.g. [2015090512, 2015090514).

Each WorkUnit runs the same task chain:
JdbcExtractor → ToAvroConverter (row) → SchemaCompatibility & Count Quality Checker → SnapshotAvroWriter

SnapshotDataPublisher publishes incremental output to
/database/table/incremental/snapshot-ts/*.avro
which Compaction rolls up into
/database/table/full/snapshot-ts/*.avro
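The watermark ranges above suggest how the work is split. The sketch below partitions a [low, high) watermark in yyyyMMddHH form into per-WorkUnit sub-ranges; the partitioning logic is an illustrative assumption, not the actual JdbcSource implementation.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Illustrative splitting of a [low, high) watermark range (format yyyyMMddHH) into WorkUnits.
class WatermarkPartitioner {
    private static final DateTimeFormatter WM = DateTimeFormatter.ofPattern("yyyyMMddHH");

    static List<long[]> partition(long lowWatermark, long highWatermark, int hoursPerWorkUnit) {
        List<long[]> ranges = new ArrayList<>();
        LocalDateTime t = parse(lowWatermark);
        LocalDateTime high = parse(highWatermark);
        while (t.isBefore(high)) {
            LocalDateTime next = t.plusHours(hoursPerWorkUnit);
            if (next.isAfter(high)) {
                next = high;
            }
            ranges.add(new long[] { Long.parseLong(WM.format(t)), Long.parseLong(WM.format(next)) });
            t = next;
        }
        return ranges;
    }

    private static LocalDateTime parse(long watermark) {
        String s = Long.toString(watermark);               // yyyyMMddHH
        return LocalDateTime.of(
                Integer.parseInt(s.substring(0, 4)),        // year
                Integer.parseInt(s.substring(4, 6)),        // month
                Integer.parseInt(s.substring(6, 8)),        // day
                Integer.parseInt(s.substring(8, 10)),       // hour
                0);                                         // minute
    }
}
```

For instance, partition(2015090512L, 2015090518L, 2) yields [2015090512, 2015090514), [2015090514, 2015090516), [2015090516, 2015090518), matching the two-hour WorkUnit shown on this slide.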

Page 10: Gobblin: Unifying Data Ingestion for Hadoop


Case Study – Filtering Sensitive Data

Source → Extractor → WorkUnit → Converter and Quality Checker → Fork and Branching:
• Has sensitive data? No → Writer → DataPublisher
• Has sensitive data? Yes → Sensitive Data Filtering Converter → Writer → DataPublisher
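A minimal sketch of the fork step in this flow, assuming a hypothetical two-branch operator that inspects each record for sensitive fields; the real Gobblin fork operator interface differs in detail.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical fork operator: routes each record either to the pass-through branch or to the
// filtering branch, where the Sensitive Data Filtering Converter strips sensitive fields.
class SensitiveDataFork {
    static final int BRANCH_PASS_THROUGH = 0;  // no sensitive data: write as-is
    static final int BRANCH_FILTERED     = 1;  // sensitive data: filter before writing

    private final Set<String> sensitiveFields;

    SensitiveDataFork(Set<String> sensitiveFields) {
        this.sensitiveFields = sensitiveFields;
    }

    int forkRecord(Map<String, Object> record) {
        for (String field : sensitiveFields) {
            if (record.get(field) != null) {
                return BRANCH_FILTERED;
            }
        }
        return BRANCH_PASS_THROUGH;
    }
}
```

Each branch then gets its own Writer, and the DataPublisher publishes both outputs.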

Page 11: Gobblin: Unifying Data Ingestion for Hadoop


Data Quality Checking

Record-level policies run at the Writer; records that fail are quarantined.
Task-level policies run at the Publisher; a failure fails the task.

Quality Checkers
- Per record or per task
- Policy driven
- Composable

Example policies:
~ Schema compatibility
~ Audit check
~ Sensitive fields
~ Required fields
~ Unique key
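A minimal sketch of the "policy driven, composable" idea, assuming a simplified Policy interface; Gobblin's actual policy classes are richer.

```java
import java.util.List;

// Simplified composable quality checking: the checker passes only if every configured policy passes.
interface Policy<T> {
    enum Result { PASS, FAIL }
    Result check(T subject);  // subject is a record (record-level) or a task's state (task-level)
}

class CompositePolicyChecker<T> {
    private final List<Policy<T>> policies;

    CompositePolicyChecker(List<Policy<T>> policies) {
        this.policies = policies;
    }

    boolean passesAll(T subject) {
        return policies.stream().allMatch(p -> p.check(subject) == Policy.Result.PASS);
    }
}
```

A record-level checker composed of, say, required-fields and sensitive-fields policies would quarantine any record failing either; a task-level checker composed of schema-compatibility and audit-count policies would fail the task before its data is published.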

Page 12: Gobblin: Unifying Data Ingestion for Hadoop


State and Metadata Mgmt.

State Store
- Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks)
  ~ Carried over between job runs
- Default impl: serializes job/task states into files, one per run.
- Allows other implementations that conform to the interface to be plugged in.

[Example diagram: successive job runs (#1, #2, #3) read the previous run's watermark (e.g., SEP 2) from the State Store and write an updated one (e.g., SEP 3) for the next run.]
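A minimal sketch of how a job run carries its watermark forward through the state store, assuming a simplified StateStore interface; as noted above, the default Gobblin implementation persists this state as files, one per run.

```java
import java.io.IOException;
import java.util.Optional;

// Simplified state-store interface; real implementations are pluggable (files by default).
interface StateStore {
    Optional<Long> getLatestWatermark(String jobName) throws IOException;  // from the previous run
    void putWatermark(String jobName, long watermark) throws IOException;  // checkpoint for the next run
}

class IncrementalJob {
    void runOnce(StateStore stateStore, String jobName) throws IOException {
        // Resume from the last committed watermark; fall back to 0 on the very first run.
        long lowWatermark = stateStore.getLatestWatermark(jobName).orElse(0L);
        long highWatermark = System.currentTimeMillis();

        // ... extract, convert, quality-check, write and publish records in [lowWatermark, highWatermark) ...

        // Commit the new watermark only after the data has been successfully published.
        stateStore.putWatermark(jobName, highWatermark);
    }
}
```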

Page 13: Gobblin: Unifying Data Ingestion for Hadoop


Metrics / Events and Alerting

Metrics / Events Collection and Reporting
- Metrics for ingestion progress
  ~ supports tagging
  ~ real-time aggregation
- Events for major milestones
  ~ "fire-and-forget"
- Various built-in metric / event reporters

[Diagram: a hierarchy of MetricContexts (Kafka → Topic 1 / Topic 2 → Partition 1 / Partition 2); per-partition counts (6 and 6) roll up to the topic level (12 and 8) and to the Kafka level (20), and MetricReporter / EventReporter emit the results.]
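A minimal sketch of the hierarchical MetricContext rollup in the diagram, where per-partition counts aggregate up to the topic and source levels; this is an illustrative simplification, not Gobblin's actual metrics API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative hierarchical metric context: updates in a child roll up to every ancestor,
// so totals can be read in real time at the partition, topic, and source levels.
class MetricContext {
    private final String name;
    private final MetricContext parent;  // null for the root context (e.g., "Kafka")
    private final AtomicLong records = new AtomicLong();

    MetricContext(String name, MetricContext parent) {
        this.name = name;
        this.parent = parent;
    }

    void incrementRecords(long n) {
        records.addAndGet(n);
        if (parent != null) {
            parent.incrementRecords(n);  // propagate to the topic and source levels
        }
    }

    long getRecords() {
        return records.get();
    }
}
```

Mirroring the diagram: two partition contexts under Topic 1 each reporting 6 records give Topic 1 a total of 12, and together with Topic 2's 8 the Kafka root context reads 20.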

Page 14: Gobblin: Unifying Data Ingestion for Hadoop


Running Modes

Standalone

Runs in a single JVM; tasks run in a thread pool.

Scale-out with MapReduce

Each job run launches an MR job, using mappers as containers to run tasks.

Scale-out with a General Distributed Resource Manager (YARN)*

Supports long-running continuous ingestion, with better resource utilization and SLA guarantees.

*in progress
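A minimal sketch of the standalone mode described above: one JVM, with each task submitted to a fixed-size thread pool. Class and parameter names are illustrative.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative standalone launcher: one task per WorkUnit, all running in a single JVM's thread pool.
class StandaloneJobLauncher {
    void launch(List<Runnable> tasks, int threadPoolSize) throws InterruptedException {
        ExecutorService taskExecutor = Executors.newFixedThreadPool(threadPoolSize);
        tasks.forEach(taskExecutor::submit);               // each Runnable processes one WorkUnit
        taskExecutor.shutdown();                           // no new tasks; let submitted ones finish
        taskExecutor.awaitTermination(1, TimeUnit.HOURS);  // wait for the job run to complete
    }
}
```

In the MapReduce mode the same tasks run inside mapper containers instead, and the YARN mode keeps containers alive for continuous ingestion.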

Page 15: Gobblin: Unifying Data Ingestion for Hadoop


Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
  – Internal sources → HDFS: Kafka, MySQL, Dropbox, etc.
  – External sources → HDFS: Salesforce, Google Analytics, S3, etc.
  – HDFS → HDFS: closed member data purging
  – Egress from HDFS (future work)
• Data volume: over a dozen data sources, thousands of datasets, tens of TBs, ... daily.

Page 16: Gobblin: Unifying Data Ingestion for Hadoop


Future Work
• Gobblin on Yarn (alpha release)
• Real-time elastic ingestion
• Integration with
  – Apache Sqoop: using Sqoop connectors
  – Logstash: log ingestion
  – Morphlines: using Morphline transformation
  – Apache Spark

Page 17: Gobblin: Unifying Data Ingestion for Hadoop


Conclusions
• Pain of maintaining multiple ingestion pipelines: Gobblin to the rescue!
• Data quality assurance and centralized state management
• Gobblin in production for a wide range of data sources
• Continuous real-time ingestion

Page 18: Gobblin: Unifying Data Ingestion for Hadoop


ACKNOWLEDGEMENT

Pradhan Cadabam, Shrikanth Shankar, Suvodeep Pyne, Ray Ortigas, Henry Cai, Kenneth Goodhope, Erik Krogen

Page 19: Gobblin: Unifying Data Ingestion for Hadoop


Thanks.

Github: https://github.com/linkedin/gobblin
Documentation: https://github.com/linkedin/gobblin/wiki
User Group: https://groups.google.com/forum/#!forum/gobblin-users