Gobblin Meetup: What's New in 0.7

Gobblin: What’s new?

Vasanth Rajamani, Chavdar Botev

About

Vasanth Rajamani, Manager

ETL Infrastructure, LinkedIn

Chavdar Botev, Tech Lead

ETL Infrastructure, LinkedIn

Gobblin for Data Ingest

Streaming events

OLTP Snapshots

OLTP Changelog

Cloud Services

Kafka

JDBC

REST

SOAP

HDFS

SFTP

A Peek at Our Support List: Beyond Data Ingest

1 Monitoring: When and how often is the data made available?
2 Replication: Can you also copy this data onto these other Hadoop clusters?
3 Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
4 Optimization: Can you provide certain datasets in a more optimal format like ORC?
5 Compaction: Can you guarantee that the data doesn’t have duplicates?
6 Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?

Beyond Data Ingest

[Diagram: data flow across clusters]
Sources: site-facing clusters (Oracle, Espresso, Kafka, MySQL) and external sources
Flows: Ingest, Replication, Data Load
HDFS clusters: ETL Clusters, Prod Clusters, Dev Clusters
On every cluster: Monitoring, Retention, Optimization (Format, Layout, Compaction), Auditing, Compliance

Data Lifecycle Management: The Next Frontier

Managing the flow of systems’ data and metadata throughout its life cycle: from creation and receipt through distribution and maintenance to deletion.

Data Lifecycle Management

Hadoop Data Lifecycle Management at LinkedIn

Data and metadata

• 10K+ datasets
• Dataset auto-discovery
• Ownership across many teams

Systems

• Multiple loosely coupled systems
• Ownership across multiple teams
• Systems and data evolve independently over time

Hadoop Data Lifecycle Management with Gobblin

Datasets

• Ubiquitous
• Heterogeneous
• Common
  • Dataset URI, e.g. /data/tracking/<TOPIC>, /data/databases/<DATABASE>/<TABLE>
  • Metadata
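The two URI layouts above can be recognized mechanically. The following is a minimal sketch, not Gobblin code: the function name `parse_dataset_uri`, the `kind` labels, and the sample topic, database, and table names are all hypothetical, and the patterns assume a dataset URI has exactly the shape shown on the slide.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns modeled on the two example layouts from the slide.
_PATTERNS = {
    "tracking": re.compile(r"^/data/tracking/(?P<topic>[^/]+)$"),
    "database": re.compile(
        r"^/data/databases/(?P<database>[^/]+)/(?P<table>[^/]+)$"
    ),
}


def parse_dataset_uri(uri: str) -> Optional[Dict[str, str]]:
    """Return the dataset kind and its named parts, or None if no layout matches."""
    normalized = uri.rstrip("/")  # tolerate a trailing slash
    for kind, pattern in _PATTERNS.items():
        match = pattern.match(normalized)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None


# Sample names below are made up for illustration.
print(parse_dataset_uri("/data/tracking/ExampleTopic"))
print(parse_dataset_uri("/data/databases/ExampleDb/ExampleTable"))
```

Treating the URI as the one common key is what lets otherwise unrelated tools agree on which dataset they are talking about.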

Dataset Operators

• Ingest
• Replication
• Retention management
• Data deduping
• …

Different implementations possible
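"Different implementations possible" can be pictured as one operator interface with interchangeable implementations. This is an illustrative sketch only, not Gobblin's API: the class names, the `apply` method, and the config keys `retention.days` and `retention.versions` are assumptions made up for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class DatasetOperator(ABC):
    """Hypothetical interface: one operation applied to one dataset."""

    @abstractmethod
    def apply(self, dataset_uri: str, config: Dict[str, str]) -> str:
        """Act on the dataset and return a short description of the action."""


class TimeBasedRetention(DatasetOperator):
    """One possible retention implementation: keep data by age."""

    def apply(self, dataset_uri: str, config: Dict[str, str]) -> str:
        days = config.get("retention.days", "30")  # assumed key and default
        return f"retain last {days} days of {dataset_uri}"


class VersionBasedRetention(DatasetOperator):
    """Another possible implementation: keep the newest N versions."""

    def apply(self, dataset_uri: str, config: Dict[str, str]) -> str:
        versions = config.get("retention.versions", "2")  # assumed key and default
        return f"retain newest {versions} versions of {dataset_uri}"


def run_all(
    operator: DatasetOperator, dataset_uris: List[str], config: Dict[str, str]
) -> List[str]:
    """Apply the same operator to every dataset it is given."""
    return [operator.apply(uri, config) for uri in dataset_uris]
```

Callers depend only on `DatasetOperator`, so swapping one retention strategy for another does not change the code that drives it.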

Metadata

• Ubiquitous
• Heterogeneous
• Common
  • Associated with a Dataset URI
  • Can be represented as a collection of K/V pairs

Metadata in Gobblin:

• Input: Dataset configuration
• Output: Metrics and tracking events
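The idea of metadata as K/V pairs keyed by a dataset URI, with configuration going in and tracking events coming out, can be sketched as follows. Everything here is hypothetical: the key names, the default and override tables, and the event layout are illustrative assumptions, not Gobblin's actual configuration or event schema.

```python
from typing import Dict

# Assumed cluster-wide defaults and per-dataset overrides, all plain K/V pairs.
DEFAULTS: Dict[str, str] = {"retention.days": "30"}
OVERRIDES: Dict[str, Dict[str, str]] = {
    "/data/tracking/ExampleTopic": {"retention.days": "7"},
}


def dataset_config(dataset_uri: str) -> Dict[str, str]:
    """Input side: resolve a dataset's configuration from its URI."""
    return {**DEFAULTS, **OVERRIDES.get(dataset_uri, {})}


def tracking_event(
    dataset_uri: str, operator: str, metrics: Dict[str, int]
) -> Dict[str, str]:
    """Output side: describe what an operator did as flat K/V pairs."""
    event = {"dataset.uri": dataset_uri, "operator": operator}
    for name, value in metrics.items():
        event[f"metric.{name}"] = str(value)
    return event


config = dataset_config("/data/tracking/ExampleTopic")
event = tracking_event("/data/tracking/ExampleTopic", "retention", {"files.deleted": 12})
```

Because both sides are flat K/V maps attached to the same URI, any downstream monitor can consume events without knowing which operator produced them.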

Orchestration

Dataset operators: independent actors

Ingest unaware of replication and vice versa

Interaction through shared state

• Ingest lands a dataset in a data directory
• Replication copies all datasets in the directory
• Retention runs on all datasets in the directory

Datasets and metadata: the common language
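Interaction through shared state can be illustrated with a toy example on the local filesystem. This is a sketch under stated assumptions rather than how Gobblin is implemented: the functions `ingest`, `discover`, and `replicate` are hypothetical, and a local directory stands in for HDFS. The point is that replication never calls ingest; it only looks at what has landed in the directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def discover(data_dir: Path) -> List[str]:
    """Dataset auto-discovery: every subdirectory is treated as one dataset."""
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())


def ingest(data_dir: Path, dataset: str, records: List[str]) -> None:
    """Land a dataset in the shared data directory."""
    target = data_dir / dataset
    target.mkdir(parents=True, exist_ok=True)
    (target / "part-0.txt").write_text("\n".join(records))


def replicate(src_dir: Path, dst_dir: Path) -> List[str]:
    """Copy whatever datasets exist in src_dir, with no knowledge of ingest."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for dataset in discover(src_dir):
        # dirs_exist_ok requires Python 3.8 or newer.
        shutil.copytree(src_dir / dataset, dst_dir / dataset, dirs_exist_ok=True)
    return discover(dst_dir)


with tempfile.TemporaryDirectory() as tmp:
    source, replica = Path(tmp) / "source", Path(tmp) / "replica"
    source.mkdir()
    ingest(source, "ExampleTopicA", ["r1", "r2"])
    ingest(source, "ExampleTopicB", ["r3"])
    copied = replicate(source, replica)
```

Adding a new dataset requires no change to replication: the next run discovers it in the directory and copies it along with the rest.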

How About Falcon?

• Top-down approach
• Tight coupling: centralized repository for feeds (datasets) and processes
• Not designed for multi-tenancy
• Lack of dataset auto-discovery
• Lack of policies
• Inflexible flows

Conclusion

Data lifecycle management

It’s more than just ingest

Loosely coupled systems

Flexible processing is a must for growth

Dataset-centric processing

Think about datasets, not jobs
