Gobblin meetup: What's new in 0.7

Gobblin: What’s new? Vasanth Rajamani, Chavdar Botev

Upload: vasanth-rajamani

Post on 09-Jan-2017

TRANSCRIPT

Page 1: Gobblin meetup-whats new in 0.7

Gobblin: What’s new?

Vasanth Rajamani, Chavdar Botev

Page 2: Gobblin meetup-whats new in 0.7

About

Vasanth Rajamani, Manager

ETL Infrastructure, LinkedIn

Chavdar Botev, Tech Lead

ETL Infrastructure, LinkedIn

Page 3: Gobblin meetup-whats new in 0.7

Gobblin for Data Ingest

Streaming events

OLTP Snapshots

OLTP Changelog

Cloud Services

Kafka

JDBC

REST

SOAP

HDFS

SFTP

Page 4: Gobblin meetup-whats new in 0.7

A Peek in Our Support List: Beyond the Data Ingest

1 Monitoring: When and how often is the data made available?

2 Replication: Can you also copy this data onto these other Hadoop clusters?

3 Retention: Can you retain data for a period of time and then purge it on an ongoing basis?

4 Optimization: Can you provide certain datasets in a more optimal format like ORC?

5 Compaction: Can you guarantee that the data doesn’t have duplicates?

6 Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?

Page 5: Gobblin meetup-whats new in 0.7

Beyond Data Ingest

[Diagram]

Sources: site-facing clusters (Oracle, Espresso, Kafka, MySQL) and external sources

Flow labels: Ingest, Replication, Data Load

Storage: HDFS

Clusters: ETL Clusters, Prod Clusters, Dev Clusters, each annotated with:

• Monitoring
• Retention
• Optimization (Format, Layout, Compaction)
• Auditing
• Compliance

Page 6: Gobblin meetup-whats new in 0.7

Data Lifecycle Management: The Next Frontier

Page 7: Gobblin meetup-whats new in 0.7

Data Lifecycle Management

Managing the flow of systems’ data and metadata throughout its life cycle: from creation and receipt, through distribution and maintenance, to deletion.

Page 8: Gobblin meetup-whats new in 0.7

Hadoop Data Lifecycle Management at LinkedIn

Data and metadata

• 10K+ datasets
• Dataset auto-discovery
• Ownership across many teams

Systems

• Multiple loosely coupled systems
• Ownership across multiple teams
• Systems and data evolve independently over time

Page 9: Gobblin meetup-whats new in 0.7

Hadoop Data Lifecycle Management with Gobblin

Page 10: Gobblin meetup-whats new in 0.7

Datasets

• Ubiquitous
• Heterogeneous
• Common

• Dataset URI, e.g. /data/tracking/<TOPIC>, /data/databases/<DATABASE>/<TABLE>
• Metadata
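The dataset URI examples above suggest that datasets can be recognized from their paths alone. The following is a minimal sketch of that idea, not Gobblin's actual discovery code: the `TEMPLATES` table, the function names, and the sample topic and table names in the usage note are all hypothetical.

```python
import re

# Hypothetical URI templates, copied from the examples on the slide.
TEMPLATES = {
    "tracking": "/data/tracking/<TOPIC>",
    "database": "/data/databases/<DATABASE>/<TABLE>",
}


def template_to_regex(template):
    """Turn '/data/tracking/<TOPIC>' into a regex with one named group per placeholder."""
    parts = []
    for segment in template.strip("/").split("/"):
        if segment.startswith("<") and segment.endswith(">"):
            # A placeholder matches exactly one path segment.
            parts.append("(?P<%s>[^/]+)" % segment[1:-1].lower())
        else:
            parts.append(re.escape(segment))
    return re.compile("^/" + "/".join(parts) + "$")


def discover(paths):
    """Map each path that matches a template to (kind, placeholder values)."""
    compiled = {kind: template_to_regex(t) for kind, t in TEMPLATES.items()}
    found = {}
    for path in paths:
        for kind, pattern in compiled.items():
            match = pattern.match(path)
            if match:
                found[path] = (kind, match.groupdict())
                break
    return found
```

For instance, `discover(["/data/tracking/PageViewEvent", "/tmp/scratch"])` would report only the first path as a dataset, with `topic` bound to the made-up name `PageViewEvent`.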

Page 11: Gobblin meetup-whats new in 0.7

Dataset Operators

• Ingest
• Replication
• Retention management
• Data deduping
• …

Different implementations possible
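To illustrate "different implementations possible", here is a hedged sketch of one operator, retention, with two interchangeable policies. The class and method names are illustrative assumptions, not Gobblin's API, and a dataset version is modeled simply as a timestamp.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta


class RetentionPolicy(ABC):
    """Hypothetical interface: decide which versions of a dataset to delete."""

    @abstractmethod
    def select_for_deletion(self, versions, now):
        """Return the versions (timestamps) that should be purged."""


class TimeBasedRetention(RetentionPolicy):
    """Delete every version older than max_age."""

    def __init__(self, max_age):
        self.max_age = max_age

    def select_for_deletion(self, versions, now):
        return [v for v in versions if now - v > self.max_age]


class NewestKRetention(RetentionPolicy):
    """Keep the k most recent versions and delete the rest."""

    def __init__(self, k):
        self.k = k

    def select_for_deletion(self, versions, now):
        newest_first = sorted(versions, reverse=True)
        # Everything after the first k entries is surplus; return oldest first.
        return sorted(newest_first[self.k:])
```

Because both policies share one interface, the code that walks the datasets does not need to know which policy a given dataset uses.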

Page 12: Gobblin meetup-whats new in 0.7

Metadata

• Ubiquitous
• Heterogeneous
• Common

• Associated with a Dataset URI
• Can be represented as a collection of K/V pairs

Metadata in Gobblin:

• Input: Dataset configuration
• Output: Metrics and tracking events
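The two roles of metadata above, configuration going in and metrics or tracking events coming out, can be sketched as K/V collections keyed by dataset URI. This is an assumed in-memory model for illustration only; `MetadataStore`, its methods, and the keys in the usage note are hypothetical and do not reflect how Gobblin stores metadata.

```python
from collections import defaultdict


class MetadataStore:
    """Hypothetical in-memory store; every entry is keyed by a dataset URI."""

    def __init__(self):
        self._config = defaultdict(dict)  # input: dataset configuration
        self._events = defaultdict(list)  # output: metrics and tracking events

    def set_config(self, uri, pairs):
        # Values are stored as strings so the config stays a flat K/V collection.
        self._config[uri].update({k: str(v) for k, v in pairs.items()})

    def get_config(self, uri):
        return dict(self._config.get(uri, {}))

    def emit(self, uri, event, pairs):
        # An event is itself a K/V collection with a reserved "event" key.
        record = {"event": event}
        record.update({k: str(v) for k, v in pairs.items()})
        self._events[uri].append(record)

    def events_for(self, uri):
        return list(self._events.get(uri, []))
```

An operator would read its settings with `get_config(uri)`, for example a made-up key such as `retention.days`, and report what it did with `emit(uri, ...)`.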

Page 13: Gobblin meetup-whats new in 0.7

Orchestration

Dataset operators: independent actors

• Ingest unaware of replication and vice versa

Interaction through shared state

• Ingest lands dataset in a data directory
• Replication copies all datasets in the directory
• Retention runs on all datasets in the directory

Datasets and metadata: the common language
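The interaction through shared state described above can be sketched with a local directory standing in for the shared data directory. This is a simplified illustration under stated assumptions: the function names are hypothetical, a dataset is a subdirectory holding one file, and retention drops whole datasets via a caller-supplied predicate rather than applying a real policy.

```python
import shutil
import tempfile
from pathlib import Path


def list_datasets(data_dir):
    """Every subdirectory of the shared directory counts as one dataset."""
    return sorted(p.name for p in data_dir.iterdir() if p.is_dir())


def ingest(data_dir, dataset, records):
    """Land a dataset as a directory with a single file."""
    target = data_dir / dataset
    target.mkdir(parents=True, exist_ok=True)
    (target / "part-0.txt").write_text("\n".join(records))


def replicate(data_dir, replica_dir):
    """Copy whatever datasets are present; no knowledge of who wrote them."""
    for name in list_datasets(data_dir):
        shutil.copytree(data_dir / name, replica_dir / name, dirs_exist_ok=True)


def retain(data_dir, keep):
    """Remove every dataset for which keep(name) is false."""
    for name in list_datasets(data_dir):
        if not keep(name):
            shutil.rmtree(data_dir / name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data, replica = Path(tmp) / "data", Path(tmp) / "replica"
        data.mkdir()
        ingest(data, "PageViewEvent", ["a", "b"])  # made-up dataset name
        replicate(data, replica)
        retain(data, keep=lambda name: True)
        print(list_datasets(replica))
```

None of the three functions calls another operator; they only agree on the directory layout, which is the loose coupling the slide argues for. Note that `dirs_exist_ok` requires Python 3.8 or later.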

Page 14: Gobblin meetup-whats new in 0.7

How About Falcon?

• Top-down approach
• Tight coupling: centralized repository for feeds (datasets) and processes
• Not designed for multi-tenancy
• Lack of dataset auto-discovery
• Lack of policies
• Inflexible flows

Page 15: Gobblin meetup-whats new in 0.7

Conclusion

Data lifecycle management

It’s more than just ingest

Loosely coupled systems

Flexible processing is a must for growth

Dataset-centric processing

Think about datasets, not jobs