Gobblin Meetup: What's New in 0.7

Gobblin: What’s new?

Vasanth Rajamani, Manager, ETL Infrastructure, LinkedIn

Chavdar Botev, Tech Lead, ETL Infrastructure, LinkedIn

Gobblin for Data Ingest

Streaming events

OLTP Snapshots

OLTP Changelog

Cloud Services

A Peek in Our Support List: Beyond the Data Ingest

1. Monitoring: When and how often is the data made available?
2. Replication: Can you also copy this data onto these other Hadoop clusters?
3. Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
4. Optimization: Can you provide certain datasets in a more optimal format like ORC?
5. Compaction: Can you guarantee that the data doesn’t have duplicates?
6. Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?

Beyond Data Ingest

[Diagram: sources (Oracle, Espresso, Kafka, MySQL, site-facing clusters, external sources) feed the ETL, Prod, and Dev clusters through ingest, replication, and data load. Each cluster is annotated with monitoring, retention, optimization (format, layout, compaction), auditing, and compliance.]

Data Lifecycle Management: The Next Frontier

Managing the flow of systems’ data and metadata throughout its life cycle: from creation and receipt through distribution and maintenance to deletion.

Data Lifecycle Management

Hadoop Data Lifecycle Management at LinkedIn

Data and metadata

• 10+K datasets
• Dataset auto-discovery
• Ownership across many teams

Systems

• Multiple loosely coupled systems
• Ownership across multiple teams
• Systems and data evolve independently over time

Hadoop Data Lifecycle Management with Gobblin

Datasets

Ubiquitous, heterogeneous, common

• Dataset URI, e.g. /data/tracking/<TOPIC>, /data/databases/<DATABASE>/<TABLE>
• Metadata
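As a rough illustration of how URIs like these could drive dataset auto-discovery, the sketch below matches candidate paths against the two layouts shown above. The function `discover_datasets`, the pattern table, and the sample topic and table names are hypothetical; none of this is Gobblin's API.

```python
import re

# Hypothetical layouts modeled on the two example URIs; not Gobblin configuration.
DATASET_PATTERNS = {
    "tracking": re.compile(r"^/data/tracking/(?P<topic>[^/]+)$"),
    "database": re.compile(r"^/data/databases/(?P<database>[^/]+)/(?P<table>[^/]+)$"),
}


def discover_datasets(paths):
    """Return {dataset_uri: metadata} for every path that matches a known layout."""
    found = {}
    for path in paths:
        for kind, pattern in DATASET_PATTERNS.items():
            match = pattern.match(path)
            if match:
                # The URI identifies the dataset; its metadata is a flat K/V mapping.
                found[path] = {"kind": kind, **match.groupdict()}
                break
    return found


# Sample paths are made up for illustration.
datasets = discover_datasets([
    "/data/tracking/PageViewEvent",
    "/data/databases/Members/Profile",
    "/tmp/scratch",
])
```

Paths that match no layout, such as `/tmp/scratch` here, are simply not treated as datasets.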

Dataset Operators

• Ingest
• Replication
• Retention management
• Data deduping
• …

Different implementations possible
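One way to picture "different implementations possible" is a single operator interface with interchangeable policies behind it. The sketch below uses retention as the example; the class names `RetentionOperator`, `KeepNewestK`, and `KeepRecentDays` are illustrative assumptions and do not describe Gobblin's actual classes.

```python
from abc import ABC, abstractmethod
from datetime import date, timedelta


class RetentionOperator(ABC):
    """Hypothetical interface: decide which dataset versions to purge."""

    @abstractmethod
    def versions_to_delete(self, versions, today):
        ...


class KeepNewestK(RetentionOperator):
    """Keep only the k most recent versions."""

    def __init__(self, k):
        self.k = k

    def versions_to_delete(self, versions, today):
        ordered = sorted(versions)
        return ordered[:max(len(ordered) - self.k, 0)]


class KeepRecentDays(RetentionOperator):
    """Keep only versions dated within the last `days` days."""

    def __init__(self, days):
        self.days = days

    def versions_to_delete(self, versions, today):
        cutoff = today - timedelta(days=self.days)
        return sorted(v for v in versions if v < cutoff)


# Four daily versions of one dataset (made-up dates).
versions = [date(2016, 1, d) for d in (1, 2, 3, 4)]
today = date(2016, 1, 5)
by_count = KeepNewestK(2).versions_to_delete(versions, today)
by_age = KeepRecentDays(1).versions_to_delete(versions, today)
```

Callers depend only on the interface, so a policy can be swapped per dataset without touching the code that walks the datasets.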

Metadata

Ubiquitous, heterogeneous, common

• Associated with a Dataset URI
• Can be represented as a collection of K/V pairs

Metadata in Gobblin:

• Input: Dataset configuration
• Output: Metrics and tracking events
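The input/output split above can be sketched as a function that takes a dataset's configuration as K/V pairs and returns a tracking event, also as K/V pairs. All key names (`operator.name`, `batch.size`, `records.processed`, `duration.ms`) and the helper `run_with_metadata` are assumptions made for this example, not Gobblin property names.

```python
import time


def run_with_metadata(dataset_uri, config, operator):
    """Run an operator with K/V config as input; return a K/V tracking event."""
    started = time.monotonic()
    records = operator(dataset_uri, config)
    elapsed_ms = int((time.monotonic() - started) * 1000)
    return {
        "dataset.uri": dataset_uri,
        "operator.name": config.get("operator.name", "unknown"),
        "records.processed": str(records),
        "duration.ms": str(elapsed_ms),
    }


def fake_ingest(dataset_uri, config):
    # Stand-in for real work: pretend one batch of records was pulled.
    return int(config["batch.size"])


# Hypothetical per-dataset configuration, keyed by made-up property names.
config = {"operator.name": "ingest", "batch.size": "500"}
event = run_with_metadata("/data/tracking/PageViewEvent", config, fake_ingest)
```

Because both sides are plain K/V collections tied to a dataset URI, any operator can consume the configuration and any monitoring system can consume the events.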

Orchestration

Dataset operators: independent actors

Ingest unaware of replication and vice versa

Interaction through shared state

• Ingest lands a dataset in a data directory
• Replication copies all datasets in the directory
• Retention runs on all datasets in the directory
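The interaction through shared state can be sketched with a local directory standing in for the data directory. The three functions below never call each other; each one only reads or writes the directory. Function names, file names, and the keep-newest-N retention rule are assumptions for illustration (requires Python 3.8+ for `dirs_exist_ok`).

```python
import shutil
import tempfile
from pathlib import Path


def ingest(data_dir, dataset, day):
    """Land one daily file for a dataset in the shared data directory."""
    target = data_dir / dataset
    target.mkdir(parents=True, exist_ok=True)
    (target / f"{day}.avro").write_text("records")


def replicate(data_dir, replica_dir):
    """Copy every dataset found in the data directory; unaware of ingest."""
    copied = []
    for dataset in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        shutil.copytree(dataset, replica_dir / dataset.name, dirs_exist_ok=True)
        copied.append(dataset.name)
    return copied


def retain(data_dir, keep):
    """Purge all but the newest `keep` files of every dataset in the directory."""
    purged = []
    for dataset in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        files = sorted(dataset.iterdir())
        for old in files[:max(len(files) - keep, 0)]:
            old.unlink()
            purged.append(f"{dataset.name}/{old.name}")
    return purged


with tempfile.TemporaryDirectory() as tmp:
    data_dir, replica_dir = Path(tmp) / "data", Path(tmp) / "replica"
    for day in ("2016-01-01", "2016-01-02", "2016-01-03"):
        ingest(data_dir, "PageViewEvent", day)
    copied = replicate(data_dir, replica_dir)
    purged = retain(data_dir, keep=2)
    remaining = sorted(p.name for p in (data_dir / "PageViewEvent").iterdir())
```

A new dataset landed by ingest is picked up by replication and retention on their next run with no change to either, which is the point of coordinating through the directory instead of through direct calls.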

Datasets and metadata: the common language

How About Falcon?

• Top-down approach
• Tight coupling: centralized repository for feeds (datasets) and processes
• Not designed for multi-tenancy
• Lack of dataset auto-discovery
• Lack of policies
• Inflexible flows

Conclusion

Data lifecycle management

It’s more than just ingest

Loosely coupled systems

Flexible processing is a must for growth

Dataset-centric processing

Think about datasets, not jobs
