Gobblin Meetup: What's New in 0.7
Gobblin: What’s new?
Vasanth Rajamani, Chavdar Botev
About
Vasanth Rajamani, Manager, ETL Infrastructure, LinkedIn
Chavdar Botev, Tech Lead, ETL Infrastructure, LinkedIn
Gobblin for Data Ingest
Streaming events
OLTP Snapshots
OLTP Changelog
Cloud Services
Kafka
JDBC
REST
SOAP
HDFS
SFTP
A Peek in Our Support List: Beyond the Data Ingest
1 Monitoring: When and how often is the data made available?
2 Replication: Can you also copy this data onto these other Hadoop clusters?
3 Retention: Can you retain data for a period of time and then purge it on an ongoing basis?
4 Optimization: Can you provide certain datasets in a more optimal format like ORC?
5 Compaction: Can you guarantee that the data doesn’t have duplicates?
6 Compliance: Can you purge some rows for compliance reasons? Can this be done continuously?
Beyond Data Ingest
[Diagram: data flow beyond ingest]
Sources: site-facing clusters (Oracle, Espresso, Kafka, MySQL); external sources
Flows: Ingest, Replication, Data Load
Storage: HDFS
ETL Clusters
• Monitoring
• Retention
• Optimization (Format, Layout, Compaction)
• Auditing
• Compliance
Prod Clusters
• Monitoring
• Retention
• Optimization
• Auditing
• Compliance
Dev Clusters
• Monitoring
• Retention
• Optimization
• Auditing
• Compliance
Data Lifecycle Management: The Next Frontier
Data Lifecycle Management
Managing the flow of systems’ data and metadata throughout its life cycle: from creation and receipt, through distribution and maintenance, to deletion.
Hadoop Data Lifecycle Management at LinkedIn
Data and metadata
• 10K+ datasets
• Dataset auto-discovery
• Ownership across many teams
Systems
• Multiple loosely coupled systems
• Ownership across multiple teams
• Systems and data evolve independently over time
Hadoop Data Lifecycle Management with Gobblin
Datasets
• Ubiquitous
• Heterogeneous
• Common:
  • Dataset URI, e.g. /data/tracking/<TOPIC>, /data/databases/<DATABASE>/<TABLE>
  • Metadata
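As a rough illustration of how a dataset URI could be derived from a file path, the sketch below matches paths against the two layouts named on the slide. The function names, the regular expressions, and the sample topic and table names are assumptions made for this example; they are not Gobblin's actual discovery code.

```python
import re
from typing import Optional

# Hypothetical patterns mirroring the two layouts on the slide:
#   /data/tracking/<TOPIC>  and  /data/databases/<DATABASE>/<TABLE>
# These are illustrative assumptions, not Gobblin code.
DATASET_PATTERNS = [
    re.compile(r"^(/data/tracking/[^/]+)(?:/.*)?$"),
    re.compile(r"^(/data/databases/[^/]+/[^/]+)(?:/.*)?$"),
]


def dataset_uri_for(path: str) -> Optional[str]:
    """Return the dataset URI that owns `path`, or None if no layout matches."""
    for pattern in DATASET_PATTERNS:
        match = pattern.match(path)
        if match:
            return match.group(1)
    return None


def discover_datasets(paths):
    """Collect the distinct dataset URIs found among a list of file paths."""
    return sorted({uri for uri in map(dataset_uri_for, paths) if uri})
```

Given a listing of files, `discover_datasets` groups them by owning dataset without any per-dataset registration, which is the spirit of auto-discovery: a new topic directory becomes a dataset simply by appearing under the known layout.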
Dataset Operators
• Ingest
• Replication
• Retention management
• Data deduping
• …
Different implementations possible
Metadata
• Ubiquitous
• Heterogeneous
• Common:
  • Associated with a Dataset URI
  • Can be represented as a collection of K/V pairs
Metadata in Gobblin:
• Input: Dataset configuration
• Output: Metrics and tracking events
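One way to picture metadata as key/value pairs attached to a dataset URI, with configuration as input and metrics as output, is the minimal sketch below. The class, the function, and the property keys such as `retention.days` are invented for illustration and do not correspond to real Gobblin configuration keys or metric names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DatasetMetadata:
    """Metadata attached to one dataset URI, held as flat key/value pairs."""
    uri: str
    properties: Dict[str, str] = field(default_factory=dict)


def run_retention_check(config: DatasetMetadata, file_ages_days) -> DatasetMetadata:
    """Read configuration (input metadata) and emit metrics (output metadata)."""
    # "retention.days" is a made-up key used only for this example.
    max_age = int(config.properties.get("retention.days", "30"))
    expired = sum(1 for age in file_ages_days if age > max_age)
    return DatasetMetadata(
        uri=config.uri,
        properties={
            "files.total": str(len(file_ages_days)),
            "files.expired": str(expired),
        },
    )
```

Both the input and the output share the same shape (a URI plus K/V pairs), so any other system that understands that shape can consume the metrics without knowing which operator produced them.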
Orchestration
Dataset operators: independent actors
• Ingest is unaware of replication and vice versa
Interaction through shared state
• Ingest lands a dataset in a data directory
• Replication copies all datasets in the directory
• Retention runs on all datasets in the directory
Datasets and metadata: the common language
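The idea of operators coordinating only through shared state can be sketched as follows. This is a toy model on the local file system, assuming that each subdirectory of a data directory is one dataset; all function names are hypothetical and none of this is Gobblin's API. It needs Python 3.8 or later for `dirs_exist_ok`.

```python
import shutil
import tempfile
from pathlib import Path


def list_datasets(root: Path):
    """Shared state: every subdirectory of `root` is treated as one dataset."""
    return sorted(p for p in root.iterdir() if p.is_dir())


def ingest(root: Path, name: str, records):
    """Land a dataset under the data directory; knows nothing about replication."""
    dataset = root / name
    dataset.mkdir(parents=True, exist_ok=True)
    (dataset / "part-0.txt").write_text("\n".join(records))


def replicate(source_root: Path, target_root: Path):
    """Copy every dataset found in the source directory, whoever wrote it."""
    for dataset in list_datasets(source_root):
        shutil.copytree(dataset, target_root / dataset.name, dirs_exist_ok=True)


def retain(root: Path, keep):
    """Apply a retention predicate to every file of every dataset found."""
    for dataset in list_datasets(root):
        for f in dataset.iterdir():
            if not keep(f):
                f.unlink()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source"
        mirror = Path(tmp) / "mirror"
        source.mkdir()
        mirror.mkdir()
        ingest(source, "topicA", ["a", "b"])
        ingest(source, "topicB", ["c"])
        replicate(source, mirror)
        retain(source, keep=lambda f: False)  # purge every file at the source
        mirrored = [d.name for d in list_datasets(mirror)]
        remaining = [len(list(d.iterdir())) for d in list_datasets(source)]
        return mirrored, remaining
```

None of the three functions calls another; each one only lists the directory and acts on whatever datasets it finds. Adding a new dataset therefore requires no change to replication or retention.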
How About Falcon?
• Top-down approach
• Tight coupling: centralized repository for feeds (datasets) and processes
• Not designed for multi-tenancy
• Lack of dataset auto-discovery
• Lack of policies
• Inflexible flows
Conclusion
Data lifecycle management
It’s more than just ingest
Loosely coupled systems
Flexible processing is a must for growth
Dataset-centric processing
Think about datasets, not jobs