Post on 09-Jan-2017

Min Tu, Pradhan Cadabam

Gobblin Configuration Management

Gobblin Meetup June 2016

Agenda

1. Current Solutions and Motivation – Why we built Gobblin config?

2. Architecture – Gobblin config internals

3. Retention Example – How retention is configured using Gobblin config?

Job Configs Vs. Dataset Configs

Copy Job

Source: /events/loginEvent, /events/logoutEvent

Destination: /events/loginEvent - 700, /events/logoutEvent - 777

- Permission for loginEvent 700
- Permission for logoutEvent 777

Option 1: One job per dataset
- Too many jobs
- Long whitelist
- Difficult to maintain

Copy Job 1
dest.permission = 700
whitelist = loginEvent

Copy Job 2
dest.permission = 777
whitelist = logoutEvent

Option 2: Prefix
- Too many configs
- Cannot have a single config for all datasets with the same permissions

Copy Job with prefix
loginEvent.dest.permission = 700
logoutEvent.dest.permission = 777

Data Life Cycle Management Configs

[Diagram: Conversion Job reads /events/loginEvent_Avro and writes /events/loginEvent_Orc; Copy Job copies /events/loginEvent_Orc; Retention Jobs run on /events/loginEvent_Orc]

• Shared configs across jobs

• Destination path of conversion job is source path of copy job

• Retention job works on destination path of copy job

• Dataset needs to be enabled in all jobs

Other Motivations

• New version of configs should be deployable without deploying new binaries

• Should be easy to roll back to the previous stable version of configs

• Config changes should have an audit trail

• Complex value types and substitution resolution support

Dataset Configuration Management

At a very high level, we extend Typesafe Config with:

• Abstraction of a Config Store

• Config versioning

• Support for logical “import” URIs

• Ability to traverse the “import” relationships

Architecture

[Diagram, layers from top to bottom:]

Client Application

ConfigClient API

ConfigStore API

HadoopFS Store, HiveMetaStore Adapter, MySQL Adapter, Zookeeper Adapter, …

Data Model

Config Store

Dataset config key (URI): /events

Dataset config key (URI): /events/loginEvent
Key1: value1
Key2: value2
…
KeyM: valueM

Tag config key (URI): /tags

Tag config key (URI): /tags/highPriority
keyA: valueX
keyB: valueY

[Diagram edge labels: "imports" / "imported by" between dataset and tag keys; "implicit import" from each child key to its parent key]

HOCON format

• Supports Java properties files

• Supports JSON files

• Value substitution

• “+=” syntax to append elements to arrays, e.g. path += "/bin"

• …

gobblin.retention : { selection { timeBased.lookbackTime=3y }}

Using Configs in code

ConfigClient client = ConfigClient.createConfigClient(VersionStabilityPolicy policy);

Config config = client.getConfig(URI uri);

Collection<URI> imports = client.getImports(URI dataset, boolean recursive);

Collection<URI> importedBy = client.getImportedBy(URI tag, boolean recursive);

Config lifecycle at LinkedIn

Example of a config store on HDFS

ROOT
├── _CONFIG_STORE                  // contents = latest non-rolled-back version
    ├── 1.0.53                     // version directory
        ├── events
        │   ├── main.conf
        │   ├── loginEvent
        │   │   ├── main.conf      // configuration file for /events/loginEvent
        │   │   └── includes.conf  // specify import links for /events/loginEvent
        │   ├── shareEvent
        │   │   └── includes.conf
        │   └── clickEvent
        │       └── includes.conf
        └── tags
            ├── highPriority
            │   ├── main.conf      // configuration file for /tags/highPriority
            │   └── includes.conf  // specify import links for /tags/highPriority
            ├── blacklist
            └── 10Days

Retention

• Deleting data that is not required

• The most common retention policy is to delete data older than a given number of days

Example

• Retention policy of 10 days for loginEvent

• Retention policy of 30 days for logoutEvent

Before Retention

├── events
    ├── loginEvent
    │   ├── 2016-06-20.avro
    │   └── 2016-06-25.avro
    └── logoutEvent
        ├── 2016-05-10.avro
        └── 2016-06-10.avro

After Retention

├── events
    ├── loginEvent
    │   └── 2016-06-25.avro
    └── logoutEvent
        └── 2016-06-10.avro
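
The time-based example above can be sketched in a few lines of plain Java. This is only an illustration of "delete versions older than N days", not Gobblin's retention code: the class and method names are hypothetical, the version date is assumed to be the file name without its extension, and the run date of 2016-07-01 is an assumption chosen so that the result matches the before/after listing.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Gobblin.
final class TimeBasedRetentionSketch {

    // Returns the file names whose date is strictly older than (today - retentionDays).
    static List<String> selectForDeletion(List<String> fileNames, int retentionDays, LocalDate today) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<String> toDelete = new ArrayList<>();
        for (String name : fileNames) {
            // Assumption: names look like "2016-06-20.avro" (ISO date + extension).
            LocalDate version = LocalDate.parse(name.substring(0, name.indexOf('.')));
            if (version.isBefore(cutoff)) {
                toDelete.add(name);
            }
        }
        return toDelete;
    }

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2016, 7, 1); // assumed run date
        List<String> login = Arrays.asList("2016-06-20.avro", "2016-06-25.avro");
        List<String> logout = Arrays.asList("2016-05-10.avro", "2016-06-10.avro");
        // 10-day policy for loginEvent, 30-day policy for logoutEvent
        System.out.println("loginEvent delete:  " + selectForDeletion(login, 10, today));
        System.out.println("logoutEvent delete: " + selectForDeletion(logout, 30, today));
    }
}
```

With the assumed run date, the 10-day policy selects only 2016-06-20.avro and the 30-day policy selects only 2016-05-10.avro, which leaves the files shown after retention.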

More complex use cases in Production

• Default retention policy of 30 days for all events

• Retention policy of 10 days for loginEvent

• Blacklist retention for clickEvent

• 3 years retention for high priority events like shareEvent

● “events” is the common parent block for “shareEvent”, “loginEvent”, “logoutEvent”, “clickEvent”

● Each block implicitly imports configs from its parent block; for example, “logoutEvent” implicitly imports “events” (dashed lines)

● Any block can explicitly import any other block (solid lines)

● A child block overrides any key value pairs specified in the parent block
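
The import rules above can be illustrated with a small in-memory model. This is a hedged sketch, not the Gobblin config library: the class and its methods are hypothetical, the store is a pair of hash maps, and the precedence order (own values over explicit imports over the implicit parent import) is an assumption that is consistent with the retention example in these slides. Import cycles are not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of implicit and explicit imports.
final class ImportResolutionSketch {
    static final String LOOKBACK = "gobblin.retention.selection.timeBased.lookbackTime";

    private final Map<String, Map<String, String>> ownValues = new HashMap<>();
    private final Map<String, List<String>> explicitImports = new HashMap<>();

    void put(String configKey, String name, String value) {
        ownValues.computeIfAbsent(configKey, k -> new HashMap<>()).put(name, value);
    }

    void addImport(String configKey, String importedKey) {
        explicitImports.computeIfAbsent(configKey, k -> new ArrayList<>()).add(importedKey);
    }

    // "/events/loginEvent" -> "/events"; top-level keys have no parent.
    static String parentOf(String configKey) {
        int idx = configKey.lastIndexOf('/');
        return idx <= 0 ? null : configKey.substring(0, idx);
    }

    // Later putAll calls win: parent first, then explicit imports, then own values.
    Map<String, String> resolve(String configKey) {
        Map<String, String> resolved = new HashMap<>();
        String parent = parentOf(configKey);
        if (parent != null) {
            resolved.putAll(resolve(parent));                       // implicit import
        }
        for (String imported : explicitImports.getOrDefault(configKey, Collections.emptyList())) {
            resolved.putAll(resolve(imported));                     // explicit import
        }
        resolved.putAll(ownValues.getOrDefault(configKey, Collections.emptyMap()));
        return resolved;
    }

    // Mirrors the retention example: 30d default, 10d for loginEvent, 3y via a tag.
    static ImportResolutionSketch slideExample() {
        ImportResolutionSketch store = new ImportResolutionSketch();
        store.put("/events", LOOKBACK, "30d");
        store.put("/events/loginEvent", LOOKBACK, "10d");
        store.put("/tags/highPriority", LOOKBACK, "3y");
        store.addImport("/events/shareEvent", "/tags/highPriority");
        return store;
    }

    public static void main(String[] args) {
        ImportResolutionSketch store = slideExample();
        for (String key : new String[] {"/events/logoutEvent", "/events/loginEvent", "/events/shareEvent"}) {
            System.out.println(key + " -> " + store.resolve(key).get(LOOKBACK));
        }
    }
}
```

In this model “logoutEvent” resolves to the inherited 30d, “loginEvent” to its own 10d, and “shareEvent” to the 3y value from the tag it imports.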

Retention Config

● “logoutEvent” inherits the default retention of 30 days from implicit import, “events”

logoutEvent 30 Days

● “loginEvent” inherits the default retention of 30 days from implicit import, “events”

● “loginEvent” defines a 10 days policy which overrides the 30 days inherited from “events”

loginEvent 10 Days

● “shareEvent” explicitly imports a high priority tag which has retention of 3 years

● “clickEvent” explicitly imports blacklist tag which disables retention for “clickEvent”

Retention Config for share/clickEvent

├── events
│   ├── main.conf              // Default 30 Days
│   ├── loginEvent
│   │   └── main.conf          // 10 Days
│   ├── shareEvent
│   │   └── includes.conf      // Import /tags/highPriority
│   └── clickEvent
│       └── includes.conf      // Import /tags/blacklist
│
└── tags
    ├── highPriority
    │   └── main.conf          // Define 3 Years retention
    └── blacklist

HDFS Config store

Retention Config Examples

/events/main.conf

gobblin.retention : {
  dataset : {
    finder.class=gobblin.data.management.retention.CleanableDatasetFinder
    pattern="/events/*"
  }
  selection {
    policy.class = gobblin.data.management.SelectBeforeTimeBasedSelectionPolicy
    timeBased.lookbackTime=30d
  }
  version : {
    finder.class=gobblin.data.management.DateTimeDatasetVersionFinder
  }
}

/tags/highPriority/main.conf

gobblin.retention : {
  selection {
    timeBased.lookbackTime=3y
  }
}

Supported Policies

• SelectBeforeTimeBasedSelectionPolicy

• NewestKSelectionPolicy

• DailyDependentHourlyPolicy

• CombineSelectionPolicy

More policies: http://gobblin.readthedocs.io/en/latest/data-management/Gobblin-Retention/

Future work

• Config stores other than the HDFS-based config store

• Improve tooling, validation and UI for config store deployment

Questions
