LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

10 Clusters, 1,000 Users, 10,000 Flows: The Dali Experience at LinkedIn

Carl Steinbach
Senior Staff Software Engineer
LinkedIn Data Analytics Infrastructure Group
in/carlsteinbach
@cwsteinbach

https://www.linkedin.com/in/carlsteinbach



https://twitter.com/cwsteinbach

Hadoop @ LinkedIn: Circa 2008
• 1 cluster
• 20 nodes
• 10 users
• 10 production workflows
• MapReduce, Pig

Hadoop @ LinkedIn: Now
• > 10 clusters
• > 10,000 nodes
• > 1,000 users
• Hundreds of production workflows, thousands of development flows and ad-hoc queries
• MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark, Presto, …

What did we learn along the way?

Scaling Hardware Infrastructure is Hard

What did we learn along the way?

Scaling Human Infrastructure is Harder

Hidden, constantly evolving dependencies binding producers, consumers, and infra providers

Motivations: Producers, Consumers
Data Consumers have to manage too many details:
• Where is the data located? (cluster, path)
• How is the data partitioned? (logical-to-physical mapping)
• How do I read the data? (storage format, wire protocol)

Data Producers are flying blind:
• Who is consuming the data that I produce?
• Will anything break if I make this change?
• Deprecating legacy schemas is too expensive.

Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!

Lots of optimizations are impossible because producer/consumer logic locks us into what should be backend decisions:
• Storage format
• Physical partitioning scheme
• Data location, wire protocol

Lots of redundant code paths to support: Spark, Hive, Presto, Pig, etc.

Dali Vision and Mission
Motivation: Make analytics infrastructure invisible by abstracting away the underlying physical details

Mission: Make data on HDFS easier to access + manage
• Filesystem: protocol-independent, multi-cluster
• Datasets: tables, not files
• Views: virtual datasets, contract management for producers and consumers
• Lineage and Discovery: map datasets to producers and consumers, and track changes over time

Dali Dataset API: Catalog Service

Is a Dataset API Enough?
Some use cases at LinkedIn:
• Structural transformations (flattening and nesting)
• Muxing and de-muxing data (unions)
• Patching bad data
• Backward incompatible changes (intentional and otherwise…)
• Code reuse

What we need:
• Ability to decouple the API from the dataset
• Producer control over public and private APIs
• Tooling and processes to support safe evolution of these APIs

Dali View

A sample view

CREATE VIEW profile_flattened
TBLPROPERTIES(
  'functions' = 'get_profile_section:isb.GetProfileSections',
  'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT get_profile_section(...)
FROM prod_identity.profile;

Reading a Dali View from Pig

register ivy://com.linkedin.dali:dali-all:2.3.52;

data = LOAD 'dalids:///tracking.pageviewevent' USING DaliStorage();

data = FILTER data BY datepartition >= '2016-05-08-00' AND datepartition <= '2016-05-10-00';

View Versioning
• Views can evolve to add/remove fields, update UDF and view/table dependencies, update logic, etc.
• Multiple versions of each view can be registered with Dali at the same time.
• Consumers can migrate to newer versions at their own pace.
• Incremental upgrades reduce the cost and risk of change!

Example: For a database foo that contains view bar, we could have bar_1_0_0, bar_1_1_0, and bar_2_0_0 registered with Dali at the same time.

* We also register bar, which is a "latest" pointer to bar_2_0_0.
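
The naming scheme in the example can be illustrated with a short Python sketch. It assumes versions are encoded as a `_MAJOR_MINOR_PATCH` suffix, as in bar_2_0_0. The helpers `parse_versioned_name` and `resolve_latest` are hypothetical names chosen for illustration and are not part of Dali.

```python
import re

# Assumed naming convention: <base>_<major>_<minor>_<patch>, e.g. "bar_2_0_0".
_VERSIONED = re.compile(r"^(?P<base>.+)_(?P<major>\d+)_(?P<minor>\d+)_(?P<patch>\d+)$")


def parse_versioned_name(name):
    """Return (base, (major, minor, patch)) or None if the name carries no version."""
    match = _VERSIONED.match(name)
    if match is None:
        return None
    version = (int(match.group("major")), int(match.group("minor")), int(match.group("patch")))
    return match.group("base"), version


def resolve_latest(base, registered):
    """Pick the registered view name with the highest version for the given base name."""
    candidates = []
    for name in registered:
        parsed = parse_versioned_name(name)
        if parsed is not None and parsed[0] == base:
            candidates.append((parsed[1], name))
    if not candidates:
        raise KeyError(base)
    # Tuples compare numerically, so (1, 10, 0) ranks above (1, 9, 0).
    return max(candidates)[1]


registered = ["bar_1_0_0", "bar_1_1_0", "bar_2_0_0"]
print(resolve_latest("bar", registered))  # prints bar_2_0_0
```

Comparing versions as integer tuples rather than strings keeps 1_10_0 ahead of 1_9_0.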

Semantic Versioning for Views
Major Version
• Backward incompatible changes to the view schema
  • Removing a field
  • Changing the physical type of an existing field

Minor Version
• Backward compatible changes visible to consumers of the view
  • Adding a new field to the schema

Patch Version
• Everything else that doesn’t alter the schema or semantic output of the view
  • Updating one of the view’s binary dependencies
  • Updating SQL for a better execution plan
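
The versioning rules above can be sketched as a schema comparison. This is a minimal illustration, not Dali's implementation: it assumes a schema can be modeled as a mapping from field name to physical type, and the function names `required_bump` and `next_version` are hypothetical. It only sees schema differences, so a change to the semantic output of a view would still need a human decision.

```python
def required_bump(old_schema, new_schema):
    """Classify a schema change as 'major', 'minor', or 'patch'.

    Schemas are modeled (as an assumption) as dicts of field name -> physical type.
    """
    removed = old_schema.keys() - new_schema.keys()
    retyped = {
        field
        for field in old_schema.keys() & new_schema.keys()
        if old_schema[field] != new_schema[field]
    }
    if removed or retyped:
        return "major"  # backward incompatible: field removed or type changed
    if new_schema.keys() - old_schema.keys():
        return "minor"  # backward compatible: field added
    return "patch"      # schema unchanged, e.g. dependency or SQL update


def next_version(version, level):
    """Apply a bump to a (major, minor, patch) tuple."""
    major, minor, patch = version
    if level == "major":
        return (major + 1, 0, 0)
    if level == "minor":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


old = {"member_id": "BIGINT", "headline": "STRING"}
new = {"member_id": "BIGINT", "headline": "STRING", "industry": "STRING"}
print(next_version((1, 1, 0), required_bump(old, new)))  # prints (1, 2, 0)
```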

Leveraging existing LI tools INFRA

• Query the view/UDF version dependency graph: who-depends-on-me?
• Deprecate, EOL, and purge a specific view/UDF version
• Plug into existing global namespace management provided by LI developer tools
• Enforce referential integrity for views at deployment time
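
A who-depends-on-me query amounts to a reverse traversal of the dependency graph. The sketch below assumes the graph is available as a mapping from each artifact to its direct dependencies. The function name and the `member_summary_1_0_0` view are hypothetical, and the LinkedIn tooling mentioned above is not modeled here.

```python
from collections import defaultdict, deque


def who_depends_on_me(depends_on, target):
    """Return every artifact that depends on target, directly or transitively."""
    # Invert the edges: dependency -> set of direct dependents.
    dependents = defaultdict(set)
    for artifact, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents[dependency].add(artifact)

    seen = set()
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Illustrative graph; member_summary_1_0_0 is a made-up downstream view.
graph = {
    "profile_flattened_1_0_0": ["prod_identity.profile", "get-profile-sections:0.0.5"],
    "member_summary_1_0_0": ["profile_flattened_1_0_0"],
}
print(sorted(who_depends_on_me(graph, "get-profile-sections:0.0.5")))
```

An empty result for a given version suggests it has no registered dependents left, which is the kind of check that would precede purging it.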

Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
• Physical types don’t tell us much
  • STRING or URI? STRING or ENUM?
• Semantic types help, but what about other types of relationships?
  • X IS NOT NULL
  • a_time is in seconds, b_time is in millis

Attributes of a good contract
• Easy to find
• Easy to understand
• Easy to change

Hijacking an existing process
• Express contracts as logical constraints against the fields of a view
• Make the contract easy to find by storing it in the view’s Git repo

Contract negotiation follows an existing process
• Data producer (view owner) controls the ACL on the view repo
• Data consumer requests a contract change via ReviewBoard request
• View owner either accepts or rejects the pull request
  • If accepted, view version is bumped to notify downstream consumers
  • If rejected, consumer still has the option of committing the constraint to their own repo
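
To make "contracts as logical constraints" concrete, here is a minimal Python sketch. It assumes a contract can be modeled as named predicates evaluated against each row of a view. The constraint syntax Dali actually uses is not shown in this deck, so the `CONSTRAINTS` mapping, the `check_contract` function, and the magnitude heuristic for telling seconds from milliseconds are all illustrative assumptions.

```python
def looks_like_epoch_seconds(value):
    # Heuristic assumption: epoch seconds for present-day dates are far below 10**11.
    return isinstance(value, int) and 0 <= value < 10**11


def looks_like_epoch_millis(value):
    # Heuristic assumption: epoch milliseconds for present-day dates are at or above 10**11.
    return isinstance(value, int) and value >= 10**11


# Hypothetical contract: constraint name -> predicate over one row (a dict).
CONSTRAINTS = {
    "x IS NOT NULL": lambda row: row.get("x") is not None,
    "a_time is in seconds": lambda row: looks_like_epoch_seconds(row.get("a_time")),
    "b_time is in millis": lambda row: looks_like_epoch_millis(row.get("b_time")),
}


def check_contract(rows, constraints):
    """Return (row_index, constraint_name) for every violated constraint."""
    violations = []
    for index, row in enumerate(rows):
        for name, predicate in constraints.items():
            if not predicate(row):
                violations.append((index, name))
    return violations


rows = [
    {"x": 1, "a_time": 1462665600, "b_time": 1462665600000},
    {"x": None, "a_time": 1462665600000, "b_time": 1462665600000},
]
for index, name in check_contract(rows, CONSTRAINTS):
    print(f"row {index} violates: {name}")
```

Because the constraints are plain data kept next to the view definition, a consumer could propose a new one through the same review process described above.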

Contract: Constraint-based testing for views
Contract: Data Quality tests

Case Study: Project Voyager
Views allowed us to parallelize development by decoupling the online and offline sides of the project.
• Read existing data using new schemas
• Legacy apps can continue using old schemas

~ 100 views for the Voyager project
• 31 consumer (leaf) views
• 63 producer views
• Dependencies on 48 unique tables

Why Dali?
Consumers
• Make data stable, predictable, discoverable

Producers
• Explicit, manageable contracts with consumers
• Frictionless, familiar process for modifying existing contracts

Infra Providers
• Freedom to optimize
• Flow portability
• DR, multi-DC scheduling

Simplifying with Views
