LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

TRANSCRIPT

Page 1: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

10 Clusters, 1,000 Users, 10,000 Flows: The Dali Experience at LinkedIn

Carl Steinbach, Senior Staff Software Engineer, LinkedIn Data Analytics Infrastructure Group
in/carlsteinbach | @cwsteinbach

Page 2: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Hadoop @ LinkedIn: Circa 2008
1 cluster, 20 nodes, 10 users
10 production workflows
MapReduce, Pig

Page 3: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Hadoop @ LinkedIn: Now
> 10 clusters, > 10,000 nodes, > 1,000 users
Hundreds of production workflows, thousands of development flows and ad-hoc queries
MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark, Presto, …

Page 4: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

What did we learn along the way?

Scaling Hardware Infrastructure is Hard

Page 5: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

What did we learn along the way?

Scaling Human Infrastructure is Harder

Page 6: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016


Hidden, constantly evolving dependencies binding producers, consumers, and infra providers

Page 7: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Motivations: Producers, Consumers

Data Consumers have to manage too many details:
• Where is the data located? (cluster, path)
• How is the data partitioned? (logical-to-physical mapping)
• How do I read the data? (storage format, wire protocol)

Data Producers are flying blind:
• Who is consuming the data that I produce?
• Will anything break if I make this change?
• Deprecating legacy schemas is too expensive.

Page 8: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Motivations: Infra Providers

This mess makes things really hard for infrastructure providers!

Lots of optimizations are impossible because producer/consumer logic locks us into what should be backend decisions:
• Storage format
• Physical partitioning scheme
• Data location, wire protocol

Lots of redundant code paths to support: Spark, Hive, Presto, Pig, etc.

Page 9: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Dali Vision and Mission

Motivation: Make analytics infrastructure invisible by abstracting away the underlying physical details.

Mission: Make data on HDFS easier to access + manage
• Filesystem: protocol-independent, multi-cluster
• Datasets: tables, not files
• Views: virtual datasets, contract management for producers and consumers
• Lineage and Discovery: map datasets to producers and consumers, and track changes over time

Page 10: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Dali Dataset API: Catalog Service

Page 11: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Is a Dataset API Enough?

Some use cases at LinkedIn:
• Structural transformations (flattening and nesting)
• Muxing and de-muxing data (unions)
• Patching bad data
• Backward incompatible changes (intentional and otherwise…)
• Code reuse

What we need:
• Ability to decouple the API from the dataset
• Producer control over public and private APIs
• Tooling and processes to support safe evolution of these APIs

Page 12: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Dali View

Page 13: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

A sample view

CREATE VIEW profile_flattened
TBLPROPERTIES (
  'functions' = 'get_profile_section:isb.GetProfileSections',
  'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5'
)
AS SELECT get_profile_section(...)
FROM prod_identity.profile;
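
Once registered, a consumer can query the view from Hive like an ordinary table. A minimal sketch; the selected columns, including datepartition, are illustrative and not the view's actual schema:

-- Read from the Dali view as if it were a regular table (hypothetical columns)
SELECT member_id, profile_section
FROM profile_flattened
WHERE datepartition = '2016-05-08-00';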

Page 14: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Reading a Dali View from Pig

register ivy://com.linkedin.dali:dali-all:2.3.52;

data = LOAD 'dalids:///tracking.pageviewevent' USING DaliStorage();

data = FILTER data BY datepartition >= '2016-05-08-00' AND datepartition <= '2016-05-10-00';

Page 15: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

View Versioning
• Views can evolve to add/remove fields, update UDF and view/table dependencies, update logic, etc.
• Multiple versions of each view can be registered with Dali at the same time.
• Consumers can migrate to newer versions at their own pace.
• Incremental upgrades reduce the cost and risk of change!

Example: For a database foo which contains view bar, we could have bar_1_0_0, bar_1_1_0, and bar_2_0_0 registered with Dali at the same time.
* We also register bar, which is a "latest" pointer to bar_2_0_0.
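
A hedged sketch of what this example could look like if each version were registered as a separately named Hive view; the underlying table is borrowed from the Pig example above, the columns are illustrative, and this is not necessarily Dali's actual registration mechanism:

-- Register a new major version alongside the existing bar_1_0_0 and bar_1_1_0
CREATE VIEW foo.bar_2_0_0 AS
SELECT member_id, page_key          -- illustrative columns
FROM tracking.pageviewevent;

-- Repoint the unversioned "latest" alias at the newest version
DROP VIEW IF EXISTS foo.bar;
CREATE VIEW foo.bar AS
SELECT * FROM foo.bar_2_0_0;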

Page 16: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Semantic Versioning for Views

Major Version
• Backward incompatible changes to the view schema
  • Removing a field
  • Changing the physical type of an existing field

Minor Version
• Backward compatible changes visible to consumers of the view
  • Adding a new field to the schema

Patch Version
• Everything else that doesn't alter the schema or semantic output of the view
  • Updating one of the view's binary dependencies
  • Updating SQL for a better execution plan
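
To make these rules concrete, here is a hypothetical evolution of the profile_flattened view from the earlier slide; the column names are illustrative and not the view's real schema:

-- Patch bump (1.0.0 -> 1.0.1): only the UDF dependency version in
-- TBLPROPERTIES changes; the schema and output are unchanged.

-- Minor bump (1.0.1 -> 1.1.0): a new column (industry) is added;
-- existing consumers keep working.
CREATE VIEW profile_flattened_1_1_0 AS
SELECT member_id, headline, industry
FROM prod_identity.profile;

-- Major bump (1.1.0 -> 2.0.0): the headline column is removed, which
-- breaks any consumer that still selects it.
CREATE VIEW profile_flattened_2_0_0 AS
SELECT member_id, industry
FROM prod_identity.profile;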

Page 17: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Leveraging existing LI infra tools
• Query the view/UDF version dependency graph: who depends on me?
• Deprecate, EOL, and purge a specific view/UDF version
• Plug into existing global namespace management provided by LI developer tools
• Enforce referential integrity for views at deployment time

Page 18: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Contract Law for Datasets

Vague, poorly defined contracts bind data producers to consumers.

Physical types don't tell us much:
• STRING or URI? STRING or ENUM?

Semantic types help, but what about other types of relationships?
• X IS NOT NULL
• a_time is in seconds, b_time is in millis

Attributes of a good contract:
• Easy to find
• Easy to understand
• Easy to change

Page 19: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Hijacking an existing process
• Express contracts as logical constraints against the fields of a view.
• Make the contract easy to find by storing it in the view's Git repo.

Contract negotiation follows an existing process:
• The data producer (view owner) controls the ACL on the view repo.
• A data consumer requests a contract change via a ReviewBoard request.
• The view owner either accepts or rejects the pull request.
  • If accepted, the view version is bumped to notify downstream consumers.
  • If rejected, the consumer still has the option of committing the constraint to their own repo.

Contracts drive constraint-based testing for views and data quality tests.
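
A hedged sketch of what a constraint-based data quality test could look like when the contract is expressed as logical predicates over a view's fields; the column names and threshold are hypothetical, and this is not Dali's actual contract syntax:

-- Contract: member_id IS NOT NULL, and a_time is recorded in epoch seconds
-- (13-digit values would indicate milliseconds).
-- The test fails if contract_violations is greater than zero.
SELECT COUNT(*) AS contract_violations
FROM profile_flattened
WHERE member_id IS NULL
   OR a_time >= 100000000000;  -- values this large suggest millis, not seconds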

Page 20: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Case Study: Project Voyager

Views allowed us to parallelize development by decoupling the online and offline sides of the project.
• Read existing data using new schemas
• Legacy apps can continue using old schemas

~100 views for the Voyager project:
• 31 consumer (leaf) views
• 63 producer views
• Dependencies on 48 unique tables

Page 21: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Why Dali?

Consumers
• Make data stable, predictable, discoverable

Producers
• Explicit, manageable contracts with consumers
• Frictionless, familiar process for modifying existing contracts

Infra Providers
• Freedom to optimize
• Flow portability: DR, multi-DC scheduling

Page 22: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

Simplifying with Views

Page 23: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

©2014 LinkedIn Corporation. All Rights Reserved.

Page 24: LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016

[email protected]
in/carlsteinbach
@cwsteinbach