LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
TRANSCRIPT
10 Clusters, 1,000 Users, 10,000 Flows
The Dali Experience at LinkedIn
Carl Steinbach
Senior Staff Software Engineer
LinkedIn Data Analytics Infrastructure Group
in/carlsteinbach
@cwsteinbach
Hadoop @ LinkedIn: Circa 2008
• 1 cluster
• 20 nodes
• 10 users
• 10 production workflows
• MapReduce, Pig
Hadoop @ LinkedIn: NOW
• > 10 clusters
• > 10,000 nodes
• > 1,000 users
• Hundreds of production workflows, thousands of development flows and ad-hoc Qs
• MapReduce, Pig, Hive, Gobblin, Cubert, Scalding, Spark, Presto, …
What did we learn along the way?
Scaling Hardware Infrastructure is Hard
What did we learn along the way?
Scaling Human Infrastructure is Harder
Hidden, constantly evolving dependencies binding producers, consumers, and infra providers
Motivations: Producers, Consumers
Data Consumers have to manage too many details:
• Where is the data located? (cluster, path)
• How is the data partitioned? (logical-to-physical mapping)
• How do I read the data? (storage format, wire protocol)

Data Producers are flying blind:
• Who is consuming the data that I produce?
• Will anything break if I make this change?
• Deprecating legacy schemas is too expensive.
Motivations: Infra Providers
This mess makes things really hard for infrastructure providers!

Lots of optimizations are impossible because producer/consumer logic locks us into what should be backend decisions:
• Storage format
• Physical partitioning scheme
• Data location, wire protocol

Lots of redundant code paths to support: Spark, Hive, Presto, Pig, etc.
Dali Vision and Mission
Motivation: Make analytics infrastructure invisible by abstracting away the underlying physical details

Mission: Make data on HDFS easier to access + manage
• Filesystem: protocol-independent, multi-cluster
• Datasets: tables not files
• Views: virtual datasets, contract management for producers and consumers
• Lineage and Discovery: map datasets to producers, consumers, and track changes over time
Dali Dataset API: Catalog Service
Is a Dataset API Enough?
Some use cases at LinkedIn:
• Structural transformations (flattening and nesting)
• Muxing and de-muxing data (unions)
• Patching bad data
• Backward incompatible changes (intentional and otherwise…)
• Code reuse

What we need:
• Ability to decouple the API from the dataset
• Producer control over public and private APIs
• Tooling and processes to support safe evolution of these APIs
Dali View
A sample view
CREATE VIEW profile_flattened
TBLPROPERTIES(
  'functions' = 'get_profile_section:isb.GetProfileSections',
  'dependencies' = 'com.linkedin.dali-udfs:get-profile-sections:0.0.5')
AS SELECT get_profile_section(...)
FROM prod_identity.profile;
Reading a Dali View from Pig
register ivy://com.linkedin.dali:dali-all:2.3.52;
data = LOAD 'dalids:///tracking.pageviewevent' USING DaliStorage();
data = FILTER data BY datepartition >= '2016-05-08-00' AND datepartition <= '2016-05-10-00';
View Versioning
• Views can evolve to add/remove fields, update UDF and view/table dependencies, update logic, etc.
• Multiple versions of each view can be registered with Dali at the same time.
• Consumers can migrate to newer versions at their own pace.
• Incremental upgrades reduce the cost and risk of change!

Example:
For a database foo which contains view bar, we could have bar_1_0_0, bar_1_1_0, and bar_2_0_0 registered with Dali at the same time.

* We also register bar, which is a pointer to the latest version (bar_2_0_0).
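
The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the idea, not Dali's implementation: the in-memory REGISTERED list and the parse and resolve helpers are hypothetical stand-ins for the catalog. A consumer that asks for an explicit version gets exactly that version, while the bare name behaves like the "latest" pointer.

```python
import re

# Hypothetical stand-in for the view versions registered in database "foo".
REGISTERED = ["bar_1_0_0", "bar_1_1_0", "bar_2_0_0"]

# Matches names of the form <view>_<major>_<minor>_<patch>.
VERSIONED = re.compile(r"^(?P<name>.+)_(?P<major>\d+)_(?P<minor>\d+)_(?P<patch>\d+)$")


def parse(view):
    """Split 'bar_1_1_0' into ('bar', (1, 1, 0)); return None for unversioned names."""
    match = VERSIONED.match(view)
    if match is None:
        return None
    version = tuple(int(match.group(part)) for part in ("major", "minor", "patch"))
    return match.group("name"), version


def resolve(requested, registered=REGISTERED):
    """Return the concrete view version a consumer would read."""
    if requested in registered:
        return requested  # the consumer pinned an explicit version
    # Bare name: behave like the "latest" pointer and pick the highest version.
    candidates = []
    for view in registered:
        parsed = parse(view)
        if parsed is not None and parsed[0] == requested:
            candidates.append((parsed[1], view))
    if not candidates:
        raise KeyError(f"no registered versions of view {requested!r}")
    return max(candidates)[1]  # versions compare numerically, component by component


print(resolve("bar"))        # follows the latest pointer
print(resolve("bar_1_1_0"))  # stays on the pinned version
```

Comparing versions as integer tuples rather than as strings keeps a version such as 1_10_0 ahead of 1_9_0.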
Semantic Versioning for Views
Major Version
• Backward incompatible changes to the view schema
  • Removing a field
  • Changing the physical type of an existing field

Minor Version
• Backward compatible changes visible to consumers of the view
  • Adding a new field to the schema

Patch Version
• Everything else that doesn't alter the schema or semantic output of the view
  • Updating one of the view's binary dependencies
  • Updating SQL for better execution plan
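
As a rough illustration of these rules, the sketch below classifies a schema change by comparing two schemas. It assumes a schema can be represented as a dict from field name to physical type; that representation and the function names required_bump and bump are hypothetical. It only sees schema differences, so a change to the semantic output that leaves the schema untouched would still need a human decision.

```python
def required_bump(old_schema, new_schema):
    """Return 'major', 'minor', or 'patch' for a change from old_schema to new_schema.

    Schemas are dicts mapping field name -> physical type (an assumed representation).
    """
    removed = old_schema.keys() - new_schema.keys()
    retyped = {
        field
        for field in old_schema.keys() & new_schema.keys()
        if old_schema[field] != new_schema[field]
    }
    if removed or retyped:
        return "major"  # backward incompatible: a field vanished or changed type
    if new_schema.keys() - old_schema.keys():
        return "minor"  # backward compatible: only new fields were added
    return "patch"      # schema unchanged, e.g. a dependency or SQL plan update


def bump(version, level):
    """Apply a bump level to a (major, minor, patch) tuple."""
    major, minor, patch = version
    if level == "major":
        return (major + 1, 0, 0)
    if level == "minor":
        return (major, minor + 1, 0)
    return (major, minor, patch + 1)


old = {"member_id": "bigint", "headline": "string"}
new = {"member_id": "bigint", "headline": "string", "industry": "string"}
print(bump((1, 1, 0), required_bump(old, new)))
```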
Leveraging existing LI tools INFRA
• Query view/UDF version dependency graph: who-depends-on-me?
• Deprecate, EOL, and purge a specific view/UDF version
• Plug into existing global namespace management provided by LI developer tools
• Enforce referential integrity for views at deployment time
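
A "who-depends-on-me?" query amounts to walking the dependency graph in reverse. The sketch below shows one way to do that, assuming the declared dependencies are available as a simple mapping; the DEPENDS_ON data, the view names in it, and the helper names are invented for illustration and do not reflect the actual LinkedIn tooling.

```python
from collections import defaultdict, deque

# Hypothetical declarations: each view version -> the tables, views, and UDFs it reads.
DEPENDS_ON = {
    "foo.bar_2_0_0": ["prod_identity.profile",
                      "com.linkedin.dali-udfs:get-profile-sections:0.0.5"],
    "foo.baz_1_0_0": ["foo.bar_2_0_0"],
    "foo.qux_1_0_0": ["foo.baz_1_0_0"],
}


def who_depends_on_me(target, depends_on=DEPENDS_ON):
    """Return every artifact that directly or transitively depends on target."""
    # Invert the edges so we can walk from a source to its consumers.
    consumers_of = defaultdict(set)
    for consumer, sources in depends_on.items():
        for source in sources:
            consumers_of[source].add(consumer)

    seen, queue = set(), deque([target])
    while queue:  # breadth-first walk over the inverted graph
        node = queue.popleft()
        for consumer in consumers_of.get(node, ()):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


def safe_to_purge(target, depends_on=DEPENDS_ON):
    """A version can be purged only when nothing still depends on it."""
    return not who_depends_on_me(target, depends_on)


print(who_depends_on_me("prod_identity.profile"))
print(safe_to_purge("foo.qux_1_0_0"))
```

The same check, run at deployment time, is one way to think about referential integrity: a view version is rejected if anything it lists as a dependency is not registered.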
Contract Law for Datasets
Vague, poorly defined contracts bind data producers to consumers
• Physical types don't tell us much
  • STRING or URI? STRING or ENUM?
• Semantic types help, but what about other types of relationships?
  • X IS NOT NULL
  • a_time is in seconds, b_time is in millis

Attributes of a good contract
• Easy to find
• Easy to understand
• Easy to change
Hijacking an existing process
• Express contracts as logical constraints against the fields of a view
• Make the contract easy to find by storing it in the view's Git repo

Contract negotiation follows an existing process
• Data producer (view owner) controls the ACL on the view repo
• Data consumer requests a contract change via ReviewBoard request
• View owner either accepts or rejects the pull request
  • If accepted, view version is bumped to notify downstream consumers
  • If rejected, consumer still has the option of committing the constraint to their own repo
Contract: Constraint-based testing for views
Contract: Data quality tests
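
To make "contracts as logical constraints" concrete, here is a minimal sketch in which a contract is a list of named predicates evaluated against rows of a view. The field names x, a_time, and b_time follow the earlier slide; the CONTRACT structure, the violations helper, and the magnitude check used to tell seconds from milliseconds are assumptions made for this example, not how Dali expresses or enforces constraints.

```python
# Heuristic threshold (an assumption): epoch timestamps below 10**11 are treated
# as seconds, anything at or above it as milliseconds.
SECONDS_LIMIT = 10**11

# Hypothetical contract: (human-readable constraint, predicate over one row).
CONTRACT = [
    ("x IS NOT NULL", lambda row: row.get("x") is not None),
    ("a_time is in seconds", lambda row: row["a_time"] < SECONDS_LIMIT),
    ("b_time is in millis", lambda row: row["b_time"] >= SECONDS_LIMIT),
]


def violations(rows, contract=CONTRACT):
    """Return (row index, constraint) pairs for every constraint a row breaks."""
    failures = []
    for index, row in enumerate(rows):
        for description, predicate in contract:
            if not predicate(row):
                failures.append((index, description))
    return failures


sample = [
    {"x": 1, "a_time": 1462665600, "b_time": 1462665600000},
    {"x": None, "a_time": 1462665600000, "b_time": 1462665600000},
]
for index, description in violations(sample):
    print(f"row {index} violates: {description}")
```

Because each constraint carries its own description, the same list can serve both as documentation in the view's repo and as a data quality test run against sample output.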
Case Study: Project Voyager
Views allowed us to parallelize development by decoupling the online and offline sides of the project.
• Read existing data using new schemas
• Legacy apps can continue using old schemas

~ 100 views for the Voyager project
• 31 consumer (leaf) views
• 63 producer views
• Dependencies on 48 unique tables
Why Dali?
Consumers
• Make data stable, predictable, discoverable

Producers
• Explicit, manageable contracts with consumers
• Frictionless, familiar process for modifying existing contracts

Infra Providers
• Freedom to optimize
• Flow portability
• DR, multi-DC scheduling
Simplifying with Views
©2014 LinkedIn Corporation. All Rights Reserved.
in/carlsteinbach
@cwsteinbach