Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn
Mar 16, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Yael Garten, Director of Data Science, LinkedIn
@shirshanka, @yaelgarten
Achieve Data Democracy
Data Scientists write code
Unleash Insights
Share Learnings at Strata!
Three (Naïve) Steps to #DataScienceHappiness, circa 2010
Achieving Data Democracy
“… helping everybody to access and understand data … breaking down silos … providing access to data when and where it is needed at any given moment.”
Collect, flow, and store as much data as you can.
Provide efficient access to data in all its stages of evolution.
The forms of data
Key-Value++ Message Bus Fast-OLAP Search Graph Crunchable
Espresso Venice
Pinot Galene Graph DB
Document DB
Kafka
Hadoop
The forms of data
At Rest In Motion
Espresso Venice
Pinot
Galene
Graph DB Document DB
Kafka
HDFS
The forms of data
At Rest In Motion
Scale (In Motion): O(10) clusters, ~1.7 trillion messages, ~450 TB
Scale (At Rest): O(10) clusters, ~10K machines, ~100 PB
Data Integration: key requirements
Source, Sink Diversity
Batch + Streaming
Data Quality
So, we built Gobblin
SFTP
JDBC REST
Simplifying Data Integration
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more…
Apache incubation under way
SFTP
JDBC REST
Kafka Hadoop
Samza Jobs
Pinot
minutes | hour+
Distributed multi-dimensional OLAP
Columnar + indexes
No joins
Latency: low ms to sub-second
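To make "no joins" concrete: Pinot answers single-table aggregations over denormalized data. A minimal sketch of the kind of query it serves, posted to a broker's HTTP endpoint from Python (the table name, host, and port are illustrative assumptions, not from the talk):

import requests

# Hypothetical single-table aggregation: no joins, just filter + group-by
# over a denormalized table.
query = """
SELECT pageKey, COUNT(*) AS views
FROM profileViews
WHERE daysSinceEpoch = 17240
GROUP BY pageKey
ORDER BY views DESC
LIMIT 10
"""

# Pinot brokers expose an HTTP query endpoint; the address is a placeholder.
resp = requests.post("http://localhost:8099/query/sql", json={"sql": query})
print(resp.json())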
Query Engines
How does LinkedIn build data-driven products?
Data Scientist, PM, Designer
Engineer
We should enable users to filter connection suggestions by company
How much do people utilize existing filter capabilities?
Let's see how users send connection invitations today.
(powers metrics and data products)
InvitationClickEvent()
Scale fact: ~1,000 tracking event types, hundreds of metrics & data products
Tracking data records user activity
user engagement tracking data
metric scripts
production code
Tracking Data Lifecycle
Produce Transport Consume
Member facing data products
Business facing decision making
Tracking Data Lifecycle & Teams
Product or App teams: PMs, Developers, TestEng
Infra teams: Hadoop, Kafka, DWH, ...
Data science teams: Analytics, ML Engineers,...
user engagement tracking data
metric scripts
production code
Member facing data products
Business facing decision making
Produce Transport Consume
Members Execs
How do we calculate a metric: ProfileViews
PageViewEvent
Record1:{"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"profile_page","trackingInfo":"Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}
Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page)
PageViewEvent
Record101:{"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"new_profile_page","trackingInfo":"viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}
or new_profile_page
OK, but the producing team forgot to notify consumers -> undesirable
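A minimal Python sketch of the metric logic above, and why the silent rename hurts (records simplified from the examples; the code is illustrative, not LinkedIn's):

# ProfileViews = count of PageViewEvents with the profile pageKey.
records = [
    {"memberId": 12345, "pageKey": "profile_page"},      # old client
    {"memberId": 67890, "pageKey": "new_profile_page"},  # renamed, no notice
]

profile_views = sum(1 for r in records if r["pageKey"] == "profile_page")
print(profile_views)  # 1 -- the renamed event is silently dropped

# The fix every consumer must independently discover and apply:
profile_views = sum(
    1 for r in records if r["pageKey"] in ("profile_page", "new_profile_page")
)
print(profile_views)  # 2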
Metrics ecosystem at LinkedIn: 3 yrs ago
Operational challenges for infra teams
Diminished trust due to multiple sources of truth
What was causing unhappiness?
1. No contracts: downstream scripts broke when upstream changed
2. "Naming things is hard": different semantics & conventions in various data Events (per team) --> need to email to figure out what is correct and complete logic to use --> inefficient and potentially wrong
3. Discrepant metric logic: duplicate technology allowed duplicate logic, which led to discrepant metric logic
So how did we solve this?
Data Modeling Tip Say no to Fragile Formats or Schema-Free
Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc.
1. No contracts 2. Naming things is hard 3. Discrepant metric logic
Chose Avro as our format
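As a sketch of what a mature serialization format buys you, here is a minimal Avro round trip in Python using the fastavro library (the schema is a simplified stand-in for the real PageViewEvent):

import io
import fastavro

# Simplified stand-in schema: every field is declared and typed, so producers
# and consumers share an explicit contract.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "PageViewEvent",
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "time", "type": "long"},
        {"name": "pageKey", "type": "string"},
    ],
})

record = {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"}

# A record that doesn't match the schema fails at write time,
# instead of breaking downstream scripts at read time.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, record)
buf.seek(0)
print(fastavro.schemaless_reader(buf, schema))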
Sometimes you need a committee
Leads from product and infra teams
Review each new data model
Ensure that it follows our conventions, patterns, and best practices across the entire data lifecycle
1. No contracts 2. Naming things is hard 3. Discrepant metric logic
Data Model Review Committee (DMRC)
Tooling to codify conventions “Always be reviewing”
Who and What Evolution
Unified Metrics Platform
A single source of truth for all business metrics at LinkedIn
1. No contracts 2. Naming things is hard 3. Discrepant metric logic
- metrics processing platform as a service
- a metrics computation template
- a set of tools and processes to facilitate the metrics life-cycle
Central Team, Relevant Stakeholders
Sandbox
Metric Definition
Code Repo
Build & Deploy
System Jobs | Core Metrics Job
Metric Owner
1. create
2. iterate
3. review
4. check in
5,000 metrics daily
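The talk doesn't show the template itself; as a hypothetical sketch, a declarative UMP-style metric definition might look roughly like this (all names and fields invented for illustration):

# Hypothetical metric definition: declarative enough that one shared
# platform can handle scheduling, aggregation, and backfill uniformly.
profile_views_metric = {
    "name": "ProfileViews",                   # invented example
    "owner": "profile-data-team",             # invented owner
    "source": "/data/tracking/PageViewEvent",
    "filter": "pageKey == 'profile_page'",
    "aggregation": "count",
    "dimensions": ["appName", "countryCode"], # slice-and-dice dimensions
    "grain": "daily",
}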
Unified Metrics Platform: Pipeline
Metrics Logic
Raw Data
Pinot
UMP Harness: Incremental Aggregate, Backfill, Auto-join
Raptor dashboards
HDFS
Aggregated Data
Experiment Analysis
Machine Learning
Anomaly Detection
HDFS
Ad-hoc
1. No contracts 2. Naming things is hard 3. Discrepant metric logic
Tracking + Database + Other data
Tracking Platform: standardizing production
Schema compatibility | Time | Audit
Kafka
Client-side Tracking
Tracking Frontend
Services
Tools
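As an illustration of the schema-compatibility check, here is a minimal sketch (not LinkedIn's implementation) of the classic Avro backward-compatibility rule for record schemas: a new version may drop fields, but any field it adds must carry a default.

def is_backward_compatible(old_schema: dict, new_schema: dict) -> bool:
    """Data written with old_schema stays readable under new_schema if
    every field added in new_schema has a default value."""
    old_fields = {f["name"] for f in old_schema["fields"]}
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            return False  # a new required field would break old data
    return True

old = {"type": "record", "name": "PageViewEvent",
       "fields": [{"name": "memberId", "type": "long"}]}
new = {"type": "record", "name": "PageViewEvent",
       "fields": [{"name": "memberId", "type": "long"},
                  {"name": "pageKey", "type": "string"}]}  # no default
print(is_backward_compatible(old, new))  # False -- reject at produce time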
Query Engines
At Rest In Motion
Processing Frameworks
Data Infra + Platforms 2.0
Pinot
Tracking Platform | Unified Metrics Platform (UMP)
Production | Consumption
circa 2015
What was still causing unhappiness?
1. Old bad data sticks around (e.g. old mobile app versions)
2. No clear contract for data production: producers unaware of consumers' concerns
3. Never a good time to pay down this tech debt
We started from the bottom.
Product or App teams: PMs, Developers, TestEng
Infra teams: Hadoop, Kafka, DWH, ...
Data science teams: Analytics, ML Engineers,...
user engagement tracking data
metric scripts
production code
Member facing data products
Business facing decision making
Members Execs
3. Never a good time to pay down this "data" debt
#victimsOfTheData -> #DataScienceHappiness via proactively forging our own data destiny.
Features are waiting to ship to members... some of this work is invisible. But what is the cost of not doing it?
PageViewEvent {"header":{"memberId":12345,"time":1454745292951,"appName":{"string":"LinkedIn""pageKey":"profile_page"},},"trackingInfo":{["Viewee":"23456"], ...}}
We already wanted to move to better data models
ProfileViewEvent
{
  "header": {
    "memberId": 12345,
    "time": 1454745292951,
    "appName": {"string": "LinkedIn"},
    "pageKey": "profile_page"
  },
  "entityView": {
    "viewType": "profile-view",
    "viewerId": "12345",
    "vieweeId": "23456"
  }
}
viewee_ID
There were two options:
1. Keep the old tracking:
   a. Cost: producers (try to) replicate it (write bad old code from scratch)
   b. Save: consumers avoid migrating
2. Evolve:
   a. Cost: time on clean data modeling and on consumer migration to new tracking events
   b. Save: pays down data modeling tech debt
How much work would it be?
#DataScienceHappiness
2. No clear contract existed for data production. Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence.
user engagement tracking data
metric scripts
production code
Member facing data products
Business facing decision making
#victimsOfTheData -> #DataScienceHappiness, via proactive joint requirements definition
Own the artifact that feeds the data ecosystem (and data scientists!)
Data producers (PM, app developers)
Data consumers (DS)
2a. Ensure dialogue between Producers & Consumers
• Awareness: train everyone on the end-to-end data pipeline and data modeling
• Instill a communication & collaborative ownership process between all: a step-by-step playbook for who develops and owns tracking, and how
2b. Standardized core data entities
• Event types and names: Page, Action, Impression
• Framework level client side tracking: views, clicks, flows
• For all else (custom) - guide when to create a new Event
Navigation
Page View
Control Interaction
2c. Created clear maintainable data production contracts
Tracking specification with monitoring and alerting for adherence: clear, visual, consistent contract
Need tooling to support culture and process shift - "Always be tooling"
Tracking specification Tool
1. Old bad data sticks around
PageViewEvent
{ "header" : { "memberId" : 12345, "time" : 1454745292951, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "trackingInfo" : { ["vieweeID" : "23456"], ... } }
ProfileViewEvent
{ "header" : { "memberId" : 12345, "time" : 4745292951145, "appName" : { "string" : "LinkedIn" "pageKey" : "profile_page" }, }, "entityView" : { "viewType" : "profile-view", "viewerId" : “12345”,
"vieweeId" : “23456”, }, }
How do we handle old and new?
PageViewEvent
ProfileViewEvent
Producers Consumers
old
new
Relevance
Analytics
The Big Challenge
load '/data/tracking/PageViewEvent' using AvroStorage();
(Pig scripts)
My Raw Data
Our scripts were doing…
We built Dali to solve this
A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
Solving With Views
Producers
PageViewEvent (old): pagekey == profile
ProfileViewEvent (new): 1:1
LinkedInProfileView
Consumers: Relevance, Analytics
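In Python terms, the view does roughly this (field names follow the event examples above; the real Dali view is defined against Hive, not in Python):

def linkedin_profile_view(page_view_events, profile_view_events):
    # Unify old and new events behind one logical dataset.
    # Old world: profile views are PageViewEvents filtered by pageKey,
    # with the viewee buried in trackingInfo.
    for e in page_view_events:
        if e["header"]["pageKey"] == "profile_page":
            yield {
                "viewerId": str(e["header"]["memberId"]),
                "vieweeId": e["trackingInfo"]["Viewee"],
            }
    # New world: ProfileViewEvents map 1:1 onto the view's schema.
    for e in profile_view_events:
        yield {
            "viewerId": e["entityView"]["viewerId"],
            "vieweeId": e["entityView"]["vieweeId"],
        }

Consumers read only LinkedInProfileView, so producers can evolve or retire the underlying events without breaking them.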
Views ecosystem
Producers: LinkedIn App, Job Seeker App (JSA)
LinkedInProfileView, JSAProfileView
UnifiedProfileView
Consumers
Dali: Implementation Details in Context
Dali FileSystem
Processing Engine (MR, Spark)
Dali Datasets (Tables+Views)
Dataflow APIs (MR, Spark, Scalding)
Query Layers (Pig, Hive, Spark)
Dali CLI
Data Catalog
Git + Artifactory
View Def + UDFs
Dataset Owner
Data Source | Data Sink
From
load '/data/tracking/PageViewEvent' using AvroStorage();
To
load 'tracking.UnifiedProfileView' using DaliStorage();
One small step for a script
A Few Hard Problems
Versioning views and UDFs
Mapping to Hive metastore entities
Development lifecycle: Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment
State of the world today
~300 views
Pretty much all new UMP metrics use Dali data sources
ProfileViews MessagesSent Searches InvitationsSent ArticlesRead JobApplications ...
At Rest
Data Processing Frameworks
Now brewing: Dali on Kafka
Can we take the same views and run them seamlessly on Kafka as well?
Stream Data
Standard streaming APIs: Samza SystemConsumer, Kafka Consumer
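A sketch of the idea using the kafka-python client, reusing the same view logic per record (broker address, topic names, and the JSON deserializer are illustrative assumptions; LinkedIn's events are Avro-encoded):

import json
from kafka import KafkaConsumer

# Subscribe to both the old and new raw topics; names are placeholders.
consumer = KafkaConsumer(
    "PageViewEvent", "ProfileViewEvent",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

# Apply the same logical-view transformation, but in motion:
for msg in consumer:
    event = msg.value
    if msg.topic == "PageViewEvent":
        if event["header"]["pageKey"] != "profile_page":
            continue
        view_record = {"viewerId": str(event["header"]["memberId"]),
                       "vieweeId": event["trackingInfo"]["Viewee"]}
    else:
        view_record = {"viewerId": event["entityView"]["viewerId"],
                       "vieweeId": event["entityView"]["vieweeId"]}
    print(view_record)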
What’s next for Dali?
Selective materialization
Open source
Hive is an implementation detail, not a long term bet
Query Engines
At Rest In Motion
Processing Frameworks
Data Infra + Platforms 3.0
Pinot
Tracking Platform | Unified Metrics Platform (UMP)
Dali | Dr. Elephant | WhereHows
circa 2017
Achieve Data Democracy
Data Scientists write code
Unleash Insights
Share Learnings at Strata
Three (Naïve) Steps to #DataScienceHappiness
Basic data infrastructure for data democracy
Platforms, Process to standardize produce + consume
Evangelize investing in #DataScienceHappiness
Tech + process to sustain a healthy data ecosystem
Our Journey towards #DataScienceHappiness
Dali, Dialogue 2015->
Tracking, UMP DMRC 2013 ->
Kafka, Hadoop, Gobblin, Pinot 2010 ->