Path to 400M Members: LinkedIn’s Data Powered Journey
TRANSCRIPT
![Page 1: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/1.jpg)
Xin Fu, Carl Steinbach
Hadoop Summit Tokyo, October 26, 2016
Path to 400M* Members: LinkedIn’s Data Powered Journey
* As of Q2 2016, LinkedIn had 450M members worldwide
![Page 2: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/2.jpg)
[Timeline graphic; year labels: 2004, 2009, 2011, 2012, 2015]
![Page 3: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/3.jpg)
Real Time Visualization of New Sign-ups
![Page 4: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/4.jpg)
What Does “Data-Driven” Mean at LinkedIn?
![Page 5: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/5.jpg)
What Does “Data-Driven” Mean at LinkedIn?
![Page 6: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/6.jpg)
Monitoring & Learning
![Page 7: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/7.jpg)
What is This Phase Comprised of?
● Dashboards
● Reports
● Trend explanation
○ Short term fluctuation: investigation
○ Long term trend: strategic analysis
![Page 8: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/8.jpg)
Past Challenges
Reliability
● Easily broken without operational support, huge time spent in maintenance
Diverse technology
● Self-maintained pipelines
● Various UIs with different visualization capabilities
● Redundant computation
![Page 9: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/9.jpg)
Standardized Reporting Tool
● Reduces dependency on 3rd party BI tools
● Closer integration with LinkedIn’s ecosystem of experimentation and anomaly detection solutions
![Page 10: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/10.jpg)
Towards Real Time Monitoring
[Diagram: sign-ups broken down by Country, Platform, Language, Browser, Signup Type, OS]
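As a rough illustration of monitoring sign-ups along the dimensions named on this slide, the following minimal Python sketch counts sign-up events per dimension value. The field names, the sample events, and the `count_by_dimension` helper are hypothetical and not taken from the talk; a real-time system would run this kind of aggregation over a stream rather than a list.

```python
from collections import Counter, defaultdict

# Dimension names follow the slide; the lowercase field names are assumptions.
DIMENSIONS = ("country", "platform", "language", "browser", "signup_type", "os")


def count_by_dimension(events):
    """Return {dimension: Counter(value -> number of sign-ups)}."""
    counts = defaultdict(Counter)
    for event in events:
        for dim in DIMENSIONS:
            # Events missing a field are grouped under "unknown", not dropped.
            counts[dim][event.get(dim, "unknown")] += 1
    return dict(counts)


# Hypothetical sample events, for illustration only.
sample_events = [
    {"country": "JP", "platform": "mobile", "language": "ja",
     "browser": "Safari", "signup_type": "organic", "os": "iOS"},
    {"country": "JP", "platform": "desktop", "language": "ja",
     "browser": "Chrome", "signup_type": "invitation", "os": "Windows"},
    {"country": "US", "platform": "mobile", "language": "en",
     "browser": "Chrome", "signup_type": "organic"},  # no "os" field
]

breakdown = count_by_dimension(sample_events)
print(breakdown["country"])
```

Keeping one counter per dimension lets a dashboard slice the same sign-up stream by any single dimension without recomputing the others.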
![Page 11: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/11.jpg)
Experimentation & Analysis
![Page 12: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/12.jpg)
What is This Phase Comprised of?
● Experiment design
● Experiment analysis to inform ramp decisions
● Learning from multiple experiments to identify what works and what doesn’t work
![Page 13: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/13.jpg)
Past Challenges
Experiment design
● Interaction between experiments
Experiment analysis and ramp decision
● Manual analysis, extended time-to-decision
● Ramp decisions based on localized metrics
● Reruns needed sometimes due to undetected errors in setup
Worst of all, some ramps happened without A/B testing
● e.g. infrastructural changes
![Page 14: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/14.jpg)
Experimentation Platform @ LinkedIn
● Company-wide platform for A/B testing, ramping, and advanced targeting needs
● Automated reporting and analysis capabilities
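To make "automated analysis" of an A/B test concrete, here is a minimal Python sketch of a standard two-proportion z-test comparing conversion rates between a control and a treatment group. This is a textbook technique chosen for illustration; the slides do not say which statistical methods LinkedIn's platform uses, and the function name and sample numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate (B minus A).

    conv_a / conv_b: number of converting members in each group.
    n_a / n_b: total members in each group.
    Returns (z statistic, two-sided p-value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: 10.0% vs 15.0% conversion on 1000 members each.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A platform would typically run a check like this per metric and per experiment, which is one reason metric definitions and tracking data need to be consistent across teams.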
![Page 15: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/15.jpg)
Tiering of Metrics
Metrics at different tiers:
● Different review processes
● Different levels of visibility in dashboards and experiment scorecards
● Different computation priorities and SLAs in data pipelines
● Different life cycles
![Page 16: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/16.jpg)
Backend Infrastructure for Tracking & Instrumentation
![Page 17: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/17.jpg)
Tracking Data Records User Activity
InvitationClickEvent()
Scale fact: ~1000 tracking event types, ~20TB per day, hundreds of metrics & data products
![Page 18: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/18.jpg)
Tracking Data Lifecycle and Teams
Product teams: PMs, Developers, TestEng
Infra teams: Hadoop, Kafka, DWH, ...
Data teams: Analytics, Relevance Engineers, ...
![Page 19: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/19.jpg)
Example: How Do We Track a Profile View?
PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
      "string" : "LinkedIn"
    },
    "pageKey" : "profile_page"
  },
  "trackingInfo" : {
    ["vieweeID" : "23456"],
    ...
  }
}
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY header.pageKey == 'profile_page';
![Page 20: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/20.jpg)
Example: How Do We Track a Profile View?
PageViewEvent
Record 1:
{
  "header" : {
    "memberId" : 12345,
    "time" : 1454745292951,
    "appName" : {
      "string" : "LinkedIn"
    },
    "pageKey" : "new_profile_page"
  },
  "trackingInfo" : {
    ["vieweeID" : "23456"],
    ...
  }
}
pageViews = LOAD '/data/tracking/PageViewEvent';
profileViews = FILTER pageViews BY header.pageKey == 'profile_page' OR header.pageKey == 'new_profile_page';
![Page 21: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/21.jpg)
At Some Point It Becomes Unmaintainable ...
![Page 22: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/22.jpg)
How Do We Handle Old and New?
[Diagram labels: Producers, Consumers]
![Page 23: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/23.jpg)
DALI: A Data Access Layer for LinkedIn
Abstract away underlying physical details to allow users to focus solely on the logical concerns
Logical Tables + Views
Logical FileSystem
We had been working on something that could help...
![Page 24: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/24.jpg)
DALI: Implementation Details in Context
[Architecture diagram; component labels:]
Data Catalog + Discovery (DALI)
DaliFileSystem Client
Data Source (HDFS)
Data Sink (HDFS)
Processing Engine (MapReduce, Spark, Presto)
DALI Datasets (Tables + Views)
Query Layers (Hive, Pig, Spark)
View Defs + UDFs (Artifactory, Git)
Dataflow APIs (MR, Spark, Scalding)
DALI CLI
![Page 25: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/25.jpg)
Solving with DALI Views
Producers Consumers
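The idea behind solving the old/new page key problem with views can be sketched in plain Python. This is only an analogy, not DALI's actual API or view definition language: the `profile_views` function and the output field names are hypothetical, while the two page keys and the record layout come from the earlier PageViewEvent example. The point is that the list of page keys lives in one place, owned by the producer side, so consumers no longer filter on `pageKey` themselves.

```python
# The two page keys come from the earlier slides. Keeping them in a single
# definition means consumers do not change when a new key is introduced.
PROFILE_PAGE_KEYS = frozenset({"profile_page", "new_profile_page"})


def profile_views(page_view_events):
    """A 'view' over raw PageViewEvent records that yields only profile views.

    Output field names (viewerId, vieweeId, time) are assumptions made for
    this sketch.
    """
    for event in page_view_events:
        header = event["header"]
        if header["pageKey"] in PROFILE_PAGE_KEYS:
            yield {
                "viewerId": header["memberId"],
                "vieweeId": event["trackingInfo"]["vieweeID"],
                "time": header["time"],
            }


# Hypothetical raw events mirroring the record layout shown earlier.
raw_events = [
    {"header": {"memberId": 12345, "time": 1454745292951,
                "pageKey": "profile_page"},
     "trackingInfo": {"vieweeID": "23456"}},
    {"header": {"memberId": 12345, "time": 1454745300000,
                "pageKey": "new_profile_page"},
     "trackingInfo": {"vieweeID": "34567"}},
    {"header": {"memberId": 12345, "time": 1454745310000,
                "pageKey": "home_page"},
     "trackingInfo": {}},
]

views = list(profile_views(raw_events))
print(len(views))
```

A consumer that reads `profile_views(...)` keeps working unchanged if a third page key is added later; only the view definition changes.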
![Page 26: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/26.jpg)
State of the World Today with Dali
~100 producer views
~200 consumer views
~80 unique tracking event data sources
What’s next?
● Views on streaming data
● Selective materialization and caching
● Open source
![Page 27: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/27.jpg)
At the Core of “Data-Driven” is ....
![Page 28: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/28.jpg)
Used to be Tug of War Between Speed and Quality
![Page 29: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/29.jpg)
Before We Learned that Technology Could Break the Dichotomy Between Speed and Quality
![Page 30: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/30.jpg)
Cultural Aspects: Partnership Between Data Scientists and Engineers
![Page 31: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/31.jpg)
Interesting Challenges
- Metric trade-off, e.g. engagement vs. monetization
- Real-time everything?
- A/B test in a social network
- Human judge for personalized search
- Value of an action
![Page 32: Path to 400M Members: LinkedIn’s Data Powered Journey](https://reader031.vdocuments.us/reader031/viewer/2022022203/586fde0d1a28ab18428b6aa5/html5/thumbnails/32.jpg)
It Took a Village
Thanks to all the Data Scientists, Engineers and Product partners at LinkedIn for being part of this great journey!
https://engineering.linkedin.com/data