Big Data Analytics in LinkedIn
Brief History of LinkedIn
- Launched in 2003 by Reid Hoffman (https://ourstory.linkedin.com/)
- 2005: Introduced first business lines: Jobs and Subscriptions
- 2006: Launched public profiles (achieved profitability/new features)
- 2008: LinkedIn goes GLOBAL! (https://business.linkedin.com/)
- 2012: Site transformation/rapid growth
- 2013: ~225 million members (27% of LinkedIn subscribers are recruiters)
Three Major Data Dimensions @LinkedIn
LinkedIn Challenges for Web-scale OLAP
● Horizontally scalable
○ currently more than 200 million members
○ adding 2 new members per second
● Quick response time to user’s queries
● High availability
● High read & write throughput (billions of monthly page views)
● Heavy dependence on the slowest node's response time, since data is spread across many nodes
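The last point can be made concrete with a small sketch. It assumes a query fans out to every node in parallel and has to wait for all partial results; the function name and the latencies are made up for illustration and are not LinkedIn measurements.

```python
def fan_out_latency_ms(node_latencies_ms):
    """Latency of a scatter-gather query: it completes only when the slowest node answers."""
    if not node_latencies_ms:
        raise ValueError("need at least one node")
    return max(node_latencies_ms)

# Hypothetical per-node response times in milliseconds.
latencies = [12, 15, 11, 240, 14]
total = fan_out_latency_ms(latencies)
print(total)  # 240
```

One slow node out of five sets the response time for the whole query, which is why spreading a single query across many nodes hurts tail latency.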
Current OLAP Solutions - not suited for high-traffic websites
● What is OLAP? Online Analytical Processing
○ Long transactions
○ Complex queries
○ Mining and analyzing large amounts of data
○ Infrequent updates of data
● Traditional, built for business intelligence (e.g. SAP, Oracle)
● Must retrieve & consolidate partial results across nodes (causing slow responses)
● Distributed (problems with latency, availability and cost)
● Materialized cubes (loading billions of page views - load too high)
Avatara: solution for Web-scale Analytics Products
● Provides a fast, scalable OLAP system
○ handles small-cube scenarios
○ simple grammar for cube construction and querying at scale
○ shards cubes by a dimension into a key-value model
○ leverages a distributed key-value store for low-latency, high-availability access to cubes
○ leverages Hadoop for joins
Avatara: solution for Web-scale Analytics Products
● Two examples of analytics features:
○ WVMP - cube sharded by member ID
■ Who’s viewed my profile? (WVMP)
○ WVTJ - cube sharded across jobs
■ Who’s viewed this job? (WVTJ)
Avatara: solution (cont'd)
● Sharding (i.e. horizontal scaling)
○ divides the data set and distributes the data over multiple servers. Each shard is an independent database, and together the shards make up a single logical database
■ sharding on a primary key (turning a big cube into smaller ones)
● Storing a cube's data in one location means it can be read with a single disk fetch
● Offline Batch Engine
○ High throughput
○ Batch processing (Hadoop jobs)
● Online Query Engine
○ low latency, high availability
○ key-value paradigm for storing data (Voldemort)
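As a rough illustration of sharding on a primary key, the sketch below splits one large cube into small per-member cubes. It is not Avatara's code: the row layout, the field names and the plain `dict` standing in for a distributed key-value store are all assumptions.

```python
from collections import defaultdict

def shard_cube(rows, shard_key):
    """Split one big cube (a list of fact rows) into small cubes keyed by shard_key."""
    small_cubes = defaultdict(list)
    for row in rows:
        small_cubes[row[shard_key]].append(row)
    return dict(small_cubes)

# Hypothetical WVMP-style profile-view facts.
rows = [
    {"member_id": 1, "viewer_industry": "Software", "week": "2013-W01", "views": 3},
    {"member_id": 1, "viewer_industry": "Finance", "week": "2013-W01", "views": 1},
    {"member_id": 2, "viewer_industry": "Software", "week": "2013-W02", "views": 5},
]

# A plain dict stands in for the distributed key-value store (assumption).
kv_store = shard_cube(rows, "member_id")
print(len(kv_store[1]))  # 2
```

Each value holds everything needed to answer a query for one member, so a lookup touches a single key rather than many nodes.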
Avatara: Architecture
Avatara: Offline Batch Engine - Three Phases (driven by a simple configuration file)
● Phase 1: Preprocessing
○ preparing the data
○ using built-in functions to roll up data
○ customized scripting for further processing
● Phase 2: Projections and Joins
○ builds the dimension & fact tables
○ a join key ties dimension & fact tables
Avatara: Offline Batch Engine - Three Phases
● Phase 3: Cubification
○ partitions the data by cube shard key & produces small cubes
○ data can be retrieved in a single disk fetch for faster responses
○ cubes are bulk loaded into a distributed key-value store (i.e. Voldemort)
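The three phases can be sketched in a few lines of plain Python. This is only an in-memory approximation under stated assumptions: the real engine runs as Hadoop jobs driven by a configuration file, and the event fields, the `members` dimension table and the function names here are hypothetical.

```python
from collections import defaultdict

def preprocess(events):
    """Phase 1: roll raw view events up to one count per (viewee, viewer, day)."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["viewee_id"], e["viewer_id"], e["day"])] += 1
    return [{"viewee_id": k[0], "viewer_id": k[1], "day": k[2], "views": v}
            for k, v in counts.items()]

def join(facts, members):
    """Phase 2: attach dimension attributes to each fact via the join key viewer_id."""
    return [{**f, "viewer_industry": members[f["viewer_id"]]["industry"]} for f in facts]

def cubify(joined, shard_key):
    """Phase 3: partition the joined rows into one small cube per shard key."""
    cubes = defaultdict(list)
    for row in joined:
        cubes[row[shard_key]].append(row)
    return dict(cubes)

# Hypothetical raw events and dimension table.
events = [
    {"viewee_id": 1, "viewer_id": 7, "day": "2013-05-01"},
    {"viewee_id": 1, "viewer_id": 7, "day": "2013-05-01"},
    {"viewee_id": 1, "viewer_id": 8, "day": "2013-05-02"},
    {"viewee_id": 2, "viewer_id": 7, "day": "2013-05-02"},
]
members = {7: {"industry": "Software"}, 8: {"industry": "Finance"}}

cubes = cubify(join(preprocess(events), members), "viewee_id")
```

The resulting `cubes` mapping is what would be bulk loaded into the key-value store, one entry per shard key.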
Avatara: Online Query Engine
● Serves queries in real time
● Retrieves & processes data from the key-value store (i.e. Voldemort)
● Fast retrieval because of compact cubes per sharded key (e.g. member_id)
● SQL-like syntax for clients
● Supports select, where, group-by, having, order-by, etc.
● Simplifies development
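A minimal sketch of the query path described here: one key lookup, then filtering and aggregation over the small cube in memory. The `query_cube` function, its parameters and the sample data are assumptions for illustration; Avatara's actual SQL-like syntax is not shown in these slides.

```python
from collections import defaultdict

def query_cube(kv_store, key, where, group_by, measure):
    """One key lookup, then filter and aggregate the small cube in memory."""
    cube = kv_store.get(key, [])
    totals = defaultdict(int)
    for row in cube:
        if where(row):
            totals[row[group_by]] += row[measure]
    # Order by the aggregated measure, largest first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# A plain dict stands in for the key-value store; the rows are hypothetical.
kv_store = {
    1: [
        {"viewer_industry": "Software", "week": "2013-W01", "views": 3},
        {"viewer_industry": "Finance", "week": "2013-W01", "views": 1},
        {"viewer_industry": "Software", "week": "2013-W02", "views": 4},
    ]
}

result = query_cube(kv_store, key=1, where=lambda r: r["week"] >= "2013-W01",
                    group_by="viewer_industry", measure="views")
print(result)  # [('Software', 7), ('Finance', 1)]
```

Because the cube for one key is small, the where, group-by and order-by steps run over a handful of rows per request.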
Cube Thinning
● Avatara's mechanism for thinning cubes too large to process on page load (such as those of President Obama or LeBron James)
● Allows developers to do the following:
○ set priorities and constraints
■ on dimensions aggregated to a specific value (such as “other” category)
○ drop data across pre-defined dimensions
■ ex: WVMP can opt to drop data across time dimension
● resulting in a shorter history!
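The two thinning strategies can be sketched as follows. This is an illustrative approximation, not Avatara's implementation: the `thin_cube` signature, the `views` measure and the row-count limit are assumptions.

```python
from collections import defaultdict

def thin_cube(rows, max_rows, dim, keep_top, time_dim):
    """Thin an oversized cube: fold rare values of dim into "other", then drop old rows."""
    if len(rows) <= max_rows:
        return rows
    # Step 1: keep the keep_top largest values of dim, aggregate the rest as "other".
    totals = defaultdict(int)
    for r in rows:
        totals[r[dim]] += r["views"]
    top = set(sorted(totals, key=totals.get, reverse=True)[:keep_top])
    merged = defaultdict(int)
    for r in rows:
        value = r[dim] if r[dim] in top else "other"
        merged[(value, r[time_dim])] += r["views"]
    thinned = [{dim: k[0], time_dim: k[1], "views": v} for k, v in merged.items()]
    # Step 2: keep only the max_rows most recent rows (a shorter history).
    thinned.sort(key=lambda r: r[time_dim], reverse=True)
    return thinned[:max_rows]

# Hypothetical cube for one heavily viewed member.
rows = [
    {"industry": "Software", "week": "W1", "views": 5},
    {"industry": "Finance", "week": "W1", "views": 1},
    {"industry": "Retail", "week": "W1", "views": 1},
    {"industry": "Software", "week": "W2", "views": 4},
    {"industry": "Legal", "week": "W2", "views": 1},
]
small = thin_cube(rows, max_rows=3, dim="industry", keep_top=1, time_dim="week")
```

Folding the long tail into "other" keeps totals intact, while trimming the time dimension trades history length for a cube that fits within the page-load budget.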
In Summary
● Avatara, LinkedIn's in-house OLAP system, has been running at LinkedIn for several years
● Allows developers to build OLAP cubes with a single configuration file
● Hybrid offline/online strategy combined with sharding into a key-value store
● Powers large web-scale applications such as WVMP, WVTJ and Jobs You May Be Interested In
● Uses Hadoop as its batch computing infrastructure
● SQL-like query interaction
● The Hadoop batch engine can handle TBs of data and process it within hours
● Voldemort can respond to online queries in milliseconds
Future Work
○ Near real-time cubing
○ Streaming joins
○ Dimension and schema changes
Structure of Companies Data
Structure of Jobs Data
Structure of Person Data
HQL (Hive Query Language)
● Top companies with highest # of followers
● Top locations with highest job count
● Job title and count per location
● Top job titles recently listed
● Location of jobs listed “1 day ago”
● Comparison of # of connections of people with and without profile image
● Comparison of profile headlines with the highest connection count vs those with a lower connection count
● Query visualization done in Tableau
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT followercount, name
FROM (select followercount, name, rank() over (ORDER BY followercount DESC) as rank from companies) ranked_followers
WHERE ranked_followers.rank < 10
ORDER BY followercount DESC;
Top companies with highest number of followers (F1 ~ # of followers)
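The ranking queries in this deck rely on the SQL `rank()` window function followed by a cutoff on the rank. The sketch below mimics that behaviour in Python to show how ties and the cutoff interact; the `rank_desc` helper and the company rows are hypothetical, not the presenters' data.

```python
def rank_desc(rows, key):
    """SQL-style RANK() OVER (ORDER BY key DESC): ties share a rank, then ranks skip."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=True)
    ranked, prev, rank = [], None, 0
    for position, row in enumerate(ordered, start=1):
        if row[key] != prev:
            rank, prev = position, row[key]
        ranked.append({**row, "rank": rank})
    return ranked

# Hypothetical companies; A and B tie on follower count.
companies = [
    {"name": "A", "followercount": 900},
    {"name": "B", "followercount": 900},
    {"name": "C", "followercount": 500},
    {"name": "D", "followercount": 100},
]
top = [r for r in rank_desc(companies, "followercount") if r["rank"] < 3]
print([(r["name"], r["rank"]) for r in top])  # [('A', 1), ('B', 1)]
```

Note that a filter such as `rank < 10` keeps ranks 1 through 9, and that ties can make the number of returned rows larger or smaller than the cutoff suggests.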
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
Select location, jobcount
FROM (select location, rank() over (ORDER BY jobcount DESC) as rank, jobcount from companies) ranked_jobs
WHERE ranked_jobs.rank < 51
ORDER BY location, jobcount DESC;
Top locations that have the highest number of jobs (F2 ~ # of jobs)
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT c.location, j.jobTitle
FROM companies c left outer join jobs j on (c.location = j.location);
Join on the companies and jobs tables, selecting location and jobTitle (to look at the number of jobs listed in each area)
insert overwrite local directory '/usr/local/hql'
row format delimited fields terminated by ','
SELECT companyName, jobTitle, jobRecency
FROM (select companyName, jobTitle, rank() over (ORDER BY jobRecency DESC) as rank, jobRecency from jobs) ranked_jobTitles
WHERE ranked_jobTitles.rank < 11
ORDER BY jobTitle, jobRecency DESC;
Top job titles recently listed
insert overwrite local directory '/usr/local/hql' row format delimited fields terminated by ',' select location, companyName, jobTitle from jobs where jobRecency="1 day ago";
Locations of jobs listed 1 day ago
insert overwrite local directory '/usr/local/hql' select count(*), sum(connectionCount) from person where imageUrl != "undefined";
insert overwrite local directory '/usr/local/hql' select count(*), sum(connectionCount) from person where imageUrl = "undefined";
● Comparison: # of connections of people with and without a profile photo on their page
● ratio 5 : 454
● on average, those
○ w/out profile pic: ~470 connections
○ with profile pic: ~394 connections
● a difference of ~76 connections!
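Each of the two queries returns a row count and a connection sum for its group, so the averages quoted here come from dividing the sum by the count. A minimal sketch of that arithmetic follows; the totals are placeholder values chosen for illustration, since the raw query output is not shown in the slides.

```python
def average_connections(row_count, connection_sum):
    """Average connections per person from a (count(*), sum(connectionCount)) result row."""
    if row_count == 0:
        raise ValueError("empty group")
    return connection_sum / row_count

# Placeholder result rows, not the presenters' data.
with_photo = average_connections(4, 1600)
without_photo = average_connections(2, 900)
difference = without_photo - with_photo
print(with_photo, without_photo, difference)  # 400.0 450.0 50.0
```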
insert overwrite local directory '/usr/local/hql'
select connectionCount, firstName, headline from person where connectionCount > 500;
Profile Headlines with Highest Connections
insert overwrite local directory '/usr/local/hql'
select connectionCount, firstName, headline from person where connectionCount < 200;
Profile Headlines with Lowest Connections
Interested in trying on your own?
Links:
FireBug add-on to FireFox:
https://addons.mozilla.org/en-us/firefox/addon/firebug/
Jase Clamp tutorial “Extracting Data From LinkedIn”:
https://www.youtube.com/watch?v=S-9BWrtxoDw
Data Extraction Script on Github:
https://gist.github.com/jaseclamp/2c74062bac1cc4dd929f
Tableau Download:
http://www.tableau.com/products/desktop/download?os=windows
Sources
1. http://vldb.org/pvldb/vol5/p1874_liliwu_vldb2012.pdf
2. http://www.slideshare.net/liliwu/avatara-olap-for-webscale-analytics-products
3. https://ourstory.linkedin.com/#year-2004
4. http://www.slideshare.net/MichaelLi17/how-business-analytics-drives-business-value-teradata-partners-conference-nashvile-2014?next_slideshow=1
5. https://engineering.linkedin.com/olap/avatara-olap-web-scale-analytics-products
6. https://www.youtube.com/watch?v=9s-vSeWej1U