LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Data Infrastructure at LinkedIn. Jun Rao and Sam Shah. LinkedIn Confidential ©2013 All Rights Reserved

Upload: jun-rao

Post on 11-May-2015


DESCRIPTION

This is the presentation at analytics@webscale in 2013 (http://analyticswebscale.splashthat.com/?em=187&utm_campaign=website&utm_source=sg&utm_medium=em)

TRANSCRIPT

Page 1: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Data Infrastructure at LinkedIn

Jun Rao and Sam Shah

Page 2: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Outline

1. LinkedIn introduction

2. Online/nearline infrastructure

3. Offline infrastructure

4. Conclusion

Page 3: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

The World’s Largest Professional Network

200M+ Members Worldwide

2 New Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…

Page 4: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Two Product Families

• For Members (Professionals): People You May Know, Who’s Viewed My Profile, Jobs You May Be Interested In, News/Sharing, Today, Search, Subscriptions

• For Partners (Companies): Hire, Market, Sell

• Data: Connections, Profiles, Actions, Content

• Data Infrastructure; Science and Analytics

Page 5: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

The Big-Data Feedback Loop

[Diagram: a feedback loop connecting Member, Product, Data, Science, Infrastructure and Analytics; other labels: Engagement, Virality, Signals, Refinement, Value, Insights, Scale]

Page 6: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

LinkedIn Data Infrastructure: Three-Phase Abstraction

[Diagram: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure | Latency & Freshness Requirements | Products

Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills

Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages

Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…

Page 7: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms

[Diagram: sample stack; the only surviving label is Databus]

Page 8: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

• Key/value access at scale

Page 9: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Voldemort: Architecture

• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”

• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+
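
The "tunable consistency / availability" bullet can be illustrated with a quorum-style sketch. This is a toy model, not Voldemort's code: the class names, the integer version numbers, and the `up` flag that simulates node failure are all assumptions made for illustration. The idea it shows is that with N replicas, requiring W write acks and R read replies where R + W > N means every successful read overlaps at least one replica that saw the latest successful write.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One in-memory replica; 'up' simulates node availability (hypothetical)."""
    data: dict = field(default_factory=dict)
    up: bool = True


class QuorumStore:
    """Toy quorum-replicated key/value store (illustrative only)."""

    def __init__(self, n, r, w):
        self.replicas = [Replica() for _ in range(n)]
        self.r = r  # replies required for a read to succeed
        self.w = w  # acks required for a write to succeed

    def put(self, key, value, version):
        acks = 0
        for rep in self.replicas:
            if not rep.up:
                continue
            current = rep.data.get(key)
            # Only overwrite with a newer version.
            if current is None or current[0] < version:
                rep.data[key] = (version, value)
            acks += 1
        # A real store would also have to repair partial writes; this toy does not.
        if acks < self.w:
            raise RuntimeError(f"only {acks} acks, need {self.w}")

    def get(self, key):
        replies = [rep.data.get(key) for rep in self.replicas if rep.up]
        if len(replies) < self.r:
            raise RuntimeError(f"only {len(replies)} replies, need {self.r}")
        found = [reply for reply in replies if reply is not None]
        if not found:
            return None
        # Return the value carrying the highest version among the replies.
        return max(found, key=lambda reply: reply[0])[1]
```

Raising R and W favors consistency; lowering them lets reads and writes succeed with more replicas down, which favors availability.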

Page 10: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions

• Fills the gap between Oracle and the KV store

Page 11: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Espresso: System Components

• Hierarchical data model
• Timeline consistency
• Rich functionality
  • Transactions
  • Secondary index
  • Text search
• Partitioning/replication
• Change propagation
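
As a rough illustration of a hierarchical data model combined with a secondary index, the sketch below keys each document by a two-level path (a resource id and a sub id) and maintains an index on one field, scoped to the resource. Everything here, including the `HierarchicalTable` name and the mailbox-style sample data, is hypothetical and is not Espresso's API.

```python
from collections import defaultdict


class HierarchicalTable:
    """Toy table keyed by (resource_id, sub_id) with one secondary index."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.rows = {}                  # (resource_id, sub_id) -> document
        self.index = defaultdict(set)   # (resource_id, field value) -> sub_ids

    def put(self, resource_id, sub_id, doc):
        old = self.rows.get((resource_id, sub_id))
        if old is not None:
            # Drop the stale index entry before re-indexing the document.
            self.index[(resource_id, old[self.indexed_field])].discard(sub_id)
        self.rows[(resource_id, sub_id)] = doc
        self.index[(resource_id, doc[self.indexed_field])].add(sub_id)

    def get(self, resource_id, sub_id):
        return self.rows.get((resource_id, sub_id))

    def children(self, resource_id):
        """List all sub ids stored under one resource."""
        return sorted(s for (r, s) in self.rows if r == resource_id)

    def query(self, resource_id, value):
        """Secondary-index lookup, scoped to a single resource."""
        return sorted(self.index.get((resource_id, value), set()))


table = HierarchicalTable(indexed_field="folder")
table.put("bob", 1, {"folder": "inbox", "subject": "hi"})
table.put("bob", 2, {"folder": "sent", "subject": "re: hi"})
inbox_ids = table.query("bob", "inbox")
```

Scoping the index to the top-level key means a query never has to leave the partition that owns that key, which is one reason a hierarchical model pairs well with partitioning.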

Page 12: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Generic Cluster Manager: Helix

• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
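
One way to picture a generic distributed state model: each replica is a small state machine, and a controller works out which legal transitions move a replica from its current state to its target state. The sketch below assumes a master/slave-style model with OFFLINE, SLAVE and MASTER states; the transition table and function names are illustrative and are not the Helix API.

```python
from collections import deque

# Legal transitions of an assumed master/slave-style state model.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current, target):
    """Return the shortest sequence of states from current to target (BFS)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no legal path from {current} to {target}")


def plan(current_states, target_states):
    """Map each replica name to the state sequence it must walk through."""
    return {
        replica: transition_path(state, target_states[replica])
        for replica, state in current_states.items()
    }
```

Because the transition table is data rather than code, the same planner could serve different systems by swapping in a different state model, which is the sense in which the model is generic.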

Page 13: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Databus: Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

• Deliver data store changes to apps

Page 14: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Databus at LinkedIn

[Diagram: DB → Relay ("Event Win", capture changes) → Databus client lib → Consumer 1 … Consumer n, via on-line changes; a Bootstrap component with its own DB serves clients a consistent snapshot at U or a compressed delta since T]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency - milliseconds
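
The relay and bootstrap roles described on this slide can be sketched as follows. This is a simplified model under stated assumptions, not Databus code: the class names, the integer sequence numbers, and the use of a plain dict as the bootstrap snapshot are all hypothetical. A consumer that is within the relay's bounded window receives changes in commit order; one that has fallen behind the window loads a snapshot instead.

```python
from collections import deque


class Relay:
    """Keeps the most recent changes in a bounded in-memory window."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # (scn, key, value), oldest first
        self.snapshot = {}                       # stands in for the bootstrap store
        self.last_scn = 0

    def capture(self, key, value):
        """Record one committed change under the next sequence number."""
        self.last_scn += 1
        self.window.append((self.last_scn, key, value))
        self.snapshot[key] = value

    def changes_since(self, scn):
        """Return changes with sequence number > scn, or None if some were evicted."""
        oldest = self.window[0][0] if self.window else self.last_scn + 1
        if scn + 1 < oldest:
            return None
        return [event for event in self.window if event[0] > scn]


class ChangeConsumer:
    def __init__(self):
        self.checkpoint = 0
        self.state = {}

    def poll(self, relay):
        events = relay.changes_since(self.checkpoint)
        if events is None:
            # Fell out of the relay window: bootstrap from a consistent snapshot.
            self.state = dict(relay.snapshot)
            self.checkpoint = relay.last_scn
            return
        for scn, key, value in events:  # applied in commit order
            self.state[key] = value     # idempotent, so redelivery is harmless
            self.checkpoint = scn       # advance only after the change is applied
```

Advancing the checkpoint only after applying a change is what makes delivery at least once: a consumer that crashes mid-batch resumes from its last checkpoint and may see some changes again, so the apply step has to tolerate repeats.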

Page 15: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions

• Log aggregation and queuing

Page 16: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

Kafka Architecture

[Diagram: Producers and Consumers connect to Broker 1 through Broker 4, with Zookeeper alongside; partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2 each appear on three of the four brokers]

Key features
• Scale-out architecture
• Automatic load balancing
• High throughput/low latency
• Rewindability
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
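
Rewindability follows from treating each partition as an append-only log in which the consumer, not the broker, tracks its read position. The sketch below is a minimal in-memory model of that idea and does not use the Kafka client API; `Topic`, `LogConsumer`, and the CRC32-based partitioner are assumptions made for illustration.

```python
import zlib


class Topic:
    """Toy topic: a list of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        """Append to the partition chosen by a stable hash of the key."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(message)
        return partition, len(log) - 1  # (partition, offset of the new message)

    def fetch(self, partition, offset, max_messages=100):
        """Read messages starting at offset; reading never removes them."""
        return self.partitions[partition][offset:offset + max_messages]


class LogConsumer:
    """Consumer that owns its offset, so it can re-read old messages."""

    def __init__(self, topic, partition):
        self.topic = topic
        self.partition = partition
        self.offset = 0

    def poll(self):
        batch = self.topic.fetch(self.partition, self.offset)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        self.offset = offset
```

Since messages with the same key land in the same partition, per-key ordering is preserved, and adding partitions spreads load across brokers, which is one reading of the scale-out bullet above.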

Page 17: LinkedIn Infrastructure (analytics@webscale, at fb 2013)

LinkedIn Data Infrastructure: A few take-aways

1. Building infrastructure in a hyper-growth environment is challenging.

2. Few vs. many: balance over-specialized (agile) platforms against generic (leverage-able) ones (*)

3. Balance open-source products with home-grown platforms (**)