Data Infrastructure at LinkedIn

Jun Rao and Sam Shah. LinkedIn Confidential ©2013 All Rights Reserved

Upload: amy-w-tang

Post on 11-May-2015


DESCRIPTION

This talk was given by Jun Rao (Staff Software Engineer at LinkedIn) and Sam Shah (Senior Engineering Manager at LinkedIn) at the Analytics@Webscale Technical Conference (June 2013).

TRANSCRIPT

Page 1: Data Infrastructure at LinkedIn

Data Infrastructure at LinkedIn

Jun Rao and Sam Shah

Page 2: Data Infrastructure at LinkedIn


Outline

1. LinkedIn introduction

2. Online/nearline infrastructure overview

3. Infrastructure for data mining

4. Conclusion

Page 3: Data Infrastructure at LinkedIn

The World’s Largest Professional Network

200M+ Members Worldwide

2 New Members Per Second

100M+ Monthly Unique Visitors

2M+ Company Pages

Connecting Talent with Opportunity. At scale…

Page 4: Data Infrastructure at LinkedIn

Member Profiles

Large dataset

Medium writes

Very high reads

Freshness <1s

Page 5: Data Infrastructure at LinkedIn

People You May Know

Large dataset

Compute intensive

High reads

Freshness ~hrs

Page 6: Data Infrastructure at LinkedIn

LinkedIn Today

Moving dataset

High writes

High reads

Freshness ~mins

Page 7: Data Infrastructure at LinkedIn


The Big-Data Feedback Loop

[Diagram labels: Value, Insights, Scale, Product, Science, Data, Member, Engagement, Virality, Signals, Refinement, Infrastructure, Analytics]

Page 8: Data Infrastructure at LinkedIn


LinkedIn Data Infrastructure: Three-Phase Abstraction

[Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure | Latency & Freshness Requirements | Products

Online | Activity that should be reflected immediately | Member Profiles, Company Profiles, Connections, Messages, Endorsements, Skills

Near-Line | Activity that should be reflected soon | Activity Streams, Profile Standardization, News, Recommendations, Search, Messages

Offline | Activity that can be reflected later | People You May Know, Connection Strength, News, Recommendations, Next best idea…

Page 9: Data Infrastructure at LinkedIn


LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms

Databus

Page 10: Data Infrastructure at LinkedIn


Streaming Transactions

Page 11: Data Infrastructure at LinkedIn

Databus : Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

Page 12: Data Infrastructure at LinkedIn

Databus at LinkedIn

[Diagram labels: DB, Capture Changes, On-line Changes, Bootstrap DB, Compressed Delta Since T, Consistent Snapshot at U]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency - milliseconds
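At-least-once, in-order delivery means a consumer can see the same change twice after a retry, so consumers usually need to be idempotent. A minimal sketch of that idea, assuming each change carries a monotonically increasing sequence number (the `scn` field and the class names here are hypothetical, not the Databus client API):

```python
from dataclasses import dataclass, field


@dataclass
class ChangeEvent:
    scn: int    # hypothetical monotonically increasing sequence number
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    """Applies each change once, even if the stream redelivers events."""
    last_applied_scn: int = -1
    state: dict = field(default_factory=dict)

    def on_event(self, event: ChangeEvent) -> bool:
        # With in-order delivery, anything at or below the checkpoint
        # is a redelivery and can be skipped safely.
        if event.scn <= self.last_applied_scn:
            return False
        self.state[event.key] = event.value
        self.last_applied_scn = event.scn
        return True


consumer = IdempotentConsumer()
stream = [
    ChangeEvent(1, "member:1213", "v1"),
    ChangeEvent(2, "member:1214", "v1"),
    ChangeEvent(2, "member:1214", "v1"),  # redelivered after a retry
    ChangeEvent(3, "member:1213", "v2"),
]
applied = [consumer.on_event(e) for e in stream]
```

Here the third event is ignored because its sequence number does not advance the checkpoint.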

[Diagram labels: Relay, Event Win, Databus Client Lib, Client, Consumer 1, Consumer n]

Page 13: Data Infrastructure at LinkedIn


Scaling Core Databases

[Diagram labels: RO, RO, RO]

Page 14: Data Infrastructure at LinkedIn


Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

Page 15: Data Infrastructure at LinkedIn

Voldemort: Architecture

• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”

• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+
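One common way to expose "tunable consistency / availability" in a replicated key/value store is through three numbers: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. The sketch below assumes that N/R/W model; the function names are illustrative and not taken from the slides.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum shares at least one replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, r: int, w: int) -> dict:
    """How many replicas can be down while reads or writes still succeed."""
    return {"reads": n - r, "writes": n - w}


strict = quorums_overlap(3, 2, 2)      # reads see the latest acknowledged write
relaxed = quorums_overlap(3, 1, 1)     # faster, but reads may be stale
slack = tolerated_failures(3, 2, 2)
```

Raising R and W buys consistency at the cost of availability, and lowering them does the reverse.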

Page 16: Data Infrastructure at LinkedIn


Streaming Non-transactional Events

Offline

Nearline

Processing

Page 17: Data Infrastructure at LinkedIn


Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions

Page 18: Data Infrastructure at LinkedIn

Kafka Architecture

[Diagram labels: Producers, Consumers, Zookeeper, Broker 1, Broker 2, Broker 3, Broker 4; partitions topic1-part1, topic1-part2, topic2-part1, topic2-part2 appear on several brokers]

Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
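The scale-out design rests on splitting each topic into partitions and spreading those partitions over brokers. A minimal sketch of the two mappings involved, assuming a CRC32-based key hash and a round-robin placement; both are simplifications chosen for illustration, not Kafka's actual partitioner or assignment logic:

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index in a stable way."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(topics: dict, brokers: list) -> dict:
    """Spread topic partitions over brokers round-robin (illustrative only)."""
    assignment = {broker: [] for broker in brokers}
    slot = 0
    for topic in sorted(topics):
        for part in range(1, topics[topic] + 1):
            broker = brokers[slot % len(brokers)]
            assignment[broker].append(f"{topic}-part{part}")
            slot += 1
    return assignment


layout = assign_partitions(
    {"topic1": 2, "topic2": 2},
    ["broker1", "broker2", "broker3", "broker4"],
)
chosen = partition_for("member:1213", 2)
```

Because the key hash is stable, all messages for one key land in the same partition, which keeps them ordered relative to each other.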

Page 19: Data Infrastructure at LinkedIn


Filling in the Data Store Gap

Text Search

Page 20: Data Infrastructure at LinkedIn


Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions

Page 21: Data Infrastructure at LinkedIn

Application View

Hierarchical data model

Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters

Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
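To make the per-resource operations concrete, here is a small in-memory sketch of conditional updates, partial updates, and atomic counters. The `ResourceStore` class, its method names, and the version-number scheme are assumptions made for illustration; they are not the Espresso API.

```python
import threading


class ResourceStore:
    """Toy document store keyed by a hierarchical resource path."""

    def __init__(self):
        self._docs = {}                  # path -> (version, document)
        self._lock = threading.Lock()

    def get(self, path):
        with self._lock:
            version, doc = self._docs.get(path, (0, {}))
            return version, dict(doc)

    def put_if_version(self, path, doc, expected_version):
        """Conditional update: write only if nobody changed the resource."""
        with self._lock:
            version, _ = self._docs.get(path, (0, {}))
            if version != expected_version:
                return False
            self._docs[path] = (version + 1, dict(doc))
            return True

    def patch(self, path, fields):
        """Partial update: merge the given fields into the document."""
        with self._lock:
            version, doc = self._docs.get(path, (0, {}))
            self._docs[path] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, path, field, delta=1):
        """Atomic counter: read-modify-write under one lock."""
        with self._lock:
            version, doc = self._docs.get(path, (0, {}))
            value = doc.get(field, 0) + delta
            self._docs[path] = (version + 1, {**doc, field: value})
            return value


store = ResourceStore()
path = "/Music/Album/1"   # hypothetical resource path
first = store.put_if_version(path, {"title": "A"}, expected_version=0)
stale = store.put_if_version(path, {"title": "B"}, expected_version=0)
store.patch(path, {"year": 2013})
plays = store.increment(path, "plays", 3)
```

The second `put_if_version` call fails because the resource has already moved past version 0.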

Page 22: Data Infrastructure at LinkedIn

Espresso: System Components

• Partitioning/replication
• Timeline consistency
• Change propagation

Page 23: Data Infrastructure at LinkedIn

Generic Cluster Manager: Helix

• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
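A "generic distributed state model" can be pictured as a table of legal state transitions plus a controller that walks each replica from its current state to a target state. The sketch below assumes a three-state model with illustrative state names; it is not the Helix API.

```python
from collections import deque

# Illustrative state model: which transitions a replica may take.
TRANSITIONS = {
    "OFFLINE": ["SLAVE"],
    "SLAVE": ["MASTER", "OFFLINE"],
    "MASTER": ["SLAVE"],
}


def transition_path(current: str, target: str) -> list:
    """Shortest sequence of legal states from current to target (BFS)."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TRANSITIONS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise ValueError(f"no legal path from {current} to {target}")


promote = transition_path("OFFLINE", "MASTER")
retire = transition_path("MASTER", "OFFLINE")
```

A new replica cannot jump straight to mastership here; it has to pass through the intermediate state first, which is where catch-up work would happen.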

Page 24: Data Infrastructure at LinkedIn

Infrastructure challenges in large-scale data mining

Putting it together

Page 25: Data Infrastructure at LinkedIn

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Page 26: Data Infrastructure at LinkedIn

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Page 27: Data Infrastructure at LinkedIn


LinkedIn circa 2010

Page 28: Data Infrastructure at LinkedIn

O(n²) data integration complexity

Page 29: Data Infrastructure at LinkedIn

Infrastructure fragility

• Can’t get all data
• Hard to operate
• Multi-hour delay
• Labor intensive
• Slow
• Does it work?

Page 30: Data Infrastructure at LinkedIn

Process fragility

• Labor intensive
• One man’s cleaning…

Page 31: Data Infrastructure at LinkedIn

Data model

{ tracking_code=null, session_id=42, tracking_time=Tue Jul 31 07:27:25 PDT 2010, error_key=null, locale=en_us, browser_id=ddc61a81-5311-4859-be42-ca7dc7b941e3, member_id=1213, page_key=profile, tracking_info=Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|, error_id=null, page_type=FULL_PAGE, request_path=view ...}

Page 32: Data Infrastructure at LinkedIn

Data model (cont’d)

{ article_id=5560874437395353942, title=Five Good Reasons to Hire the Unemployed, language=en_US, article_source=bit.ly,url=aHR0cDovL3d3dy5vbmV0aGluZ25ldy5jb20vaW5kZXgucGhwL3dvcmsvMTAyLWZpdmUtZ29vZC1yZWFzb25zLXRvLWhpcmUtdGhlLXVuZW1wbG95ZWQK,...}

Page 33: Data Infrastructure at LinkedIn

Problems

1 Data integration across systems

2 Fragile infrastructure

3 Lack of proper data models (ad-hoc)

Page 34: Data Infrastructure at LinkedIn


LinkedIn 2013

Page 35: Data Infrastructure at LinkedIn

O(n) data integration
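The gap between quadratic and linear integration is easy to see with a quick count. The sketch assumes the worst case for point-to-point integration (every system feeds every other system) and assumes that with a central pipeline each system needs one feed in and one feed out; the exact counts are illustrative.

```python
def point_to_point_pipelines(n: int) -> int:
    """Every system sends data directly to every other system."""
    return n * (n - 1)


def central_pipeline_connections(n: int) -> int:
    """Every system publishes to and reads from one shared pipeline."""
    return 2 * n


for systems in (5, 10, 50):
    print(systems,
          point_to_point_pipelines(systems),
          central_pipeline_connections(systems))
```

Adding one more system to a set of ten adds 20 point-to-point pipelines under these assumptions, but only 2 connections to a central pipeline.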

Page 36: Data Infrastructure at LinkedIn

Publish/subscribe commit log
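A publish/subscribe commit log is an append-only sequence of records in which each subscriber keeps its own read position. The toy version below is a sketch of that idea only; the class names are hypothetical, and a real system adds persistence, partitioning, and retention.

```python
class CommitLog:
    """Append-only log; records are addressed by offset."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        self._records.append(record)
        return len(self._records) - 1     # offset of the new record

    def read(self, offset: int, max_records: int = 100) -> list:
        return self._records[offset:offset + max_records]


class Subscriber:
    """Tracks its own offset, so subscribers never block each other."""

    def __init__(self, log: CommitLog):
        self.log = log
        self.offset = 0

    def poll(self, max_records: int = 100) -> list:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ("page_view", "profile_edit", "search"):
    log.append(event)

fast, slow = Subscriber(log), Subscriber(log)
fast_batch = fast.poll()
slow_batch = slow.poll(max_records=1)
```

Because the log keeps the records and each subscriber only keeps an offset, a slow offline consumer and a fast nearline consumer can read the same stream at different speeds.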

Page 37: Data Infrastructure at LinkedIn

Data model

• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?

Page 38: Data Infrastructure at LinkedIn

Data model

1 Education

2 Push data cleanliness upstream

3 O(1) ETL

4 Evidence-based correctness

Page 39: Data Infrastructure at LinkedIn

Data model

• DDL for data definition and schema
• Central versioned registry of all schemas
• Schema review
• Programmatic compatibility model
– Schema changes handled transparently
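A versioned registry with a programmatic compatibility check might look roughly like the sketch below. The schema representation (a dict of field name to type and optional default) and the two compatibility rules are assumptions chosen for illustration, not the rules the talk's system enforces.

```python
def compatibility_problems(old: dict, new: dict) -> list:
    """List reasons the new schema cannot read data written with the old one."""
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Old records lack this field, so readers need a default.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif old[name]["type"] != spec["type"]:
            problems.append(f"field '{name}' changed type")
    return problems


class SchemaRegistry:
    """Keeps every accepted version of every schema."""

    def __init__(self):
        self._versions = {}

    def register(self, name: str, schema: dict) -> int:
        versions = self._versions.setdefault(name, [])
        if versions:
            problems = compatibility_problems(versions[-1], schema)
            if problems:
                raise ValueError("; ".join(problems))
        versions.append(schema)
        return len(versions)              # version number, starting at 1


registry = SchemaRegistry()
v1 = {"member_id": {"type": "int"}, "page_key": {"type": "string"}}
v2 = {**v1, "locale": {"type": "string", "default": "en_us"}}
bad = {**v2, "member_id": {"type": "string"}}

first_version = registry.register("PageViewEvent", v1)
second_version = registry.register("PageViewEvent", v2)
```

Registering `bad` would raise a `ValueError`, which is the point at which a schema review would step in.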

Page 40: Data Infrastructure at LinkedIn

Workflow

1 Check in schema

2 Code review

3 Ship

Seamless data load into downstream systems

Page 41: Data Infrastructure at LinkedIn

Audit trail

Page 42: Data Infrastructure at LinkedIn

Result: complete, verified copy of all data available
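One way to back a claim of a "verified copy" is to count messages per time window at the producer and again at the destination, then compare the two. The sketch below assumes 10-minute windows and plain integer timestamps in seconds; the function names and window size are illustrative.

```python
from collections import Counter


def count_by_window(timestamps, window_seconds=600) -> Counter:
    """Count events per window, keyed by the window's start time."""
    return Counter(ts - ts % window_seconds for ts in timestamps)


def audit(produced, consumed, window_seconds=600) -> dict:
    """Return windows whose produced and consumed counts disagree."""
    sent = count_by_window(produced, window_seconds)
    received = count_by_window(consumed, window_seconds)
    return {
        window: (sent[window], received[window])
        for window in sorted(set(sent) | set(received))
        if sent[window] != received[window]
    }


produced = [0, 10, 599, 600, 1200]
consumed = [0, 10, 600, 1200]       # one message from the first window is missing
mismatches = audit(produced, consumed)
```

An empty result means every window reconciles; here the first window is flagged because three messages were produced and only two arrived.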

Page 43: Data Infrastructure at LinkedIn

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Page 44: Data Infrastructure at LinkedIn

Egress

store DATA into 'kafka://…' using Stream();

Page 45: Data Infrastructure at LinkedIn

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Page 46: Data Infrastructure at LinkedIn


Workflows

Job A

Job B

Job C

Page 47: Data Infrastructure at LinkedIn


Workflows

Job A

Job B

Job C

Push to Production

Page 48: Data Infrastructure at LinkedIn


Workflows

Job A

Job B

Job C

Push to Production

Job X

Page 49: Data Infrastructure at LinkedIn


Workflows

Job A

Job B

Job C

Push to Production

Job X

Push to QA

Page 50: Data Infrastructure at LinkedIn


Real workflows are complicated

Page 51: Data Infrastructure at LinkedIn


Workflow management: Azkaban

• Dependency management
• Diverse job types (Pig, Hive, Java, . . . )
• Scheduling
• Monitoring
• Configuration
• Retry/restart on failure
• Resource locking
• Log collection
• Historical information
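Dependency management comes down to running jobs in an order that respects the workflow graph. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the dependency edges between the jobs are assumed for illustration, and this is not Azkaban's job definition format.

```python
from graphlib import CycleError, TopologicalSorter

# Each job maps to the set of jobs it depends on (assumed edges).
workflow = {
    "Job A": set(),
    "Job B": set(),
    "Job C": {"Job A", "Job B"},
    "Push to Production": {"Job C"},
}


def execution_order(dependencies: dict) -> list:
    """Return a job order in which every dependency runs first."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)


def has_cycle(dependencies: dict) -> bool:
    """A cyclic workflow can never be scheduled."""
    try:
        execution_order(dependencies)
        return False
    except CycleError:
        return True
```

Jobs with no path between them, such as Job A and Job B here, may appear in either order and could run in parallel.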

Page 52: Data Infrastructure at LinkedIn


Workflow management: Azkaban

Page 53: Data Infrastructure at LinkedIn


Workflow management: Azkaban

Page 54: Data Infrastructure at LinkedIn

Top complaints from data scientists

1 Getting the data in (Ingress ETL)

2 Getting the data out (Egress)

3 Workflow management

4 Model of computation

5 …

Page 55: Data Infrastructure at LinkedIn

Model of computation

Graphs

Distributed learning
• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)

Near-line processing
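To give a feel for ADMM, the first distributed-learning method listed, here is the textbook consensus form applied to a toy problem: each worker i holds a private number a_i and minimizes (x - a_i)²/2 locally, while all workers must agree on one shared value z. The problem, the parameter names, and the choice of rho are assumptions for illustration; this is not the implementation from the talk.

```python
def consensus_admm(local_targets, rho=1.0, iterations=100) -> float:
    """Consensus ADMM (scaled form) for minimizing sum_i (x - a_i)^2 / 2."""
    n = len(local_targets)
    x = [0.0] * n        # each worker's local estimate
    u = [0.0] * n        # each worker's scaled dual variable
    z = 0.0              # shared consensus value
    for _ in range(iterations):
        # Local step: every worker solves its own small problem in parallel.
        x = [(a + rho * (z - u_i)) / (1.0 + rho)
             for a, u_i in zip(local_targets, u)]
        # Global step: average the workers' proposals.
        z = sum(x_i + u_i for x_i, u_i in zip(x, u)) / n
        # Dual step: penalize each worker's disagreement with the consensus.
        u = [u_i + x_i - z for x_i, u_i in zip(x, u)]
    return z


estimate = consensus_admm([1.0, 2.0, 9.0])
```

For this objective the consensus value converges to the mean of the private numbers (4.0 for the inputs above), and only the small vectors x, u, and z move between workers, never the raw data.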

Page 56: Data Infrastructure at LinkedIn


LinkedIn Data Infrastructure: A few take-aways

1. Building infrastructure in a hyper-growth environment is challenging.

2. Few vs Many: Balance over-specialized (agile) vs generic efforts (leverage-able) platforms (*)

3. Balance open-source products with home-grown platforms (**)

4. Data Model and Integration e2e are key (*)

Page 57: Data Infrastructure at LinkedIn


Learning more

data.linkedin.com