The Evolution of Data Infrastructure at LinkedIn Lei Gao http://www.linkedin.com/in/gaolei LinkedIn Confidential ©2013 All Rights Reserved

Upload: ann-nicholson

Post on 17-Dec-2015


TRANSCRIPT

Page 1: The Evolution of Data Infrastructure at LinkedIn LinkedIn Confidential ©2013 All Rights Reserved

The Evolution of Data Infrastructure at LinkedIn

Lei Gao http://www.linkedin.com/in/gaolei

Page 2

Outline

1. Company and Mission

2. Products and Science

3. Data Infrastructure

4. Conclusion

Page 3

The World’s Largest Professional Network

225M+ Members Worldwide

2+ New Members Per Second

132M+ Monthly Unique Visitors

2.9M+ Company Pages

Connecting the world’s professionals to make them more productive and successful

Page 4

Member Profiles

Large dataset

Medium writes

Very high reads

Freshness <1s

Page 5

People You May Know

Large dataset

Compute intensive

High reads

Freshness ~hrs

Page 6

LinkedIn Today

Moving dataset

High writes

High reads

Freshness ~mins

Page 7

LinkedIn Data Infrastructure: Three-Phase Abstraction

[Diagram labels: Users, Application, Online Data Infra, Near-Line Infra, Offline Data Infra]

Infrastructure / Latency & Freshness Requirements / Products

Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Messages
• Endorsements
• Skills

Near-Line: activity that should be reflected soon
• Activity Streams
• Profile Standardization
• News
• Recommendations
• Search
• Messages

Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea…
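The three-phase split can be read as a rule that maps a product's freshness requirement to a tier. The sketch below is only an illustration of that reading: the function name `choose_tier` and both cut-off values are assumptions, since the slides say only "immediately", "soon" and "later" and give the per-product figures of <1s, ~mins and ~hrs.

```python
from datetime import timedelta

# Hypothetical cut-offs; the slides do not state numeric boundaries.
ONLINE_MAX = timedelta(seconds=1)
NEARLINE_MAX = timedelta(minutes=30)


def choose_tier(freshness: timedelta) -> str:
    """Return the tier whose freshness budget covers the requirement."""
    if freshness <= ONLINE_MAX:
        return "online"
    if freshness <= NEARLINE_MAX:
        return "near-line"
    return "offline"


# Requirements in the spirit of the three product slides.
for name, need in [
    ("Member Profiles", timedelta(milliseconds=500)),
    ("LinkedIn Today", timedelta(minutes=5)),
    ("People You May Know", timedelta(hours=6)),
]:
    print(name, "->", choose_tier(need))
```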

Page 8

The Big-Data Feedback Loop

[Diagram labels: Member, Product, Science, Data, Infrastructure, Analytics, Engagement, Virality, Signals, Refinement, Value, Insights, Scale]

Page 9

LinkedIn Data Infrastructure: Sample Stack

Infra challenges in 3-phase ecosystem are diverse, complex and specific

Some off-the-shelf. Significant investment in home-grown, deep and interesting platforms

Databus

Page 10

The Original RDBMS Model

Page 11

Streaming Transactions for Search/Connections

Page 12

Databus: Timeline-Consistent Change Data Capture

LinkedIn Data Infrastructure Solutions

Page 13

Streaming Transactions for Search/Connections

[Diagram: three nodes labeled RO]

Page 14

Databus at LinkedIn

[Diagram labels: DB, Capture Changes, Relay, Event Win, On-line Changes, Bootstrap, DB, Compressed Delta Since T, Consistent Snapshot at U, Client, Databus Client Lib, Consumer 1 … Consumer n]

Transport independent of data source: Oracle, MySQL, …

Transactional semantics

In order, at least once delivery

Tens of relays

Hundreds of sources

Low latency - milliseconds
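The delivery guarantees on this slide (in order, at least once, with a bootstrap path that serves a consistent snapshot) can be illustrated with a toy model. This is a sketch under stated assumptions, not Databus code: the classes `Relay` and `Client`, the contiguous integer sequence numbers (`scn`) and the bounded in-memory buffer are all invented for illustration.

```python
from collections import deque


class Relay:
    """Keeps only the most recent change events in a bounded buffer."""

    def __init__(self, window_size):
        # Each event is (scn, key, value); scn values are assumed contiguous.
        self.window = deque(maxlen=window_size)

    def capture(self, scn, key, value):
        self.window.append((scn, key, value))

    def events_since(self, scn):
        """Events newer than scn, or None if some already left the buffer."""
        oldest = self.window[0][0] if self.window else scn + 1
        if scn + 1 < oldest:
            return None
        return [e for e in self.window if e[0] > scn]


class Client:
    def __init__(self):
        self.checkpoint = 0  # highest scn applied so far
        self.state = {}

    def apply(self, scn, key, value):
        # At-least-once delivery may redeliver, so skip events already applied.
        if scn <= self.checkpoint:
            return
        self.state[key] = value
        self.checkpoint = scn

    def catch_up(self, relay, snapshot, snapshot_scn):
        events = relay.events_since(self.checkpoint)
        if events is None:
            # Too far behind: load a consistent snapshot, then resume from it.
            self.state = dict(snapshot)
            self.checkpoint = snapshot_scn
            events = relay.events_since(self.checkpoint) or []
        for event in events:
            self.apply(*event)


relay = Relay(window_size=3)
for scn in range(1, 6):
    relay.capture(scn, f"member:{scn % 2}", f"v{scn}")

client = Client()
# Snapshot of the source as of scn 3 (the "consistent snapshot at U" role).
client.catch_up(relay, {"member:0": "v2", "member:1": "v3"}, snapshot_scn=3)
client.apply(4, "member:0", "stale")  # redelivered event is ignored
print(client.state, client.checkpoint)
```

Tracking a checkpoint on the consumer side is what lets at-least-once delivery behave like exactly-once application: a redelivered event is recognised by its sequence number and dropped.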

Page 15

Scaling Core Databases

[Diagram: three nodes labeled RO]

Page 16

Voldemort: Highly-Available Distributed KV Store

LinkedIn Data Infrastructure Solutions

Page 17

Scaling Core Databases

Page 18

Voldemort: Architecture

• Pluggable components
• Tunable consistency / availability
• Highly scalable key/value store

• 14 clusters, 400 nodes
• 400K peak QPS
• 100TB data
• 2~3ms avg latency
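"Tunable consistency / availability" in Dynamo-style key/value stores is commonly expressed through three numbers: the replication factor N, the number of replicas a read must reach (R) and the number a write must reach (W). The slide does not list these settings, so the sketch below is a general illustration of the quorum rule rather than Voldemort's configuration; the function name `quorums_overlap` is hypothetical.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares a replica with every write quorum."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    # Any R replicas and any W replicas must intersect when R + W > N.
    return r + w > n


# Favour consistency: reads always see the latest acknowledged write.
print(quorums_overlap(n=3, r=2, w=2))
# Favour availability and latency: reads may return stale data.
print(quorums_overlap(n=3, r=1, w=1))
```

Lowering R or W keeps the store available when replicas are down, at the cost of possibly stale reads; raising them does the opposite.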

Page 19

Scaling Core Databases

Secondary Index

Page 20

Espresso: Indexed Timeline-Consistent Distributed Data Store

LinkedIn Data Infrastructure Solutions

Page 21

Storage with Richer Data Model

Espresso

Page 22

Application View

Hierarchical data model

Rich functionality on resources
• Conditional updates
• Partial updates
• Atomic counters

Rich functionality within resource groups
• Transactions
• Secondary index
• Text search
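The resource-level features listed here (conditional updates, partial updates, atomic counters) can be illustrated with a small in-memory model. This is not Espresso's API: the `Resource` class, its version counter and its method names are assumptions made for the sketch.

```python
import threading


class Resource:
    """A single document with a version number used for optimistic checks."""

    def __init__(self, fields):
        self.fields = dict(fields)
        self.version = 1
        self._lock = threading.Lock()

    def conditional_update(self, expected_version, changes):
        """Apply changes only if nobody has written since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False
            # Partial update: fields not named in changes keep their values.
            self.fields.update(changes)
            self.version += 1
            return True

    def increment(self, field, delta=1):
        """Atomic counter: the read-modify-write happens under the lock."""
        with self._lock:
            self.fields[field] = self.fields.get(field, 0) + delta
            self.version += 1
            return self.fields[field]


profile = Resource({"headline": "Engineer", "views": 0})
print(profile.conditional_update(1, {"headline": "Staff Engineer"}))
print(profile.conditional_update(1, {"headline": "Lost update"}))
print(profile.increment("views"))
```

The second conditional update is rejected because it still carries the version it read before the first write landed, which is how a conditional update prevents lost updates.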

Page 23

Espresso: System Components

• Partitioning/replication
• Timeline consistency
• Change propagation

Page 24

Generic Cluster Manager: Helix

• Generic Distributed State Model
• Config Management
• Automatic Load Balancing
• Fault tolerance
• Cluster expansion and rebalancing

• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
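A "generic distributed state model" can be pictured as a set of allowed replica state transitions plus a controller that plans transitions when nodes fail. The sketch below assumes a master/slave model with OFFLINE, SLAVE and MASTER states; it does not use the Helix API, and the names `ALLOWED` and `plan_failover` are hypothetical.

```python
# Allowed transitions for a hypothetical master/slave state model.
ALLOWED = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}


def plan_failover(states, live_nodes):
    """Return the transitions needed so that one live node is MASTER.

    states maps node name to its current state for one partition;
    live_nodes is the subset of those nodes that are reachable.
    """
    if any(states[n] == "MASTER" for n in live_nodes):
        return []  # a live master already exists
    for node in sorted(live_nodes):
        if states[node] == "SLAVE":
            return [(node, "SLAVE", "MASTER")]
    for node in sorted(live_nodes):
        if states[node] == "OFFLINE":
            # An offline replica must pass through SLAVE before MASTER.
            return [(node, "OFFLINE", "SLAVE"), (node, "SLAVE", "MASTER")]
    return []


states = {"a": "MASTER", "b": "SLAVE", "c": "SLAVE"}
plan = plan_failover(states, live_nodes={"b", "c"})  # node "a" has failed
print(plan)
```

Keeping the transition table separate from the planning logic is what makes such a manager generic: a different system can supply a different table and reuse the same controller.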

Page 25

Streaming Non-transactional Events

Hadoop/DW

Espresso

Page 26

Kafka: High-Volume Low-Latency Messaging System

LinkedIn Data Infrastructure Solutions

Page 27

Ingress – Offline Data Analytics

Secured Hadoop/DW

Page 28

Kafka Architecture

[Diagram labels: Producer, Consumer, Zookeeper, Broker 1, Broker 2, Broker 3, Broker 4; partitions topic1-part1, topic1-part2, topic2-part1 and topic2-part2 each appear on several brokers]

Key features
• Scale-out architecture
• High throughput
• Automatic load balancing
• Intra-cluster replication

Per day stats
• writes: 10+ billion messages
• reads: 50+ billion messages
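The topic and partition layout in the diagram can be sketched as two small functions: one that spreads a topic's partitions over brokers and one that maps a message key to a partition. Both are illustrative assumptions: the CRC32 hash and the round-robin placement are not taken from the slides and are not Kafka's own partitioner or assignment logic, and replication is left out. The last lines turn the stated daily write volume into an average rate.

```python
import zlib


def assign_partitions(topic, num_partitions, brokers):
    """Place partitions on brokers round-robin (replication not modelled)."""
    return {
        f"{topic}-part{p + 1}": brokers[p % len(brokers)]
        for p in range(num_partitions)
    }


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


layout = assign_partitions("topic1", 2, ["Broker 1", "Broker 2"])
print(layout)
print(partition_for("member:42", 2))

# 10 billion writes per day expressed as an average per-second rate.
avg_writes_per_sec = 10_000_000_000 / 86_400
print(round(avg_writes_per_sec))
```

Because a key always maps to the same partition, messages for one key stay in order within that partition, while different partitions can be written and read in parallel.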

Page 29

Egress – Analytics Results for Online Serving

Secured Hadoop/DW

Page 30

WebHDFS + Faust

LinkedIn Data Infrastructure Solutions

Page 31

Egress – Getting Data Out from Offline

[Diagram labels: Secured Hadoop/DW, WebHDFS, Kafka, Faust]

Page 32

Batch Environment Data Flow

Page 33

Workflow management: Azkaban

Page 34

Read-only Data Generation and Transfer

• Map-reduce jobs generate RO files
• All index fits in memory for fast reads
• File system cache for data

• Data transferred in parallel via WebHDFS

• Authentication always required for each file transfer out of Hadoop
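The read path described above (a small index held in memory, values fetched from a data file that the file system cache keeps warm) can be sketched as follows. The actual on-disk format is not described in the slides, so the sorted-key index, the `build_store` and `lookup` names, and the use of an in-memory `BytesIO` in place of a real file are all assumptions.

```python
import bisect
import io


def build_store(records):
    """Build a read-only store: sorted keys, (offset, length) pairs, data file."""
    keys, offsets, data = [], [], io.BytesIO()
    for key in sorted(records):
        value = records[key]
        keys.append(key)
        offsets.append((data.tell(), len(value)))
        data.write(value)
    return keys, offsets, data


def lookup(keys, offsets, data, key):
    """Binary-search the in-memory index, then read the value from the file."""
    i = bisect.bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return None
    start, length = offsets[i]
    data.seek(start)
    return data.read(length)


keys, offsets, data = build_store(
    {b"member:2": b"bob", b"member:1": b"alice", b"member:3": b"carol"}
)
print(lookup(keys, offsets, data, b"member:2"))
print(lookup(keys, offsets, data, b"member:9"))
```

Because the files never change after the batch job writes them, the index needs no locking and each lookup costs one in-memory search plus one read.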

Page 35

Modifiable Data Generation and Transfer

• Map-reduce jobs generate records
• In Avro format
• Annotated key and value fields

• Records published from Hadoop to Kafka

• Faust consumes records from Kafka

• Faust streams records into Voldemort, Espresso, and other serving platforms

[Diagram labels: Faust; Monitoring, Throttling, Scheduling; Plug-ins: Kafka Plug-in, Databus Plug-in, V. Plug-in, E. Plug-in; Hadoop, Teradata/DWH, Kafka, Voldemort, Espresso, Other Data Sources]
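The flow on this slide (keyed records arrive from a queue and are routed through plug-ins into serving stores, with throttling applied along the way) can be sketched in a few lines. Faust's real interfaces are not shown in the slides, so everything here is hypothetical: the `DictStorePlugin` class, the `stream` function, the `store` field on each record and the `max_records` cap standing in for throttling.

```python
from collections import defaultdict


class DictStorePlugin:
    """Stand-in for a serving-store plug-in; writes go to an in-memory dict."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value


def stream(records, plugins, max_records=None):
    """Route each record to the plug-in named by its 'store' field."""
    delivered = defaultdict(int)
    for i, record in enumerate(records):
        if max_records is not None and i >= max_records:
            break  # crude throttle: stop after a fixed number of records
        plugins[record["store"]].put(record["key"], record["value"])
        delivered[record["store"]] += 1
    return dict(delivered)


plugins = {"voldemort": DictStorePlugin(), "espresso": DictStorePlugin()}
records = [
    {"store": "voldemort", "key": "member:1", "value": {"pymk": [2, 3]}},
    {"store": "espresso", "key": "member:1", "value": {"score": 0.9}},
    {"store": "voldemort", "key": "member:2", "value": {"pymk": [1]}},
]
print(stream(records, plugins, max_records=2))
```

Annotating each record with its key and destination is what lets one generic pipeline serve several stores: the batch job decides where data goes, and the plug-in decides how to write it.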

Page 36

Summary

1. E2E: The Big-Data feedback loop is essential for product design

2. Infrastructure

    1. Data Infra needs continuous innovation and iteration to scale out

    2. Fast moving, Big, Clean Data + Agile Metadata = Goodness

    3. Data-driven products need agile feedback infrastructure and measurement methodology.

3. Methodology

    1. Data-Driven experimentation enables insights and agile products

    2. Recommendation-driven products have big impact.

Read more @ data.linkedin.com

Page 37

Help us. Come Have Fun with Us!

Info: data.linkedin.com

1. Science and Data Mining: Recommendation and Optimization Problems

2. Next-generation ad-hoc and OLAP query processing on Hadoop

3. Graph Computations: Off-line mining and On-line integration loops

4. nRT Data Streams in Near-line infrastructure

5. And much more…

Page 38

In Closing

[email protected]

Thank You!
