Data Storage Infra at LinkedIn Yan Yan Staff software engineer

Post on 08-Jun-2022

Page 1: Data Storage Infra at LinkedIn

Data Storage Infra at LinkedIn

Yan Yan, Staff Software Engineer

Page 2: Data Storage Infra at LinkedIn

Today’s Agenda

1. LinkedIn Overview

2. Data Infra at LinkedIn

3. Espresso – Distributed Document Store

4. Ambry – Distributed Object Store

5. Venice – Derived Data Platform

6. Summary

Page 3: Data Storage Infra at LinkedIn

546 million users

> 100 million MAU

Over 200 countries

Page 4: Data Storage Infra at LinkedIn

ADVANCE MY CAREER

Get the right job

Page 5: Data Storage Infra at LinkedIn

ADVANCE MY CAREER

Build meaningful relationships

Page 6: Data Storage Infra at LinkedIn

ADVANCE MY CAREER

Establish & manage my reputation

Page 7: Data Storage Infra at LinkedIn

ADVANCE MY CAREER

Research & contact people

Page 8: Data Storage Infra at LinkedIn

ADVANCE MY CAREER

Stay well informed

Page 9: Data Storage Infra at LinkedIn

Data Infra at LinkedIn

Page 10: Data Storage Infra at LinkedIn

Common Data Patterns

Activity that should be reflected immediately

Online

Activity that should be reflected “soon”

Nearline

ETL processing, generally updated in batches

Offline

Page 11: Data Storage Infra at LinkedIn

UI

Business Service Layer

Data Service Layer

Event Buffer

Offline Storage

Online Data Storage

Streaming Pipeline

Offline Pipeline ETL

Nearline Data Storage

CDC

Data Analytics Platform

Page 12: Data Storage Infra at LinkedIn

Espresso: Distributed Document Store

LinkedIn Online Data Solution

Page 13: Data Storage Infra at LinkedIn

Why Espresso

Scalability vs. feature set

Gap:

• Consistency is important

• K-V model does not align with full application needs

• Full Oracle data model and query complexity not needed

• Build a new data store that is consistent, scalable, indexed, and richer than K-V

[Figure: chart of Feature Set against Scalability, showing Oracle, Voldemort, and Espresso in the gap between them]

Page 14: Data Storage Infra at LinkedIn

Espresso’s Design Goals

• Scalable and elastic

• Read-after-write consistency

• Structured data with schema

• Secondary indices

• Transactional updates to inter-related data

• Multi-datacenter support

• Seamless integration with nearline and offline systems

Page 15: Data Storage Infra at LinkedIn

Espresso - Architecture

[Architecture diagram. Components: Client; three Routers; three Storage Nodes, each with an API Server and MySQL; Helix and ZK; Kafka Data Replicator; Snapshot Service; Backup Storage; Streaming Process; Remote Data Center; Hadoop; Offline Data Center. Flow types within the data center: Online, Nearline, Offline, Control.]

Page 16: Data Storage Infra at LinkedIn

Espresso – RESTful API

Get: GET /database/table/resource_id

Create: PUT /database/table/resource_id {record}

Update: POST /database/table/resource_id {field:value}

Hierarchy Get: GET /database/table/resource_id/sub-resource_id

Query: GET /database/table/resource_id?query="field:pattern"
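The URL scheme above can be sketched as a small request builder. This is only an illustration: the function name `espresso_request`, its parameters, and the URL-encoding of the query string are assumptions, not part of Espresso's client library.

```python
from urllib.parse import quote, urlencode

# Hypothetical helper: maps the operations listed on the slide to (method, path, body).
METHODS = {"get": "GET", "create": "PUT", "update": "POST", "query": "GET"}


def espresso_request(op, database, table, resource_id,
                     sub_resource_id=None, query=None, body=None):
    if op not in METHODS:
        raise ValueError(f"unknown operation: {op}")
    parts = [database, table, resource_id]
    if sub_resource_id is not None:
        # Hierarchy Get: /database/table/resource_id/sub-resource_id
        parts.append(sub_resource_id)
    path = "/" + "/".join(quote(str(p), safe="") for p in parts)
    if op == "query":
        # Query: ?query="field:pattern" (URL-encoding here is an assumption)
        path += "?" + urlencode({"query": query})
    return METHODS[op], path, body
```

For example, `espresso_request("update", "db", "table", "id1", body={"field": "value"})` yields a POST to `/db/table/id1` carrying the partial record.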

Page 17: Data Storage Infra at LinkedIn

Transactional Updates

Update records sharing the same resource_id in different tables

• Multipost

  • /database/table1/id1 {field:value}

  • /database/table2/id1/sub-id1 {field:value}
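A minimal sketch of the all-or-nothing behaviour a multipost implies, using an in-memory dict in place of the real store. The function name `multipost`, the staging-by-copy approach, and the path parsing are assumptions for illustration only.

```python
import copy


def multipost(store, updates):
    """Apply every (path, fields) update or none of them.

    Paths follow the slide's shape: /database/table/resource_id[/sub-resource_id].
    """
    # All updates must target the same resource_id (third path component).
    resource_ids = {path.strip("/").split("/")[2] for path, _ in updates}
    if len(resource_ids) != 1:
        raise ValueError("all updates in a multipost must share one resource_id")

    # Stage changes on a copy so a failure leaves the original untouched.
    staged = copy.deepcopy(store)
    for path, fields in updates:
        staged.setdefault(path, {}).update(fields)

    # Commit: replace the contents of the live store in one step.
    store.clear()
    store.update(staged)
```

Restricting a transaction to one resource_id keeps all affected records in one partition, which is what makes a single local transaction sufficient.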

Page 18: Data Storage Infra at LinkedIn

Espresso – MySQL Mapping

[Diagram: an Espresso DB with Table1 and Table2 is split into partitions. MySQL Instance 1 holds es_identity_1, es_identity_2, es_identity_3, ……; MySQL Instance 2 holds es_identity_4, es_identity_5, es_identity_6, ……. Each partition contains Table1 and Table2.]

Distribute by partition key
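As a rough sketch of "distribute by partition key": hash the key to a partition number, then map the partition to a MySQL instance. The hash function (MD5), the fixed partitions-per-instance layout, and both function names are assumptions; only the `es_identity_N` and `MySQLInstanceN` naming comes from the slide.

```python
import hashlib


def partition_for(partition_key, num_partitions):
    # Stable hash so the same key always lands in the same partition.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def mysql_location(partition_key, num_partitions, partitions_per_instance):
    """Return (instance name, partition database name) for a key."""
    p = partition_for(partition_key, num_partitions)
    instance = p // partitions_per_instance + 1
    return f"MySQLInstance{instance}", f"es_identity_{p + 1}"
```

With 6 partitions and 3 per instance, partitions 1 to 3 map to MySQLInstance1 and 4 to 6 to MySQLInstance2, matching the diagram.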

Page 19: Data Storage Infra at LinkedIn

Espresso – Data Distribution

Node 1

P1 P2

P3 P4

Node 3

P1 P2

P3 P4

Node 2

P1 P2

P3 P4

Replica states: Master, Slave, Offline

Helix

Live instances: Node1, Node2, Node3

External view: P4: Master: Node1; Slave: Node2, Node3

Page 20: Data Storage Infra at LinkedIn

Espresso – Cluster Expansion

Node1

P1 P2

P3 P4

Node3

P1 P2

P3 P4

Node2

P1 P2

P3 P4

Helix

Live instances: Node1, Node2, Node3, Node4

External view: P1: Master: Node1; Slave: Node2, Node3; Offline: Node4

Node4

P1 P2

P3

Page 21: Data Storage Infra at LinkedIn

Espresso – Node Failover

Node1

P2

P3 P4

Node3

P1

P3 P4

Node2

P1 P2

P4

Helix

Live instances: Node1, Node2, Node3, Node4

External view: P1: Master: Node4; Slave: Node2, Node3

Node4

P1 P2

P3

Page 23: Data Storage Infra at LinkedIn

Espresso – Node Failover

Node1

P2

P3 P4

Node3

P1

P3 P4

Node2

P1 P2

P4

Helix

Live instances: Node1, Node2, Node4

External view: P1: Master: Node4; Slave: Node2

Node4

P1 P2

P3

Page 24: Data Storage Infra at LinkedIn

Espresso – Node Failover

Node1

P2

P3 P4

Node2

P1 P2

P4

Helix

Live instances: Node1, Node2, Node4

External view: P1: Master: Node4; Slave: Node2

Node4

P1 P2

P3

P1

P4

P3

Kafka
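The failover sequence in these slides can be approximated as a function from the current external view and the set of live instances to a new external view. This is a simplified sketch, not Helix's API: the dictionary layout, the state names, and the "promote the first surviving slave" rule are all assumptions.

```python
def rebalance_after_failure(external_view, live_instances):
    """Drop dead nodes from every partition and promote a slave if the master died."""
    new_view = {}
    for partition, states in external_view.items():
        master = states.get("MASTER")
        slaves = [n for n in states.get("SLAVE", []) if n in live_instances]
        if master not in live_instances:
            # Assumed policy: the first surviving slave becomes master.
            master = slaves.pop(0) if slaves else None
        new_view[partition] = {"MASTER": master, "SLAVE": slaves}
    return new_view
```

With P1 mastered on Node4 and Node3 removed from the live instances, the result is Master: Node4, Slave: Node2, as on the slide. Re-creating the lost replicas on other nodes is a separate step not covered here.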

Page 25: Data Storage Infra at LinkedIn

Ambry: Distributed Object Storage

LinkedIn Online Data Solution

Page 26: Data Storage Infra at LinkedIn

Object Storage Use Cases

Image, video, audio

Media

Docs, spreadsheets, slides

Documents

Database backup

Backup

JS, CSS, template

Static content

Page 27: Data Storage Infra at LinkedIn

Before Ambry

Media Server

• Monolithic

• Not scalable

• No full control

• Expensive

Page 28: Data Storage Infra at LinkedIn

Ambry

Distributed object storage system

• Immutable blobs

• Geo-distributed, horizontally scalable

• Unstructured data

• Multi-master

• Cost effective

Page 29: Data Storage Infra at LinkedIn

Ambry - Architecture

[Architecture diagram: DataCenter1 and DataCenter2 have the same layout. In each, clients (Ambry Client, CDNs, Http Client) reach three nodes that each run an Http service and a Routing service; behind them are a Cluster Manager and three Storage Services. The two data centers are linked by Cross-DC Replication.]

Page 30: Data Storage Infra at LinkedIn

Ambry - PUT Operation

[Diagram: Client, Http service, Routing service, and three Storage Services, with arrows numbered 1 to 5 matching the steps below]

1. PUT data

2. Choose partition and generate blob id

3. Write data to 3 replicas

4. Wait for at least 2 nodes to respond successfully

5. Reply blob id to client
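The PUT steps above amount to a quorum write. The sketch below is a simplified, sequential stand-in (a real router would issue the writes in parallel); the `Replica` class, `put_blob`, and the use of a random UUID as blob id are assumptions, not Ambry's implementation.

```python
import uuid


class Replica:
    """Toy storage replica; `healthy=False` simulates a failed node."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.blobs = {}

    def write(self, blob_id, data):
        if not self.healthy:
            return False
        self.blobs[blob_id] = data
        return True


def put_blob(replicas, data, required_acks=2):
    blob_id = uuid.uuid4().hex                                   # step 2: generate blob id
    acks = sum(1 for r in replicas if r.write(blob_id, data))    # step 3: write to replicas
    if acks < required_acks:                                     # step 4: need 2 of 3 successes
        raise IOError(f"only {acks} replica(s) acknowledged the write")
    return blob_id                                               # step 5: reply blob id
```

Requiring 2 of 3 acknowledgements lets a PUT succeed while one replica is down; the lagging replica can catch up later through replication.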

Page 31: Data Storage Infra at LinkedIn

Ambry - GET Operation

[Diagram: Client, Http service, Routing service, and three Storage Services, with arrows numbered 1 to 5 matching the steps below]

1. GET blob_id

2. Choose partition based on blob_id

3. Read from 2 replicas

4. Wait for at least 1 node to respond successfully

5. Reply data to client
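A corresponding sketch for the GET path: query two replicas and return the first successful answer. Replicas are modelled as plain dicts and `get_blob` with its `fanout` parameter is a hypothetical name; a real router would query replicas concurrently rather than in a loop.

```python
def get_blob(replicas, blob_id, fanout=2):
    """Read from `fanout` replicas; one successful response is enough."""
    for replica in replicas[:fanout]:
        data = replica.get(blob_id)
        if data is not None:
            return data
    raise KeyError(blob_id)
```

Because blobs are immutable, any replica that has the blob returns the same bytes, so a single response is sufficient.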

Page 32: Data Storage Infra at LinkedIn

Ambry - Large Blob PUT Operation

Large blob: split into blob1, blob2, …, blobN, plus a meta blob that holds blob_id1, blob_id2, …, blob_idN

[Diagram: Client, Routing service, and Storage Services. Writes are numbered 1, 2, 3, … through N+1, N+2, N+3, and a Blob_id is returned to the Client.]
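The large-blob scheme can be sketched as chunking plus a metadata blob that lists the chunk ids. The store is modelled as a dict, and the function names, the `chunk_ids` field, and the UUID-based ids are assumptions for illustration.

```python
import uuid


def put_large_blob(store, data, chunk_size):
    """Store `data` as fixed-size chunks plus one meta blob; return the meta blob id."""
    chunk_ids = []
    for start in range(0, len(data), chunk_size):
        chunk_id = uuid.uuid4().hex
        store[chunk_id] = data[start:start + chunk_size]
        chunk_ids.append(chunk_id)
    meta_id = uuid.uuid4().hex
    store[meta_id] = {"chunk_ids": chunk_ids}   # meta blob: ordered list of chunk ids
    return meta_id


def get_large_blob(store, meta_id):
    """Reassemble the original bytes by reading chunks in the recorded order."""
    return b"".join(store[cid] for cid in store[meta_id]["chunk_ids"])
```

The client only ever sees the meta blob's id, so large and small blobs look the same from the outside.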

Page 33: Data Storage Infra at LinkedIn

Replication

• Multi-master replication

• Asynchronous

• Pull based

• Intra-colo and cross-colo replication

Page 34: Data Storage Infra at LinkedIn

Ambry – Replication

Log: …… BlobId:50 (offset 640), BlobId:30 (offset 700), BlobId:70 (offset 770), BlobId:40 (offset 850); end offset 900

Journal

Offset   Blob id
640      50
700      30
770      70
850      40

Page 35: Data Storage Infra at LinkedIn

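Ambry – Replication

Node 1 log: …… BlobId:50 (640), BlobId:30 (700), BlobId:70 (770), BlobId:40 (850); end offset 900

Node 2 log: …… BlobId:50 (640), BlobId:30 (700), BlobId:80 (770), BlobId:90 (890); end offset 940

Node 1 pulls from Node 2:

1. Request: blobs from offset 700

2. Response: blob ids 30, 80, 90

3. Request: get blobs 80, 90

4. Response: blob data 80, 90

Node 1 appends BlobId:80 and BlobId:90 to its log.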

Page 36: Data Storage Infra at LinkedIn

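Ambry – Replication

Node 1 log: …… BlobId:50 (640), BlobId:30 (700), BlobId:70 (770), BlobId:40 (850), then BlobId:80 and BlobId:90 appended after offset 900

Node 2 log: …… BlobId:50 (640), BlobId:30 (700), BlobId:80 (770), BlobId:90 (890); end offset 940

Node 2 pulls from Node 1:

1. Request: blobs from offset 700

2. Response: blob ids 30, 70, 40, 80, 90

3. Request: get blobs 70, 40

4. Response: blob data 70, 40

Node 2 appends BlobId:70 and BlobId:40 to its log.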
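The pull-based exchange shown in the replication slides can be sketched as follows. The journal is a list of (offset, blob id) pairs and the blob store is a dict; the function names and data shapes are assumptions, not Ambry's replication protocol.

```python
def blob_ids_from(journal, offset):
    """Blob ids written at or after `offset`, in log order."""
    return [blob_id for off, blob_id in journal if off >= offset]


def pull_from_peer(local_blobs, peer_journal, peer_blobs, from_offset):
    """Fetch blobs the peer wrote since `from_offset` that are missing locally."""
    candidates = blob_ids_from(peer_journal, from_offset)          # "blobs from offset"
    missing = [b for b in candidates if b not in local_blobs]      # compare ids first
    for blob_id in missing:                                        # then fetch only the data needed
        local_blobs[blob_id] = peer_blobs[blob_id]
    return missing
```

Exchanging ids before data keeps the transfer limited to blobs the puller does not already hold, which matters when both nodes accept writes.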

Page 37: Data Storage Infra at LinkedIn

Venice: Derived Data Platform

LinkedIn Nearline + Offline Data Solution

Page 38: Data Storage Infra at LinkedIn

Kinds of Data

Primary Data

• Source of Truth

• Example use case:

  • Profile

• Example systems:

  • SQL

  • Document Stores

  • K-V Stores

Derived Data

• Derived from computing primary data

• Example use case:

  • People You May Know

• Example systems:

  • Search Indices

  • Graph Databases

  • K-V Stores

Page 39: Data Storage Infra at LinkedIn

Derived Data Lifecycle Today

Apps

Events Buffer

Offline Storage

Batch Jobs

Online Storage

Stream Processing

Page 40: Data Storage Infra at LinkedIn

Lambda Architecture + Venice

Stream Processing

Batch Processing

App

Kafka

Hadoop

Page 41: Data Storage Infra at LinkedIn

Venice: Features

• Dataset versioning

• High throughput ingestion from Hadoop and Samza

• Automatic cluster management

• Multi-DC, Multi-Cluster, Multi-Tenant

• Run as a service

Page 42: Data Storage Infra at LinkedIn

Venice Data Model

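[Diagram: Store A has versions V1, V2, and Version 3. A version is divided into Partition 1, Partition 2, ……, and each partition has replicas R1, R2, R3. Store B …… is shown alongside.]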

• Store

• Version

• Partition

• Replica

• Record

• Avro

Page 43: Data Storage Infra at LinkedIn

Venice Components

[Diagram. Components: Client, Router, Storage Node, Controller, Hadoop Push Job, Samza. Arrow types: push data flow, metadata operation, read data flow.]

Page 44: Data Storage Infra at LinkedIn

Venice Batch Mode

[Diagram with steps numbered 1 to 8. Components: Hadoop (Azkaban Job with Map and Reduce stages), Venice Controllers, Kafka Cluster, Storage Nodes, Venice Router.]

Page 45: Data Storage Infra at LinkedIn

Venice Version Swapping

[Diagram columns: Data Source, Kafka Topics, Venice Processes. Data source: Hadoop Push Job. Store versions shown: Store v6, Store v7, Store v8. Also shown: Router.]
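One way to picture version swapping: a push loads a complete new version next to the serving one, and reads switch over only once the load is verified. The sketch below is an assumption-laden toy; the class `VersionedStore`, the record-count check, and the "keep current plus one previous version" retention rule are not taken from Venice.

```python
class VersionedStore:
    def __init__(self):
        self.versions = {}    # version number -> {key: value}
        self.current = None   # version that reads are served from

    def push(self, version, records, expected_count):
        loaded = dict(records)
        if len(loaded) != expected_count:
            # Incomplete push: reads keep using the current version.
            raise ValueError("incomplete push; keeping current version")
        self.versions[version] = loaded
        self.current = version                      # the swap
        # Assumed retention: keep the current and the immediately previous version.
        for old in [v for v in self.versions if v < version - 1]:
            del self.versions[old]

    def get(self, key):
        return self.versions[self.current][key]
```

Keeping the previous version around makes rollback a pointer change rather than a re-push.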

Page 47: Data Storage Infra at LinkedIn

Venice Hybrid Mode

Goals

• Merge batch and streaming data

• Minimize application complexity

• Multi-version support

Write-time merge

• Hadoop writes into store-version topics

• Samza writes into a Real-Time Buffer topic (RTB)

• The RTB gets replayed into store-version topics
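The write-time merge described above can be sketched with lists standing in for Kafka topics: batch records go into the store-version topic first, then real-time buffer entries from some offset onward are replayed on top. The function name, the `(offset, key, value)` tuple shape, and the `replay_from` parameter are assumptions.

```python
def build_store_version(batch_records, rtb, replay_from):
    """Materialize one store version from a batch push plus replayed RTB entries.

    batch_records: list of (key, value) written by the Hadoop push.
    rtb:           list of (offset, key, value) written by Samza.
    replay_from:   RTB offset from which entries are replayed.
    """
    version_topic = list(batch_records)
    version_topic += [(k, v) for off, k, v in rtb if off >= replay_from]

    store = {}
    for key, value in version_topic:   # later writes overwrite earlier ones
        store[key] = value
    return store
```

Since the merge happens on the write path, readers see a single key-value view and need no batch-versus-stream logic of their own.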

Page 48: Data Storage Infra at LinkedIn

Venice Hybrid Mode

[Diagram columns: Data Sources, Kafka Topics, Venice Processes. Data sources: Samza and a Hadoop Push Job. Store versions shown: Store v7, Store v8. Also shown: Router.]

Page 49: Data Storage Infra at LinkedIn

Venice Hybrid Mode

[Diagram columns: Data Sources, Kafka Topics, Venice Processes. Data sources: Samza and Hadoop. Store versions shown: Store v7, Store v8. Also shown: Router.]

Page 50: Data Storage Infra at LinkedIn

Summary

Espresso

• Document store

• Online data

• Get/Put/Transactional

• Expansion and failover

Ambry

• Object store

• Online immutable data

• Get/Put/Large blob PUT

• Multi-master replication

Venice

• K-V store

• Derived data

• Get/Push

• Batch + streaming

Page 51: Data Storage Infra at LinkedIn

Learn more: engineering.linkedin.com/blog

Page 52: Data Storage Infra at LinkedIn

Backup slides

Page 53: Data Storage Infra at LinkedIn

Online Data

• Member Profile Update

• Post to a Group

• Social Gestures (Comment/Like/Share)

Page 54: Data Storage Infra at LinkedIn

Nearline Data

• Standardization

• Search Index Update

• Network Update Stream

Page 55: Data Storage Infra at LinkedIn

Offline Data

• People You May Know

• Who Viewed My Profile

• Jobs You May Be Interested In

Page 56: Data Storage Infra at LinkedIn

Why Espresso

Oracle

• Difficult/expensive to run at Internet scale

• Structured data schema

• Strong consistency support

Voldemort

• Simpler data model (K,V)

• Write availability

• Eventual consistency

• Scales well and cheaply

Page 57: Data Storage Infra at LinkedIn

Espresso - Architecture

Page 58: Data Storage Infra at LinkedIn

Espresso – Cross-DC Replication

[Diagram: Datacenter 1 and Datacenter 2 have the same layout. Each contains Clients, a Router, three Storage Nodes, a Kafka Cluster, and a Data Replicator.]

Page 59: Data Storage Infra at LinkedIn

Cross-DC Replication

• Boomerang elimination

• Conflict resolution

  • Last write wins

• Unique id generation

  • User-selectable options

• Data consistency checker
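Two of the ideas above, boomerang elimination and last-write-wins, can be illustrated in a few lines. The write representation `(timestamp, origin_dc, value)`, both function names, and the tie-break on data center name are assumptions for this sketch.

```python
def should_apply(write, local_dc):
    """Boomerang elimination: ignore replicated writes that originated locally."""
    _, origin_dc, _ = write
    return origin_dc != local_dc


def resolve_last_write_wins(local, remote):
    """Pick the write with the larger timestamp; ties break on origin DC name
    (an assumed rule, so that every data center picks the same winner)."""
    return max(local, remote, key=lambda w: (w[0], w[1]))
```

A deterministic tie-break matters because each data center resolves the conflict independently and all of them must converge on the same value.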

Page 60: Data Storage Infra at LinkedIn

Ambry – Data Distribution

P1 P2 P3

Node1

P3 P1 P2

Node2

P2 P3 P1

Node3

Partition   Status

1 Read-write

2 Read-write

3 Read-write

Page 61: Data Storage Infra at LinkedIn

Ambry – Cluster Expansion

P1 P2 P3

Node1

P3 P1 P2

Node2

P2 P3 P1

Node3

Partition   Status

1 Read-only

2 Read-only

3 Read-write

4 Read-write

5 Read-write

6 Read-write

P4 P5 P6

Node4

P6 P4 P5

Node5

P5 P6 P4

Node6

Page 62: Data Storage Infra at LinkedIn

Ambry – Storage Layout

Log (400GB): …… BlobId:50 (offset 640), BlobId:30 (offset 700), BlobId:70 (offset 770), BlobId:40 (offset 850); log end offset 900

Index: Index segment 1, Index segment 2, Index segment 3

Index segment 3 (start offset: 700, end offset: 900; sorted by blob id)

blob id   offset   TTL
id 30     700      ∞
id 40     850      1/1/16
id 70     770      ∞
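A lookup against this layout might work as below: each index segment keeps its entries sorted by blob id, so a binary search finds the log offset, and segments are checked newest first. The class and function names are assumptions, as is the start offset used for the older segment in the usage note.

```python
import bisect


class IndexSegment:
    def __init__(self, start_offset, entries):
        self.start_offset = start_offset
        self.entries = sorted(entries)              # (blob_id, log_offset), sorted by blob id
        self._ids = [blob_id for blob_id, _ in self.entries]

    def find(self, blob_id):
        """Binary search for blob_id; return its log offset or None."""
        i = bisect.bisect_left(self._ids, blob_id)
        if i < len(self._ids) and self._ids[i] == blob_id:
            return self.entries[i][1]
        return None


def find_offset(segments, blob_id):
    """Search segments from newest to oldest."""
    for segment in reversed(segments):
        offset = segment.find(blob_id)
        if offset is not None:
            return offset
    return None
```

Using the slide's numbers, a segment starting at offset 700 with entries for ids 30, 70, and 40 resolves id 40 to offset 850. The log itself stays append-only; only the index is sorted.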

Page 63: Data Storage Infra at LinkedIn

Storage Optimization

• O(1) I/O for writes

• Bloom filter for index segments

• Rely on OS page cache

• Zero copy for gets

Page 64: Data Storage Infra at LinkedIn

Venice: Read/Write API

• Derived data K-V store

  • Single Get

  • Batch Get

• High throughput ingestion from:

  • Hadoop

  • Samza

  • Or both (hybrid)

Page 65: Data Storage Infra at LinkedIn

Venice: Scale

• Large scale

  • Multi-Datacenter

  • Multi-Cluster

• Run “as a service”

  • Self-service onboarding

  • Each cluster is multi-tenant

  • Resource isolation

Page 66: Data Storage Infra at LinkedIn

Venice: Tradeoffs

All writes go through Kafka

• Scalable

• Burst tolerant

• Asynchronous

• No native “read your writes” semantics

Page 67: Data Storage Infra at LinkedIn

Global Replication

[Diagram. Components: Hadoop Push Job, Parent Controller, Controller, Mirror Maker (three instances), Storage Nodes. A datacenter boundary separates the sites.]