
Data Storage Infra at LinkedIn

Yan Yan, Staff Software Engineer

Today’s Agenda

1. LinkedIn Overview

2. Data Infra at LinkedIn

3. Espresso – Distributed Document Store

4. Ambry – Distributed Object Store

5. Venice – Derived Data Platform

6. Summary

546 million users
> 100 million MAU

Over 200 countries

ADVANCE MY CAREER

• Get the right job
• Build meaningful relationships
• Establish & manage my reputation
• Research & contact people
• Stay well informed

Data Infra at LinkedIn

Common Data Patterns

• Online – activity that should be reflected immediately
• Nearline – activity that should be reflected “soon”
• Offline – ETL processing, generally updated in batches

[Architecture diagram: UI → Business Service Layer → Data Service Layer → Online Data Storage; an Event Buffer feeds a Streaming Pipeline into Nearline Data Storage; CDC and an Offline ETL Pipeline feed Offline Storage and the Data Analytics Platform]

Espresso: Distributed Document Store – LinkedIn Online Data Solution

Why Espresso

Scalability vs. feature set – the gap:

• Consistency is important
• A K-V model does not align with full application needs
• The full Oracle data model and query complexity are not needed
• Build a new data store that is consistent, scalable, indexed, and richer than K-V

[Chart: feature set vs. scalability – Oracle is feature-rich but hard to scale, Voldemort scales well but has a minimal feature set; Espresso fills the gap between them]

Espresso’s Design Goals

• Scalable and elastic

• Read-after-write consistency

• Structured data with schemas

• Secondary indexes

• Transactional updates to inter-related data

• Multi-datacenter support

• Seamless integration with nearline and offline systems

Espresso - Architecture

[Architecture diagram: clients send requests through a tier of routers to storage nodes; each storage node runs an API server on top of MySQL. ZooKeeper and Helix provide cluster control. Kafka and a Data Replicator ship changes to the remote data center (nearline), and a Snapshot Service copies data to backup storage and to Hadoop in the offline data center]

Espresso – RESTful API

• Get: GET /database/table/resource_id
• Create: PUT /database/table/resource_id {record}
• Update: POST /database/table/resource_id {field:value}
• Hierarchy get: GET /database/table/resource_id/sub-resource_id
• Query: GET /database/table/resource_id?query=“field:pattern”
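These verbs map onto ordinary HTTP calls. A minimal Python sketch (hostname, database, and table names are invented here; real Espresso endpoints are internal to LinkedIn):

```python
# Illustrative Espresso-style REST calls; endpoint names are assumptions.
import requests

BASE = "http://espresso.example.com"  # hypothetical router address

# Create: PUT a full record
requests.put(f"{BASE}/MemberDB/Profiles/1234",
             json={"firstName": "Ada", "lastName": "Lovelace"})

# Get: read the record back
resp = requests.get(f"{BASE}/MemberDB/Profiles/1234")
print(resp.json())

# Update: POST only the changed fields
requests.post(f"{BASE}/MemberDB/Profiles/1234", json={"lastName": "King"})

# Hierarchy get: sub-resources share the parent's resource_id
requests.get(f"{BASE}/MemberDB/Positions/1234/1")

# Query: secondary-index lookup scoped to a resource_id
requests.get(f"{BASE}/MemberDB/Profiles/1234",
             params={"query": "lastName:King"})
```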

Transactional Updates

Multipost – update records sharing the same resource_id in different tables (sketched below):

• /database/table1/id1 {field:value}

• /database/table2/id1/sub-id1 {field:value}
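The slides do not show the multipost wire format, so this sketch invents one (a JSON list of path/record pairs posted to a hypothetical multipost endpoint). The point it illustrates is that both records share resource_id id1, and therefore the same partition, which is what makes a single-transaction commit possible:

```python
import requests

BASE = "http://espresso.example.com"  # hypothetical router address

# Both writes target resource_id "id1", so they land in the same
# partition and can commit as one local transaction.
batch = [
    {"path": "/MemberDB/table1/id1",         "record": {"field": "value"}},
    {"path": "/MemberDB/table2/id1/sub-id1", "record": {"field": "value"}},
]
requests.post(f"{BASE}/MemberDB/multipost", json=batch)  # endpoint name invented
```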

Espresso – MySQL Mapping

[Diagram: an Espresso database's tables (Table1, Table2) are split into partitions es_identity_1 … es_identity_6; each MySQL instance hosts a subset of the partitions, e.g. es_identity_1–3 on MySQLInstance1 and es_identity_4–6 on MySQLInstance2]

Records are distributed by partition key.
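A sketch of that routing: hash the partition key to an es_identity partition, then look up which MySQL instance hosts it. The hash choice is an assumption; the partition-to-instance layout follows the diagram:

```python
# Illustrative partition-key routing; not LinkedIn's actual implementation.
import hashlib

NUM_PARTITIONS = 6          # es_identity_1 .. es_identity_6 in the diagram
MYSQL_INSTANCES = {         # partition -> MySQL instance, per the diagram
    1: "Instance1", 2: "Instance1", 3: "Instance1",
    4: "Instance2", 5: "Instance2", 6: "Instance2",
}

def partition_for(resource_id: str) -> int:
    digest = hashlib.md5(resource_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS + 1

def mysql_instance_for(resource_id: str) -> str:
    return MYSQL_INSTANCES[partition_for(resource_id)]

# Every read and write for a given key deterministically hits one instance.
print(mysql_instance_for("member:1234"))
```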

Espresso – Data Distribution

[Diagram: partitions P1–P4 are spread across Node1–Node3, each partition with one Master replica, Slave replicas, and possibly an Offline replica; e.g. P4 – Master: Node1, Slaves: Node2, Node3. Helix tracks the live instances (Node1–Node3) and publishes the resulting external view]
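Routers resolve a partition to its current master through the external view. A toy lookup mirroring the slide's P4 example (the real view lives in ZooKeeper and is pushed to routers by Helix):

```python
# Router-side lookup against a Helix-style external view (structure assumed).
EXTERNAL_VIEW = {
    "P4": {"Node1": "MASTER", "Node2": "SLAVE", "Node3": "SLAVE"},
}

def master_for(partition: str) -> str:
    states = EXTERNAL_VIEW[partition]
    return next(node for node, state in states.items() if state == "MASTER")

print(master_for("P4"))  # Node1 -- writes for P4 go here
```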

Espresso – Cluster Expansion

[Diagram: Node4 is added to the cluster and partitions are rebalanced onto it; during the move a partition's state reads e.g. P1 – Master: Node1, Slaves: Node2, Node3, Offline: Node4. Helix sees live instances Node1–Node4 and updates the external view]

Espresso – Node Failover

[Diagram sequence, shown over four slides: Node3 fails. Helix drops it from the live-instance list (leaving Node1, Node2, Node4), promotes a surviving replica for each partition Node3 mastered (e.g. P1 – Master: Node4, Slave: Node2), and publishes the new external view. Node3's replicas (P1, P3, P4) are then rebuilt on the remaining nodes by catching up from Kafka]

Ambry: Distributed Object Storage – LinkedIn Online Data Solution

Object Storage Use Cases

• Media – images, video, audio
• Documents – docs, spreadsheets, slides
• Backup – database backups
• Static content – JS, CSS, templates

Before Ambry

Media server:

• Monolithic

• Not scalable

• No full control

• Expensive

Ambry

Distributed object storage system:

• Immutable blobs

• Geo-distributed, horizontally scalable

• Unstructured data

• Multi-master

• Cost effective

Ambry - Architecture

[Architecture diagram: Ambry clients, CDNs, and plain HTTP clients talk to a frontend tier (HTTP service + routing service) in each data center; a Cluster Manager tracks the storage services that hold the data. Datacenter1 and Datacenter2 are kept in sync by cross-DC replication]

Ambry - PUT Operation

[Diagram: client → HTTP service → routing service → three storage services (steps 1–5 below)]

1. PUT data
2. Choose a partition and generate a blob id
3. Write the data to 3 replicas
4. Wait for at least 2 nodes to respond successfully
5. Reply with the blob id to the client
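A sketch of steps 2–5 as quorum-write logic (replica I/O is stubbed out; the blob-id format is invented):

```python
# 2-of-3 write quorum, as described in the steps above.
import uuid
from concurrent.futures import ThreadPoolExecutor, as_completed

REPLICAS = ["storage-a", "storage-b", "storage-c"]

def write_replica(node: str, blob_id: str, data: bytes) -> bool:
    ...  # network write to one storage node (stubbed)
    return True

def put_blob(data: bytes) -> str:
    blob_id = f"partition7:{uuid.uuid4()}"       # step 2: pick partition, mint id
    with ThreadPoolExecutor(len(REPLICAS)) as pool:
        futures = [pool.submit(write_replica, n, blob_id, data)
                   for n in REPLICAS]            # step 3: write all 3 replicas
        acks = 0
        for f in as_completed(futures):
            acks += bool(f.result())
            if acks >= 2:                        # step 4: quorum of 2 suffices
                return blob_id                   # step 5: reply to client
    raise IOError("fewer than 2 replicas acknowledged the write")
```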

Ambry - GET Operation

[Diagram: client → HTTP service → routing service → storage services (steps 1–5 below)]

1. GET blob_id
2. Choose the partition based on the blob_id
3. Read from 2 replicas
4. Wait for at least 1 node's successful response
5. Reply with the data to the client
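The matching read path – fan out to 2 of the partition's replicas and return the first success (again with stubbed I/O):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

REPLICAS_BY_PARTITION = {"partition7": ["storage-a", "storage-b", "storage-c"]}

def read_replica(node: str, blob_id: str) -> bytes:
    ...  # network read from one storage node (stubbed)
    return b"blob-bytes"

def get_blob(blob_id: str) -> bytes:
    partition = blob_id.split(":")[0]                # step 2: partition is in the id
    replicas = REPLICAS_BY_PARTITION[partition][:2]  # step 3: read from 2 replicas
    with ThreadPoolExecutor(2) as pool:
        futures = [pool.submit(read_replica, n, blob_id) for n in replicas]
        for f in as_completed(futures):              # step 4: first success wins
            try:
                return f.result()                    # step 5: reply to client
            except OSError:
                continue
    raise IOError("no replica returned the blob")
```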

Ambry - Large Blob PUT Operation

[Diagram: a large blob is split into chunks blob1 … blobN. The routing service writes each chunk to the storage services as its own blob (steps 1 … N+1), then writes a meta blob listing the chunk ids (blob_id1, blob_id2 … blob_idN) and returns its blob_id to the client (steps N+2, N+3)]
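A sketch of the chunking flow, reusing put_blob from the PUT sketch above (chunk size and metadata encoding are assumptions, not Ambry's wire format):

```python
import json

CHUNK_SIZE = 4 * 1024 * 1024    # assumed 4 MB chunks

def put_large_blob(data: bytes) -> str:
    # Steps 2..N+1: store each chunk as its own blob.
    chunk_ids = [put_blob(data[i:i + CHUNK_SIZE])
                 for i in range(0, len(data), CHUNK_SIZE)]
    # Steps N+2/N+3: store the meta blob; its id is what the client keeps.
    meta = json.dumps({"chunks": chunk_ids}).encode()
    return put_blob(meta)
```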

Replication

• Multi-master replication

• Asynchronous

• Pull based

• Intra-colo and cross-colo replication

Ambry – Replication

[Diagram: the log stores blobs back-to-back – BlobId:50 at offset 640, BlobId:30 at 700, BlobId:70 at 770, BlobId:40 at 850; the log ends at offset 900. A journal records the write order:]

Offset  Blob id
640     50
700     30
770     70
850     40

Ambry – Replication

[Diagram: Node1's journal holds blob ids 50, 30, 70, 40 (offsets 640, 700, 770, 850); Node2's holds 50, 30, 80, 90 (offsets 640, 700, 770, 890). Node1 asks Node2 for blobs from offset 700 and gets back blob ids 30, 80, 90; it already has 30, so it fetches blobs 80 and 90 and appends them to its log]

Ambry – Replication

[Diagram: the pull runs in the other direction too – Node2 asks Node1 for blobs from offset 700, gets back blob ids 30, 70, 40, 80, 90, and fetches the two it is missing, blobs 70 and 40. After both pulls the nodes converge on the same blob set]
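A toy version of the exchange, with journals as plain offset-to-blob-id dicts (the real protocol batches requests and checkpoints sync offsets):

```python
# Pull-based journal exchange between two replicas, using the slides' numbers.
node1_journal = {640: 50, 700: 30, 770: 70, 850: 40}   # offset -> blob id
node2_journal = {640: 50, 700: 30, 770: 80, 890: 90}

def missing_blobs(local: dict, remote: dict, last_synced_offset: int) -> set:
    # 1. Ask the remote for every blob id journaled at/after our sync point.
    remote_ids = {bid for off, bid in remote.items() if off >= last_synced_offset}
    # 2. Anything not already in the local log must be fetched.
    return remote_ids - set(local.values())

print(missing_blobs(node1_journal, node2_journal, 700))  # {80, 90}: Node1 pulls these
print(missing_blobs(node2_journal, node1_journal, 700))  # {70, 40}: Node2 pulls these
```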

Venice: Derived Data Platform – LinkedIn Nearline + Offline Data Solution

Kinds of Data

Primary data – the source of truth
• Example use case: Profile
• Example systems: SQL, document stores, K-V stores

Derived data – computed from primary data
• Example use case: People You May Know
• Example systems: search indices, graph databases, K-V stores

Derived Data Lifecycle Today

[Diagram: apps emit events into an events buffer; events land in offline storage, where batch jobs compute derived data and load it into online storage; stream processing consumes the buffer directly for the nearline path]

Lambda Architecture + Venice

[Diagram: stream processing (via Kafka) and batch processing (via Hadoop) both feed Venice, which serves the app]

Venice Features

• Dataset versioning

• High-throughput ingestion from Hadoop and Samza

• Automatic cluster management

• Multi-DC, Multi-Cluster, Multi-Tenant

• Run as a service

Venice Data Model

[Diagram: Store A has versions V1, V2, V3; each version is split into partitions; each partition (e.g. Partition 1) has replicas R1, R2, R3 spread over the storage nodes]

• Store
• Version
• Partition
• Replica
• Record (Avro)
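The hierarchy as plain data (the store_vN topic naming shown here is an assumption for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Partition:
    id: int
    replicas: list[str]                  # storage nodes holding R1..R3

@dataclass
class StoreVersion:
    store: str
    version: int
    partitions: list[Partition] = field(default_factory=list)

    @property
    def kafka_topic(self) -> str:        # one topic per store-version (assumed)
        return f"{self.store}_v{self.version}"

v3 = StoreVersion("StoreA", 3, [Partition(1, ["node1", "node2", "node3"])])
print(v3.kafka_topic)                    # StoreA_v3
```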

Venice Components

[Diagram: a Hadoop push job and Samza feed data into the storage nodes (push data flow); the controller handles metadata operations; the client reads through the router to the storage nodes (read data flow)]

Venice Batch Mode

[Diagram: an Azkaban job launches a Hadoop map-reduce push job that writes the dataset into a Kafka cluster; the Venice controllers coordinate the push, the storage nodes consume the topic, and once ingestion finishes the Venice router serves reads from the new data (steps 1–8 in the original slide)]

Venice Version Swapping

[Diagram sequence, shown over two slides: the Hadoop push job writes Store v8 into its own Kafka topic while the router keeps serving Store v7; when the push completes, the router swaps to v8 and the oldest version (v6) is retired]

Venice Hybrid Mode

Goals:

• Merge batch and streaming data
• Minimize application complexity
• Multi-version support

Write-time merge (see the sketch after the diagrams below):

• Hadoop writes into store-version topics
• Samza writes into a Real-Time Buffer topic (RTB)
• The RTB gets replayed into store-version topics

Venice Hybrid Mode

[Diagram sequence, shown over two slides: Samza writes into the RTB topic while the Hadoop push job writes Store v8; the RTB is replayed into the store-version topics so the router can swap from v7 to v8 without losing nearline writes]
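A minimal sketch of the RTB replay step, assuming the kafka-python client and invented topic names:

```python
# Drain the real-time buffer topic into the current store-version topic.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("StoreA_rt", bootstrap_servers="kafka:9092")
producer = KafkaProducer(bootstrap_servers="kafka:9092")

for msg in consumer:
    # Records from Samza land in the RTB; replaying them into the
    # versioned topic lets nearline writes overlay the batch push.
    producer.send("StoreA_v8", key=msg.key, value=msg.value)
```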

Summary

Espresso
• Document store
• Online data
• Get/Put/Transactional
• Expansion and failover

Ambry
• Object store
• Online immutable data
• Get/Put/Large blob PUT
• Multi-master replication

Venice
• K-V store
• Derived data
• Get/Push
• Batch + streaming

Learn more: engineering.linkedin.com/blog

Backup Slides

Online Data

• Member Profile Update

• Post to a Group

• Social Gestures (Comment/Like/Share)

Nearline Data

• Standardization

• Search Index Update

• Network Update Stream

Offline Data

• People You May Know

• Who Viewed My Profile

• Jobs You May Be Interested In

Why Espresso

Oracle
• Difficult/expensive to run at Internet scale
• Structured data schema
• Strong consistency support

Voldemort
• Simpler data model (K-V)
• Write availability
• Eventual consistency
• Scales well and cheaply


Espresso – Cross-DC Replication

[Diagram: Datacenter 1 and Datacenter 2 each have clients, a router tier, storage nodes, a Kafka cluster, and a Data Replicator; each Data Replicator consumes the other datacenter's Kafka stream and applies the remote writes locally]

Cross-DC Replication

• Boomerang elimination
• Conflict resolution – last write wins (see the sketch below)
• Unique id generation – user-selectable options
• Data consistency checker
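A sketch of last-write-wins resolution; the (timestamp, origin DC) tiebreak is an assumption to keep the rule deterministic when timestamps collide:

```python
# Last-write-wins merge for cross-DC replicated records.
def resolve(local: dict, remote: dict) -> dict:
    local_key  = (local["timestamp"],  local["origin_dc"])
    remote_key = (remote["timestamp"], remote["origin_dc"])
    return remote if remote_key > local_key else local

a = {"value": "x", "timestamp": 100, "origin_dc": "DC1"}
b = {"value": "y", "timestamp": 105, "origin_dc": "DC2"}
print(resolve(a, b)["value"])   # "y" -- the later write wins
```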

Ambry – Data Distribution

[Diagram: partitions P1–P3 are each replicated on Node1, Node2, and Node3]

Partition  Status
1          Read-write
2          Read-write
3          Read-write

Ambry – Cluster Expansion

[Diagram: Node4–Node6 are added, hosting new partitions P4–P6; the original partitions P1–P3 remain on Node1–Node3]

Partition  Status
1          Read-only
2          Read-only
3          Read-write
4          Read-write
5          Read-write
6          Read-write

Ambry – Storage Layout

[Diagram: a 400 GB log stores blobs back-to-back – BlobId:50 at offset 640, BlobId:30 at 700, BlobId:70 at 770, BlobId:40 at 850; the log end offset is 900. The index is a chain of segments (Index segment1, segment2, segment3); the current segment covers start offset 700 to log end offset 900 and is sorted by blob id:]

Blob id  Offset  TTL
id 30    700     ∞
id 40    850     1/1/16
id 70    770     ∞
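A sketch of a blob lookup against such a segment, newest segment first, with a Bloom-filter check before searching (a plain set stands in for the real Bloom filter):

```python
import bisect

class IndexSegment:
    def __init__(self, entries):                  # entries: (blob_id, offset, ttl)
        self.entries = sorted(entries)            # kept sorted by blob id
        self.ids = [e[0] for e in self.entries]
        self.bloom = set(self.ids)                # stand-in for a real Bloom filter

    def find(self, blob_id):
        if blob_id not in self.bloom:             # cheap negative check
            return None
        i = bisect.bisect_left(self.ids, blob_id) # O(log n) within the segment
        return self.entries[i]

segment3 = IndexSegment([(30, 700, None), (40, 850, "1/1/16"), (70, 770, None)])
print(segment3.find(40))   # (40, 850, '1/1/16'): read the log at offset 850
```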

Storage Optimization

• O(1) I/O for writes

• Bloom filter for index segments

• Rely on OS page cache

• Zero copy for gets

Venice Read/Write API

• Derived-data K-V store: single Get, batch Get
• High-throughput ingestion from Hadoop, Samza, or both (hybrid)

Venice Scale

• Large scale: multi-datacenter, multi-cluster
• Run “as a service”: self-service onboarding
• Each cluster is multi-tenant
• Resource isolation

Venice Tradeoffs

All writes go through Kafka:

• Scalable
• Burst tolerant
• Asynchronous
• No native “read your writes” semantics

Global Replication

[Diagram: a Hadoop push job coordinates with a parent controller; Mirror Maker copies the pushed data across the datacenter boundary into each datacenter's Kafka, where the local Venice controller and storage nodes ingest it]
