
The Data Driven Network

Kapil Surlaker Director of Engineering

Real-Time Data at LinkedIn

Kapil Surlaker and Shirshanka Das XLDB 2016

What does real-time data mean?

Real-time Ingestion

Stream Processing

Real-time Serving

Batch Processing

Real-time Data: Use Cases

- In-product analytics
- Reporting
- Features for ML models (impression discounting)
- Standardization
- Monitoring, Alerting

Member-Facing

Business-Facing

corp.linkedin.com

Transform

linkedin.com

Every pipeline looks like this

Ingest Process Serve [Create]

The Unified Metrics Platform

How
- Single source of truth for metrics
- Centralized multi-tenant pipeline
- Easy onboarding

Why
- Disjointed efforts, unreliable systems
- Fragmented data pipelines across online and offline
- Unpredictable SLA across all systems
- Diminished trust in any dashboard

Workflow

Metric Definition

Sandbox

Code Repository

Metric Owner

System Jobs

Core Metrics Job

Central Team, Relevant Stakeholders

1. iterate
2. create
3. review
4. check in

Metric Definition

Name Description Tags Owners

Dataset

Dimensions

Time

Script

Metrics

Entity Ids

Formulas

Entity Dimensions

Input Datasets

Time windows

UMP Data Flow

UMP Monitor

Primary Data (tracking, databases, external)

UMP Raw Data

UMP Aggregated

Data Relevance

Experiment analysis

Ad-hoc

Metrics Script

Data Prep agg cube

dimension verify

HDFS + Pinot

Dashboards

Ingest Process Serve Create

Espresso Kafka Databus

Real-time data: Databases

Espresso
- key-document storage
- secondary indexes
- transactional updates to related data
- real-time change-stream

Espresso DB

Databus

Downstream

Scale
- XXX TB un-replicated
- 1M+ qps, 0.1M+ wps
- XXXX machines

Real-time Data: Tracking

Kafka (Tracking)

Client-side Tracking

Tracking Frontend

Services

Downstream

Kafka
- high-volume pub-sub
- tunable guarantees

Scale
- ~1.5T messages/day
- ~17M messages/sec
- XXX machines
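As a quick sanity check, the two Kafka throughput figures above can be related with simple arithmetic. This is only a sketch: it assumes the per-second figure is a daily average, which the slide does not state.

```python
# Relate the quoted daily volume to an average per-second rate.
# Assumption: "~17M messages/sec" is an average over the day, not a peak.
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = 1.5e12  # ~1.5T messages/day, from the slide
avg_per_second = messages_per_day / SECONDS_PER_DAY

print(f"{avg_per_second / 1e6:.1f}M messages/sec on average")
```

Under that assumption the daily volume works out to roughly 17M messages per second, in line with the quoted rate.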

What about our acquisitions?

Slideshare

Where does this data live?

REST SFTP JDBC

What about other integrations?

• Salesforce
• Google DoubleClick
• Responsys
• Eloqua
• …

Where does this data live?

REST SFTP

Requirements
- Source Diversity
- Batch + Streaming
- Data Quality

Gobblin Architecture

Source

Work Unit

Extract

Convert

Quality

Data Publish
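The stages named on the architecture slide (Source, Work Unit, Extract, Convert, Quality, Data Publish) can be sketched as a small pipeline. This is an illustrative sketch only, not Gobblin's actual API: the `WorkUnit` class, the function names, the record fields, and the quality rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class WorkUnit:
    """A slice of a source that one task can pull independently (hypothetical shape)."""
    name: str
    records: List[dict]


def extract(unit: WorkUnit) -> Iterator[dict]:
    """Extract: read raw records from one work unit."""
    yield from unit.records


def convert(record: dict) -> dict:
    """Convert: normalize field names and types (illustrative rule only)."""
    return {"member_id": int(record["id"]), "event": record["event"].lower()}


def passes_quality(record: dict) -> bool:
    """Quality: row-level check that rejects records with a non-positive id."""
    return record["member_id"] > 0


def run_task(unit: WorkUnit, published: List[dict]) -> int:
    """Run extract -> convert -> quality -> publish for one work unit.

    Returns the number of records rejected by the quality check.
    """
    staged, rejected = [], 0
    for raw in extract(unit):
        rec = convert(raw)
        if passes_quality(rec):
            staged.append(rec)
        else:
            rejected += 1
    # Publish: make data visible only after the whole unit has been processed.
    published.extend(staged)
    return rejected


# The "source" is modeled as a list of work units with made-up records.
units = [
    WorkUnit("part-0", [{"id": "1", "event": "VIEW"}, {"id": "-5", "event": "CLICK"}]),
    WorkUnit("part-1", [{"id": "2", "event": "Click"}]),
]
published: List[dict] = []
rejected_total = sum(run_task(u, published) for u in units)
print(published, rejected_total)
```

Because each work unit is processed independently, units like these could in principle be handed to separate executors.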

Solving for real-time

Inefficiencies in batch

YARN based

Apache Helix

Continuous

Auto-scaling

Executor 1

Executor 2

Executor 3

Stream Source

Current Activity

Open source @ github.com/linkedin/gobblin

Adopted by LinkedIn, Intel, Swisscom, NerdWallet,…

@ LinkedIn
- ~100 TB per day
- Hundreds of datasets
- ~20 different sources

Under Development Metadata-driven

Hadoop-Hadoop copies

Hadoop Spark Samza

Transformation engines @ LinkedIn

Spark @ LinkedIn

Use-cases

- Machine Learning

- Photon ML (Machine learning library on Spark)

- Much faster iteration times

- Larger feature sets

By the Numbers

- ~XXX machines

- ~XXXX unique flows

Improvement

Feeds relevance 2 h to 14 m (9 x)

Jobs relevance 32 m to 1.3 m (24 x)

Ads relevance 24 h to 45 m (32 x)

Communication relevance 18 h to 30 m (36 x) with 10 x more features

Samza @ LinkedIn

Use-cases

- Standardization

- Email optimization

- Site-speed

By the Numbers

- ~XXX machines

- ~XXX jobs

- Stream processing framework

- Apache YARN for resource allocation

- Apache Kafka for stream storage

- Local state optimizations using RocksDB
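The idea of a stream task that keeps its state locally, next to the computation, can be sketched as follows. This is a toy model and not Samza's API: the class name, the message fields, and the sample events are hypothetical, a Python list stands in for a Kafka topic partition, and a dict stands in for an embedded store such as RocksDB.

```python
from collections import defaultdict


class CountPerMemberTask:
    """Counts events per member, keeping the running totals in task-local state."""

    def __init__(self):
        # A plain dict stands in for an embedded local store such as RocksDB.
        self.store = defaultdict(int)

    def process(self, message):
        """Handle one message from the input stream and emit the updated count."""
        member = message["member_id"]
        self.store[member] += 1
        return {"member_id": member, "count": self.store[member]}


# A list stands in for one partition of a Kafka topic (hypothetical events).
input_stream = [
    {"member_id": 7, "event": "email_open"},
    {"member_id": 9, "event": "email_open"},
    {"member_id": 7, "event": "email_click"},
]

task = CountPerMemberTask()
output_stream = [task.process(m) for m in input_stream]
print(output_stream)
```

Each lookup and update here touches only local state, so processing a message requires no remote call; that is the property the local-state optimization is after.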

There’s more out there…

Streams

Real-time. Interactive.

Slice and Dice metrics

Precompute!

Device Geo View

Android US 1

Android IN 1

iOS US 1

Dimension View

Android 2

Android,US 1

iOS,US 1

Android,IN 1

More dimensions!

Device Geo Carrier View

Android US ATT 1

Android IN Reliance 1

iOS US Verizon 1

Dimension View

Android 2

Reliance 1

Verizon 1

Android,US 1

... ...
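The precompute tables above can be reproduced with a short sketch that aggregates the raw rows over every combination of dimensions. The function name and field names are hypothetical, and this is a naive illustration of the idea rather than how any particular system materializes its aggregates.

```python
from collections import Counter
from itertools import combinations


def precompute(rows, dimensions, metric="views"):
    """Sum the metric for every non-empty combination of dimension values."""
    cube = Counter()
    for row in rows:
        for size in range(1, len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple(row[d] for d in dims)
                cube[(dims, key)] += row[metric]
    return cube


# Raw rows from the two-dimension slide.
rows_2d = [
    {"device": "Android", "geo": "US", "views": 1},
    {"device": "Android", "geo": "IN", "views": 1},
    {"device": "iOS", "geo": "US", "views": 1},
]
cube_2d = precompute(rows_2d, ["device", "geo"])

# Raw rows from the three-dimension slide.
rows_3d = [
    {"device": "Android", "geo": "US", "carrier": "ATT", "views": 1},
    {"device": "Android", "geo": "IN", "carrier": "Reliance", "views": 1},
    {"device": "iOS", "geo": "US", "carrier": "Verizon", "views": 1},
]
cube_3d = precompute(rows_3d, ["device", "geo", "carrier"])

print(cube_2d[(("device",), ("Android",))])            # views for Android
print(cube_2d[(("device", "geo"), ("Android", "US"))])  # views for Android,US
print(len(cube_2d), len(cube_3d))                       # precomputed entries
```

For these three raw rows, adding the carrier dimension grows the precomputed view from 7 entries to 19, which illustrates why precomputing every combination gets expensive as dimensions are added.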

Challenges
- Horizontally scalable
- Low latency
- Data freshness
- Fault tolerance
- OLAP features

Capabilities
- SQL-like interface (minus joins)
- Sub-second query latency
- Data load from Hadoop and Kafka

Pinot Data Flow

Kafka Hadoop

Samza Process

minutes

hour +

Pinot @ LinkedIn
- Site-facing Apps
- Reporting dashboards
- Monitoring

In production since 2012
Open sourced in 2015 @ github.com/linkedin/pinot

(S)QL: Filters and Aggs

SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011
  AND 'day' >= 15949 AND 'day' <= 15963
  AND paid = 'y'
  AND action = 'stop'

(S)QL: Group By

SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011
  AND 'day' >= 15949 AND 'day' <= 15963
  AND paid = 'y'
GROUP BY action

(S)QL: ORDER BY and LIMIT

SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011 AND entityId = 1000
  AND action = 'start'
ORDER BY creationTime DESC
LIMIT 1
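To see what these query shapes return, they can be tried against a tiny in-memory table. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not Pinot and says nothing about Pinot's query dialect or performance. The sample rows are made up, `day` is written as a double-quoted identifier so that SQLite treats it as a column, the selected column list in the group-by query is an addition for readability, and the last query filters on a single `entityId`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companyFollowHistoricalEvents ("
    "entityId INTEGER, day INTEGER, paid TEXT, action TEXT, creationTime INTEGER)"
)
# Hypothetical sample events: (entityId, day, paid, action, creationTime)
conn.executemany(
    "INSERT INTO companyFollowHistoricalEvents VALUES (?, ?, ?, ?, ?)",
    [
        (121011, 15950, "y", "stop", 100),
        (121011, 15955, "y", "start", 200),
        (121011, 15960, "y", "stop", 300),
        (121011, 15970, "y", "stop", 400),  # outside the day range
        (121011, 15951, "n", "stop", 500),  # not paid
        (999, 15952, "y", "stop", 600),     # different entity
    ],
)

COMMON_FILTER = (
    'entityId = 121011 AND "day" >= 15949 AND "day" <= 15963 AND paid = \'y\''
)

# Filters and aggregation
stop_count = conn.execute(
    "SELECT count(*) FROM companyFollowHistoricalEvents "
    f"WHERE {COMMON_FILTER} AND action = 'stop'"
).fetchone()[0]

# Group by
by_action = dict(
    conn.execute(
        "SELECT action, count(*) FROM companyFollowHistoricalEvents "
        f"WHERE {COMMON_FILTER} GROUP BY action"
    ).fetchall()
)

# Order by and limit: most recent 'start' event for one entity
latest_start = conn.execute(
    "SELECT * FROM companyFollowHistoricalEvents "
    "WHERE entityId = 121011 AND action = 'start' "
    "ORDER BY creationTime DESC LIMIT 1"
).fetchone()

print(stop_count, by_action, latest_start)
```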

Broker Helix

Real time Historical

Kafka Hadoop

Pinot Architecture

Queries

Raw Data Samza

Multi-tenant
- Declarative specification
- Role-specific assignment
- Seamless cluster expansion
- Powered by Apache Helix

{
  "resourceName": "MyStore",
  "numDataNodes": 4,
  "numBrokers": 2,
  "numCopies": 2
}
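One way to picture what a declarative specification buys you: the operator states only the desired counts, and a controller derives the concrete placement. The sketch below uses a simple round-robin rule to spread replicas over data nodes. It is an assumption-laden illustration, not how Apache Helix or Pinot actually compute assignments; the node naming scheme, the segment names, and the `assign_segments` helper are hypothetical.

```python
import json

# The specification from the slide, with the keys in reading order.
spec = json.loads(
    '{"resourceName": "MyStore", "numDataNodes": 4, "numBrokers": 2, "numCopies": 2}'
)


def assign_segments(spec, segments):
    """Place numCopies replicas of each segment on distinct data nodes (round-robin)."""
    nodes = [f"{spec['resourceName']}_data_{i}" for i in range(spec["numDataNodes"])]
    assignment = {}
    for idx, segment in enumerate(segments):
        assignment[segment] = [
            nodes[(idx + replica) % len(nodes)] for replica in range(spec["numCopies"])
        ]
    return assignment


assignment = assign_segments(spec, ["seg0", "seg1", "seg2", "seg3"])
for segment, replicas in assignment.items():
    print(segment, "->", replicas)
```

Expanding the cluster then amounts to changing `numDataNodes` in the specification and recomputing the placement, rather than editing per-node configuration by hand.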

Scale
- Largest cluster: xxx nodes
- # of stores: xxx

Columnar Storage

Forward Index

Fast but needs a ton of RAM

Single-node Optimizations
- Disk-based structures
- Indexes: inverted, bitmap, hybrid
- File optimization
- Compression: dictionary, p4delta
- Multi-valued columns, skip lists
- Vectorization: ~7x latency improvement
- Query rewrite
- Sketching algorithms: HyperLogLog
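Several of the terms above (columnar storage, forward index, dictionary compression, inverted index) fit together in a small sketch for a single column. This is a textbook-style illustration under simplifying assumptions, not Pinot's on-disk format; the function names and the sample column are hypothetical.

```python
from collections import defaultdict


def build_column(values):
    """Build a dictionary, forward index, and inverted index for one column."""
    # Dictionary encoding: each distinct value gets a small integer id.
    dictionary = sorted(set(values))
    id_of = {value: i for i, value in enumerate(dictionary)}
    # Forward index: document id -> dictionary id.
    forward = [id_of[value] for value in values]
    # Inverted index: dictionary id -> list of document ids holding that value.
    inverted = defaultdict(list)
    for doc_id, dict_id in enumerate(forward):
        inverted[dict_id].append(doc_id)
    return dictionary, forward, dict(inverted)


def docs_matching(value, dictionary, inverted):
    """Answer an equality filter using the inverted index instead of a full scan."""
    if value not in dictionary:
        return []
    return inverted[dictionary.index(value)]


device_column = ["Android", "Android", "iOS", "Android", "iOS"]
dictionary, forward, inverted = build_column(device_column)

print(dictionary)  # distinct values
print(forward)     # one small integer per row
print(docs_matching("iOS", dictionary, inverted))
```

The forward index supports scans and aggregations over compact integer ids, while the inverted index lets a filter such as `device = 'iOS'` jump straight to the matching rows.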

To pre-compute or not?

Data aware pre-computation

Speeding up the cycle

Form hypothesis Query Repeat

OR …

Breaking the cycle

Hadoop Spark Samza

Pinot

Espresso Kafka Databus

Real-Time Data at LinkedIn

Kapil Surlaker @kapilsurlaker

Shirshanka Das @shirshanka

Thanks!
