The Data Driven Network Kapil Surlaker Director of Engineering Real-Time Data at LinkedIn Kapil Surlaker and Shirshanka Das XLDB 2016


Page 1: The Data Driven Network Real-Time Data at LinkedIn...Kafka Hadoop Pinot Architecture Queries Raw Data Samza Multi-tenant Declarative specification Role-specific assignment Seamless

The Data Driven Network

Kapil Surlaker Director of Engineering

Real-Time Data at LinkedIn

Kapil Surlaker and Shirshanka Das XLDB 2016

Page 2

What does real-time data mean?

Real-time Ingestion

Stream Processing

Real-time Serving

Batch Processing

Page 3

Real-time Data: Use Cases

- In-product analytics
- Reporting
- Features for ML models (impression discounting)
- Standardization
- Monitoring, Alerting

[Diagram labels: Member-Facing, Business-Facing, Data, Transform, linkedin.com, corp.linkedin.com]

Page 4

Every pipeline looks like this

Create -> Ingest -> Process -> Serve

Page 5

The Unified Metrics Platform

How
- Single source of truth for metrics
- Centralized multi-tenant pipeline
- Easy onboarding

Why
- Disjointed efforts, unreliable systems
- Fragmented data pipelines across online and offline
- Unpredictable SLA across all systems
- Diminished trust in any dashboard

Page 6

Workflow

[Diagram labels: Metric Owner, Metric Definition, Sandbox, Code Repository, Central Team and Relevant Stakeholders, Build, System Jobs, Core Metrics Job]

1. iterate
2. create
3. review
4. check in

Page 7

Metric Definition

- Name
- Description
- Tags
- Owners
- Dataset
- Dimensions
- Time
- Script
- Metrics
- Entity Ids
- Tier
- Formulas
- Entity Dimensions
- Input Datasets
- Time windows
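A declarative metric definition of this kind could be captured as a small typed record. The sketch below is a hypothetical Python illustration only: the field names follow the labels on the slide, but the types, the example values, and the validation rules are assumptions, not UMP's actual schema.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class MetricDefinition:
    # Field names mirror the slide labels; every type here is assumed.
    name: str
    description: str
    owners: List[str]
    tier: int
    input_datasets: List[str]
    dimensions: List[str]
    entity_ids: List[str]
    time_windows: List[str]
    formulas: Dict[str, str]
    script: str
    tags: List[str] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return human-readable problems; these rules are illustrative."""
        found = []
        if not self.owners:
            found.append("at least one owner is required")
        if not self.input_datasets:
            found.append("at least one input dataset is required")
        if not self.formulas:
            found.append("at least one formula is required")
        return found


# Hypothetical example values, not taken from the talk.
page_views = MetricDefinition(
    name="page_views",
    description="Count of page view events",
    owners=["metrics-team"],
    tier=1,
    input_datasets=["tracking.PageViewEvent"],
    dimensions=["device", "geo"],
    entity_ids=["memberId"],
    time_windows=["daily", "weekly"],
    formulas={"views": "SUM(view)"},
    script="page_views.script",
)

# The same definition with the owner list emptied fails the owner check.
orphaned = replace(page_views, owners=[])
print(page_views.problems(), orphaned.problems())
```

Keeping the definition as data rather than code is what lets a central pipeline review it, build it, and run it on the owner's behalf.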

Page 8

UMP Data Flow

[Diagram labels: Primary Data (tracking, databases, external), UMP Raw Data, UMP Aggregated Data, HDFS + Pinot, UMP Monitor]

- Steps: Metrics Script, Data Prep, agg, cube, dimension, verify
- Consumers: Dashboards, Relevance, Experiment analysis, Ad-hoc

Page 9

Create -> Ingest -> Process -> Serve

Espresso, Kafka, Databus

Page 10

Real-time Data: Databases

Espresso
- key-document storage
- secondary indexes
- transactional updates to related data
- real-time change-stream

[Diagram labels: Apps, Espresso DB, Databus, Downstream]

Scale
- XXX TB un-replicated
- 1M+ qps, 0.1M+ wps
- XXXX machines

Page 11

Real-time Data: Tracking

[Diagram labels: Client-side Tracking, Tracking Frontend, Services, Kafka (Tracking), Downstream]

Kafka
- high-volume pub-sub
- tunable guarantees

Scale
- ~1.5T messages/day
- ~17M messages/sec
- XXX machines
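The two throughput figures can be cross-checked with simple arithmetic. The snippet below assumes the per-second number is a daily average; the slide does not say whether it is an average or a peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

messages_per_day = 1.5e12  # ~1.5T messages/day, from the slide
avg_messages_per_sec = messages_per_day / SECONDS_PER_DAY

print(f"{avg_messages_per_sec / 1e6:.1f}M messages/sec on average")
```

Under that assumption the daily total works out to roughly 17.4M messages/sec, which is in line with the ~17M on the slide.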

Page 12

What about our acquisitions?

Slideshare

Bizo

Lynda

Where does this data live?

REST, SFTP, JDBC

Page 13

What about other integrations?

• Salesforce
• Google DoubleClick
• Responsys
• Eloqua
• …

Where does this data live?

REST, SFTP, JDBC

Page 14

Create -> Ingest -> Process -> Serve

Page 15

Requirements

- Source Diversity
- Batch + Streaming
- Data Quality

Page 16

Gobblin Architecture

Page 17

[Diagram: a Source produces three Work Units; each Work Unit runs as a Task through Extract -> Convert -> Quality -> Write; the written outputs feed Data Publish]
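The stage order in the diagram can be sketched as a chain of small functions. This is a toy Python illustration under stated assumptions, not Gobblin's API (Gobblin itself is a Java framework): every name below (`WorkUnit`, `make_work_units`, `run_task`, and so on) is hypothetical, and the convert and quality rules are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class WorkUnit:
    """A slice of the source that one task processes independently."""
    records: List[str]


def make_work_units(raw: List[str], unit_size: int) -> List[WorkUnit]:
    # Source: partition the input into independent work units.
    return [WorkUnit(raw[i:i + unit_size]) for i in range(0, len(raw), unit_size)]


def extract(unit: WorkUnit) -> Iterator[str]:
    yield from unit.records


def convert(records: Iterable[str]) -> Iterator[str]:
    # Placeholder conversion: normalize whitespace and case.
    for record in records:
        yield record.strip().lower()


def quality(records: Iterable[str]) -> Iterator[str]:
    # Placeholder quality check: drop empty records.
    for record in records:
        if record:
            yield record


def write(records: Iterable[str]) -> List[str]:
    # Stage the task output; a real writer would target a file system.
    return list(records)


def run_task(unit: WorkUnit) -> List[str]:
    return write(quality(convert(extract(unit))))


def publish(staged: List[List[str]]) -> List[str]:
    # Data Publish: make all task outputs visible together.
    return [record for output in staged for record in output]


raw = [" Alpha", "BETA ", "  ", "gamma"]
units = make_work_units(raw, unit_size=2)
published = publish([run_task(unit) for unit in units])
print(published)
```

Because each work unit is processed by its own task, the tasks could run in parallel; only the final publish step needs to see all outputs.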

Page 18

Solving for real-time

- Inefficiencies in batch
- YARN based
- Apache Helix
- Continuous
- Auto-scaling

[Diagram labels: YARN, Helix, Executor 1, Executor 2, Executor 3, HDFS, Stream Source]

Page 19

Current Activity

Open source @ github.com/linkedin/gobblin
- Adopted by LinkedIn, Intel, Swisscom, NerdWallet, …

@ LinkedIn
- ~100 TB per day
- Hundreds of datasets
- ~20 different sources

Under Development
- Metadata-driven
- Hadoop-Hadoop copies

Page 20

Create -> Ingest -> Process -> Serve

Hadoop, Spark, Samza

Page 21

Transformation engines @ LinkedIn

Page 22

Spark @ LinkedIn

Use-cases
- Machine Learning
- Photon ML (Machine learning library on Spark)
- Much faster iteration times
- Larger feature sets

By the Numbers
- ~XXX machines
- ~XXXX unique flows

Improvement
- Feeds relevance: 2 h to 14 m (9x)
- Jobs relevance: 32 m to 1.3 m (24x)
- Ads relevance: 24 h to 45 m (32x)
- Communication relevance: 18 h to 30 m (36x) with 10x more features
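Each improvement factor is the old runtime divided by the new one. The short check below recomputes the factors from the before and after times on the slide; it suggests the slide's figures are rounded (for example, 2 h to 14 m is about 8.6x, and 32 m to 1.3 m is about 24.6x).

```python
# (before, after) runtimes in minutes, transcribed from the slide.
runtimes_min = {
    "Feeds relevance": (2 * 60, 14),
    "Jobs relevance": (32, 1.3),
    "Ads relevance": (24 * 60, 45),
    "Communication relevance": (18 * 60, 30),
}

# Speedup factor = time before / time after.
speedups = {
    name: before / after for name, (before, after) in runtimes_min.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```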

Page 23

Samza @ LinkedIn

Use-cases
- Standardization
- Email optimization
- Site-speed

By the Numbers
- ~XXX machines
- ~XXX jobs

What
- Stream processing framework
- Apache YARN for resource allocation
- Apache Kafka for stream storage
- Local state optimizations using RocksDB

Page 24

There’s more out there…

Heron

2.0

Streams

Page 25

Create -> Ingest -> Process -> Serve

Pinot

Page 26

Real-time. Interactive.

Page 27

Slice and Dice metrics

Page 28

Precompute!

Device   Geo  View
Android  US   1
Android  IN   1
iOS      US   1

Dimension   View
Android     2
iOS         1
US          2
IN          1
Android,US  1
iOS,US      1
Android,IN  1
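The pre-computed table above can be reproduced by summing the View metric over every non-empty combination of dimensions. The following is a minimal Python sketch of that idea, not Pinot's implementation; the function name `precompute` and the comma-joined key format are assumptions chosen to match the slide's layout.

```python
from collections import Counter
from itertools import combinations

# Raw rows from the slide.
rows = [
    {"Device": "Android", "Geo": "US", "View": 1},
    {"Device": "Android", "Geo": "IN", "View": 1},
    {"Device": "iOS", "Geo": "US", "View": 1},
]
dimensions = ["Device", "Geo"]


def precompute(rows, dimensions, metric="View"):
    """Sum the metric for every non-empty subset of the dimensions."""
    cube = Counter()
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                # Keys are comma-joined values, e.g. "Android,US".
                key = ",".join(row[d] for d in dims)
                cube[key] += row[metric]
    return dict(cube)


cube = precompute(rows, dimensions)
for key, views in cube.items():
    print(key, views)
```

With the totals stored ahead of time, a slice-and-dice query becomes a single lookup, at the cost of storing one row per observed combination. A production version would also need to keep values from different dimensions apart, which this sketch does not do.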

Page 29

More dimensions!

Device   Geo  Carrier   View
Android  US   ATT       1
Android  IN   Reliance  1
iOS      US   Verizon   1

Dimension   View
Android     2
iOS         1
US          2
IN          1
ATT         1
Reliance    1
Verizon     1
Android,US  1
...         ...
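Adding a dimension multiplies the work: the number of dimension groupings doubles, and the worst-case number of pre-computed rows grows with the product of the cardinalities. The sketch below works through that arithmetic for the two examples; the helper names are hypothetical, and the row bound assumes every combination of values actually occurs.

```python
from typing import List


def num_groupings(num_dimensions: int) -> int:
    """Number of non-empty subsets of the dimensions."""
    return 2 ** num_dimensions - 1


def max_precomputed_rows(cardinalities: List[int]) -> int:
    """Upper bound on rows: each dimension is either fixed to one of its
    values or left out, minus the case where all are left out."""
    total = 1
    for cardinality in cardinalities:
        total *= cardinality + 1
    return total - 1


# Device and Geo have 2 distinct values each; Carrier has 3.
two_dims = max_precomputed_rows([2, 2])
three_dims = max_precomputed_rows([2, 2, 3])
print(num_groupings(2), two_dims)
print(num_groupings(3), three_dims)
```

The two-dimension example stores 7 rows rather than the bound of 8 because the combination iOS,IN never occurs in the raw data.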

Page 30

Challenges

- Horizontally scalable
- Low latency
- Data freshness
- Fault tolerance
- OLAP features

Page 31

Capabilities

- SQL-like interface (minus joins)
- Sub-second query latency
- Data load from Hadoop and Kafka

Page 32

Pinot Data Flow

[Diagram labels: Kafka, Hadoop, Samza, Process, Pinot; latency labels: minutes, hour +]

Page 33

Pinot @ LinkedIn

- Site-facing Apps
- Reporting dashboards
- Monitoring

In production since 2012
Open sourced in 2015 @ github.com/linkedin/pinot

Page 34

(S)QL: Filters and Aggs

SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011
  AND 'day' >= 15949
  AND 'day' <= 15963
  AND paid = 'y'
  AND action = 'stop'

Page 35

(S)QL: Group By

SELECT count(*)
FROM companyFollowHistoricalEvents
WHERE entityId = 121011
  AND 'day' >= 15949
  AND 'day' <= 15963
  AND paid = 'y'
GROUP BY action

Page 36

(S)QL: ORDER BY and LIMIT

SELECT *
FROM companyFollowHistoricalEvents
WHERE entityId = 121011
  AND entityId = 1000
  AND action = 'start'
ORDER BY creationTime DESC
LIMIT 1
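To see what these three query shapes return, they can be tried against a few rows in SQLite through Python's `sqlite3` module. This is only an approximation of the semantics, not Pinot itself: the sample rows and column types are invented, `day` is written as a double-quoted identifier (SQLite would treat a single-quoted 'day' as a string literal), the GROUP BY query selects `action` explicitly, and the last query uses a single `entityId` predicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE companyFollowHistoricalEvents (
           entityId INTEGER, "day" INTEGER, paid TEXT,
           action TEXT, creationTime INTEGER)"""
)

# Hypothetical sample rows: (entityId, day, paid, action, creationTime).
sample_rows = [
    (121011, 15940, "y", "start", 50),   # before the day range
    (121011, 15950, "y", "start", 100),
    (121011, 15955, "y", "stop", 200),
    (121011, 15960, "y", "stop", 300),
    (121011, 15970, "y", "stop", 400),   # after the day range
    (121011, 15952, "n", "stop", 500),   # not paid
    (999, 15955, "y", "stop", 600),      # different entity
]
conn.executemany(
    "INSERT INTO companyFollowHistoricalEvents VALUES (?, ?, ?, ?, ?)",
    sample_rows,
)

RANGE_FILTER = (
    "entityId = 121011 AND \"day\" >= 15949 AND \"day\" <= 15963 AND paid = 'y'"
)

# Filters and aggregation.
stop_count = conn.execute(
    "SELECT count(*) FROM companyFollowHistoricalEvents "
    f"WHERE {RANGE_FILTER} AND action = 'stop'"
).fetchone()[0]

# Group by.
by_action = dict(
    conn.execute(
        "SELECT action, count(*) FROM companyFollowHistoricalEvents "
        f"WHERE {RANGE_FILTER} GROUP BY action"
    ).fetchall()
)

# ORDER BY and LIMIT: the most recent 'start' event for the entity.
latest_start = conn.execute(
    """SELECT * FROM companyFollowHistoricalEvents
       WHERE entityId = 121011 AND action = 'start'
       ORDER BY creationTime DESC LIMIT 1"""
).fetchone()

print(stop_count, by_action, latest_start)
```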

Page 37

Pinot Architecture

[Diagram labels: Queries, Broker, Helix, Real time, Historical, Kafka, Hadoop, Samza, Raw Data]

Page 38

Multi-tenant

- Declarative specification
- Role-specific assignment
- Seamless cluster expansion
- Powered by Apache Helix

{
  "resourceName": "MyStore",
  "numDataNodes": 4,
  "numBrokers": 2,
  "numCopies": 2
}

Scale
- Largest cluster: xxx nodes
- # of stores: xxx
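One way to read the declarative specification is as input to an allocator that hands out instances by role and then spreads copies of the data across the data nodes. The Python sketch below is purely illustrative and is not how Apache Helix or Pinot performs assignment: the host names, the `assign_roles` and `place_segments` helpers, and the round-robin placement are all assumptions.

```python
import json
from typing import Dict, List

spec = json.loads(
    """
    {
      "resourceName": "MyStore",
      "numDataNodes": 4,
      "numBrokers": 2,
      "numCopies": 2
    }
    """
)


def assign_roles(spec: dict, free_instances: List[str]) -> Dict[str, List[str]]:
    """Take instances from a shared pool, one disjoint group per role."""
    needed = spec["numDataNodes"] + spec["numBrokers"]
    if len(free_instances) < needed:
        raise ValueError(f"need {needed} instances, have {len(free_instances)}")
    return {
        "dataNodes": free_instances[: spec["numDataNodes"]],
        "brokers": free_instances[spec["numDataNodes"]:needed],
    }


def place_segments(
    num_segments: int, data_nodes: List[str], num_copies: int
) -> Dict[int, List[str]]:
    """Round-robin placement: each segment lands on num_copies distinct nodes."""
    if num_copies > len(data_nodes):
        raise ValueError("numCopies cannot exceed the number of data nodes")
    return {
        segment: [
            data_nodes[(segment + k) % len(data_nodes)] for k in range(num_copies)
        ]
        for segment in range(num_segments)
    }


pool = [f"host{i}" for i in range(8)]  # hypothetical shared cluster
roles = assign_roles(spec, pool)
placement = place_segments(4, roles["dataNodes"], spec["numCopies"])
print(roles)
print(placement)
```

Because the specification only states counts, expanding the store amounts to raising a number and letting the allocator recompute the assignment.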

Page 39

Columnar Storage

Page 40

Forward Index

Page 41

Fast but needs a ton of RAM

Page 42

Single-node Optimizations

- Disk-based structures
- Indexes: inverted, bitmap, hybrid
- File optimization
- Compression: dictionary, p4delta
- Multi-valued columns, skip lists
- Vectorization: ~7x latency improvement
- Query rewrite
- Sketching algorithms: HyperLogLog
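Dictionary compression, a forward index, and an inverted index fit together in a simple way for a single column. The sketch below shows the general technique in Python on the Device and Geo columns from the earlier example; it is not Pinot's on-disk format, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Column = Tuple[List[str], List[int], Dict[int, List[int]]]


def build_column(values: List[str]) -> Column:
    """Return (dictionary, forward index, inverted index) for one column."""
    # Dictionary: sorted distinct values; a value's position is its id.
    dictionary = sorted(set(values))
    value_to_id = {value: i for i, value in enumerate(dictionary)}
    # Forward index: row number -> dictionary id.
    forward = [value_to_id[value] for value in values]
    # Inverted index: dictionary id -> list of row numbers.
    inverted = defaultdict(list)
    for row, dict_id in enumerate(forward):
        inverted[dict_id].append(row)
    return dictionary, forward, dict(inverted)


def matching_rows(column: Column, value: str) -> Set[int]:
    """Rows where the column equals value, answered from the inverted index."""
    dictionary, _, inverted = column
    if value not in dictionary:
        return set()
    return set(inverted[dictionary.index(value)])


device = build_column(["Android", "Android", "iOS"])
geo = build_column(["US", "IN", "US"])

# Device = 'Android' AND Geo = 'US' becomes a set intersection.
hits = matching_rows(device, "Android") & matching_rows(geo, "US")
print(device)
print(geo)
print(hits)
```

The forward index stores small integers instead of repeated strings, and the inverted index turns an equality filter into a lookup rather than a scan.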

Page 43

To pre-compute or not?

Page 44

Data aware pre-computation

Page 45

Speeding up the cycle

Form hypothesis -> Query -> Repeat

Page 46

Breaking the cycle

Form hypothesis -> Query -> Repeat

OR …

Page 47

Page 48

Real-Time Data at LinkedIn

Create -> Ingest -> Process -> Serve

Espresso, Kafka, Databus

Hadoop, Spark, Samza

Pinot

Page 49

Kapil Surlaker @kapilsurlaker

Shirshanka Das @shirshanka

Thanks!