Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Self-Serve Reporting Platform on Hadoop Shirshanka Das Strata Singapore 2015

Upload: shirshanka-das

Post on 26-Jan-2017

TRANSCRIPT

Page 1: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Self-Serve Reporting Platform on Hadoop

Shirshanka Das Strata Singapore 2015

Page 4: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Reporting Pipelines

Stages: Ingest, Process, Serve, Visualize

Page 5: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Reporting at LinkedIn: Evolution

[Architecture diagram. Labels recovered from the slide: Sources; Oracle; Espresso; Kafka; External; Custom (three times); Hadoop; Teradata; Voldemort; Pinot; MySQL; INFA + MSTR; Scripts; Jobs; MSTR; Tableau; Internal Tools.]

Page 6: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Infra Scale

- Number of Hadoop clusters: 12
- Total number of machines: ~7k
- Largest cluster: ~3k machines
- Data volume generated per day: XX Terabytes
- Total accumulated data: XX Petabytes

Page 7: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

People Scale

- Reporting Platform Team: ~10
- Core Warehouse Team: 1x
- Data Scientists: 10x
- Business Analysts: 10x
- Product Managers: 10x
- Sales and Marketing: 100x

Page 8: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Challenges

- Disjointed efforts, unreliable systems
- Unpredictable SLA across all systems
- Fragmented data pipelines with inconsistent data

Page 10: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Houston, we have a problem

Page 11: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Step 1: Central transport pipeline

Page 12: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Still have a problem

Page 13: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Step 2: Central Ingestion Framework

Page 14: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

- Diverse Sources: REST, SFTP, JDBC
- Stream + Batch
- Data Quality

Open source @ github.com/linkedin/gobblin
In production @ LinkedIn, Intel, Swisscom, NerdWallet

@LinkedIn
- ~20 distinct source types
- Hundreds of TB per day
- Hundreds of datasets
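
The idea of one ingestion framework with pluggable source types and a built-in quality gate can be sketched roughly as follows. This is a minimal illustration in Python and not Gobblin's actual API: `Source`, `ListSource`, `require_fields` and `ingest` are hypothetical names invented for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, object]


class Source(Protocol):
    """Hypothetical interface each source type (REST, SFTP, JDBC, ...) would implement."""

    def extract(self) -> Iterator[Record]: ...


class ListSource:
    """Stand-in source that yields in-memory records instead of calling a real system."""

    def __init__(self, records: Iterable[Record]) -> None:
        self._records = list(records)

    def extract(self) -> Iterator[Record]:
        return iter(self._records)


def require_fields(*fields: str) -> Callable[[Record], bool]:
    """Build a row-level check: a record passes if every named field is present and not None."""
    def check(record: Record) -> bool:
        return all(record.get(f) is not None for f in fields)
    return check


def ingest(source: Source, checks: List[Callable[[Record], bool]]):
    """Pull every record from the source and split the records into accepted and rejected."""
    accepted, rejected = [], []
    for record in source.extract():
        (accepted if all(c(record) for c in checks) else rejected).append(record)
    return accepted, rejected


source = ListSource([
    {"member_id": 1, "action": "play"},
    {"member_id": None, "action": "play"},  # fails the quality check
])
good, bad = ingest(source, [require_fields("member_id", "action")])
print(len(good), len(bad))
```

Because `ingest` only depends on the `extract` method, adding a new source type in this sketch means writing one new class, while the quality checks stay shared across all sources.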

Page 15: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Unified Metrics Platform

Page 16: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Requirements

- Single Source of Truth
- Easy Onboarding
- Operability

Page 17: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Workflow

1. iterate
2. create
3. review
4. check in

[Workflow diagram. Labels recovered from the slide: Metric Definition; Sandbox; Code Repository; Metric Owner; System Jobs; Build; Core Metrics Job; Central Team, Relevant Stakeholders.]

Page 18: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Metric Definition

[Diagram. Labels recovered from the slide: Name; Description; Tags; Owners; Dataset; Dimensions; Time; Script; Metrics; Entity Ids; Tier; Formulas; Entity Dimensions; Input Datasets; Temporality.]

Page 19: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

An example: video play analysis

name: "video"
description: "Metrics for video tracking"
label: "video"
tags: [flagship, feed]
owners: [jdoe, jsmith]
enabled: true
retention: 90d
timestamp: timestamp
frequency: daily
script: video_play.pig
output_window: 1d
dimensions: [
  {
    name: platform
    doc: "phone, tablet or desktop"
  }
  {
    name: action_type
    doc: "click play or auto-play"
  }
]
input_datasets: [
  {
    name: actionsRaw
    path: Tracking.ActionEvent
    range: 1d
  }
]

Page 20: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

An example contd…

metrics: [
  {
    name: unique_viewers
    doc: "Count of unique viewers"
    formula: "unique(member_id)"
    tier: 2
    good_direction: "up"
  }
  {
    name: play_actions
    doc: "Sum of play actions"
    tier: 2
    formula: "sum(play_actions)"
    good_direction: "up"
  }
]
entity_ids: [
  {
    name: member_id
    category: member
  }
  {
    name: video_id
    category: video
  }
]
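
To make the two formulas in the example concrete, here is a rough sketch of what `unique(member_id)` and `sum(play_actions)` could mean when evaluated per combination of the declared dimensions. It is an illustration only, not how UMP computes metrics: the sample rows, the dimension values and the `compute_metrics` helper are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what a script such as video_play.pig might emit.
rows = [
    {"platform": "phone", "action_type": "click_play", "member_id": 1, "play_actions": 2},
    {"platform": "phone", "action_type": "click_play", "member_id": 1, "play_actions": 1},
    {"platform": "phone", "action_type": "click_play", "member_id": 2, "play_actions": 1},
    {"platform": "desktop", "action_type": "auto_play", "member_id": 3, "play_actions": 4},
]

dimensions = ["platform", "action_type"]


def compute_metrics(rows, dimensions):
    """Group rows by their dimension values, then apply both formulas per group."""
    viewers = defaultdict(set)  # backs unique(member_id)
    plays = defaultdict(int)    # backs sum(play_actions)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        viewers[key].add(row["member_id"])
        plays[key] += row["play_actions"]
    return {
        key: {"unique_viewers": len(viewers[key]), "play_actions": plays[key]}
        for key in viewers
    }


result = compute_metrics(rows, dimensions)
for key, values in sorted(result.items()):
    print(key, values)
```

Note that a distinct count such as `unique_viewers` cannot be summed across groups to get a total, which is one reason a metric definition needs to name its formula rather than just its output column.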

Page 21: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

UMP Data Flow

[Data flow diagram. Labels recovered from the slide: Primary Data (tracking, databases, external); Metrics Script; Data Prep; UMP Raw Data; agg; cube; dimension verify; UMP Aggregated Data; HDFS + Pinot; UMP Monitor; Dashboards; Relevance; Experiment analysis; Ad-hoc.]

Page 22: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

UMP by the numbers

- First version in production since early 2014
- Significant redesign in 2015
- Total amount of data being scanned per day: Hundreds of TBs
- Total number of metrics being computed: 2k+
- Total number of scripts: ~400
- Number of authors for these metrics: ~200
- Maximum number of dimensions per dataset: ~30
- Number of people responsible for upkeep of pipeline: 2

Page 23: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Learnings so far

Ease of onboarding
- Hard when you have > 1000 users with different skill sets
- Need great UX to complement developer-friendly alternatives

Single source of truth
- Not just a technology challenge
- Organization needs to rally around it

Operability
- Multi-tenant Hadoop pipeline with SLAs and QoS: hard
- Cost to serve: managing the metrics lifecycle is important

The Next Big Things
- Bridging streaming and batch
- Code-free metrics
- Sessions, Funnels, Cohorts
- Open source

Page 24: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Pinot

Page 25: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Capabilities

- SQL-like interface (minus joins)
- Sub-second query latency
- Data load from Hadoop and Kafka
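
As a rough illustration of what a SQL-like interface without joins still allows, the sketch below runs a filter, aggregate and group-by over a single table. It uses Python's built-in `sqlite3` purely as a stand-in so the example is runnable; it is not Pinot's query language or client, and the `video_metrics` table and its rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE video_metrics (day TEXT, platform TEXT, play_actions INTEGER)"
)
conn.executemany(
    "INSERT INTO video_metrics VALUES (?, ?, ?)",
    [
        ("2015-12-01", "phone", 5),
        ("2015-12-01", "desktop", 3),
        ("2015-12-02", "phone", 7),
    ],
)

# Filter, aggregate and group over one table; no JOIN clause is involved.
query = """
    SELECT platform, SUM(play_actions) AS plays
    FROM video_metrics
    WHERE day >= '2015-12-01'
    GROUP BY platform
    ORDER BY platform
"""
totals = conn.execute(query).fetchall()
print(totals)
conn.close()
```

Dashboard-style questions tend to fit this shape when the data is denormalized before loading, so that every dimension needed for filtering or grouping already sits in the one table being queried.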

Page 26: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Pinot Data Flow

[Data flow diagram. Labels recovered from the slide: Kafka; Hadoop; Samza; Process; Pinot; latency annotations: minutes, hour +.]

Page 27: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Pinot@LinkedIn

- Site-facing apps
- Reporting dashboards
- Monitoring

In production since 2012
Open source @ github.com/linkedin/pinot

Page 28: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Raptor

Page 29: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Standardize Visualization

Leverage
- Standalone app, with support for embedding
- Can use existing analytics backend: Pinot

Strategic
- Reduces dependency on 3rd party BI tools
- Closer integration with LinkedIn’s ecosystem of experimentation, anomaly detection solutions

Page 30: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Requirements

- Support apps ecosystem
- Core Visualization Capabilities
- Metadata Integration

Page 31: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Raptor 1.0

First version built by 3 engineers in a quarter

Features
- Integration with UMP, Pinot
- Time series, bar charts, …
- Create, Publish, Clone, Discover Dashboards

Numbers
- Number of dashboards: ~100
- Weekly unique users: ~400

Page 39: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

The Future for Raptor

- Social Collaboration features
- Intelligence
  - Anomaly detection
  - Dashboards You May Like
- Embedding into data products
- Open Source

Page 40: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

A Few Good Hammers

- Unified Metrics Platform
- Pinot
- Raptor

Page 41: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

What we’re excited about

- Unified Metrics Platform
- Pinot
- Raptor
- Metadata Bus

Page 42: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Metadata-driven e2e Optimizations

- Dynamic prioritization of data ingest
- Surface source data quality issues in dashboard
- Surface backfill status on dashboard
- Cascading deprecation of dashboards, computation and data sources through lineage
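
Cascading deprecation through lineage can be pictured as a graph walk: start at the asset being retired and collect everything downstream of it. The sketch below is one possible reading of that idea, not LinkedIn's implementation. The `downstream` map, the asset names in it and `cascade_deprecation` are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
downstream = {
    "Tracking.ActionEvent": ["video_play.pig"],
    "video_play.pig": ["video_metrics"],
    "video_metrics": ["video_dashboard", "feed_dashboard"],
    "video_dashboard": [],
    "feed_dashboard": [],
}


def cascade_deprecation(root, downstream):
    """Return every asset reachable from `root` in breadth-first order, root first."""
    seen = {root}
    order = [root]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = cascade_deprecation("video_play.pig", downstream)
print(affected)
```

This sketch flags every downstream asset as affected. A real policy would likely be more careful, for example only retiring a dashboard once all of its inputs are gone, and the same lineage map walked in the opposite direction could serve the other items on the list, such as surfacing source data quality issues on a dashboard.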

Page 43: Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop

Shirshanka Das @shirshanka

Catch me offline to chat about what we’re doing for
- Views on Hadoop
- Data Quality
- Metadata