Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn


Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

Mar 16, 2017

Shirshanka Das, Principal Staff Engineer, LinkedIn Yael Garten, Director of Data Science, LinkedIn

@shirshanka, @yaelgarten

The Pursuit of #DataScienceHappiness

@yaelgarten @shirshanka

Three (Naïve) Steps to #DataScienceHappiness, circa 2010:

1. Achieve Data Democracy
2. Data Scientists write code
3. Unleash Insights

Share Learnings at Strata!


Achieving Data Democracy

“… helping everybody to access and understand data .… breaking down silos… providing access to data when and where it is needed at any given moment.”

Collect, flow, and store as much data as you can.
Provide efficient access to data in all its stages of evolution.

The forms of data: Key-Value++, Message Bus, Fast-OLAP, Search, Graph, Crunchable. Systems include Espresso, Venice, Pinot, Galene, graph and document DBs, DynamoDB, and Azure Blob / Data Lake Storage.

The forms of data, split into At Rest and In Motion, across the same systems: Espresso, Venice, Pinot, Galene, graph and document DBs, DynamoDB, Azure Blob / Data Lake Storage.

The forms of data, At Rest and In Motion. Scale:

In Motion: O(10) clusters, ~1.7 trillion messages, ~450 TB
At Rest: O(10) clusters, ~10K machines, ~100 PB

Data Integration: pulling data from sources such as SFTP, JDBC, and REST into data At Rest and In Motion (including stores like Azure Blob / Data Lake Storage).

Data Integration: key requirements

Source and sink diversity
Batch + streaming
Data quality

So, we built Gobblin: simplifying data integration @ LinkedIn, across sources such as SFTP, JDBC, and REST.

Hundreds of TB per day

Thousands of datasets

~30 different source systems

80%+ of data ingest

Open source @ github.com/linkedin/gobblin

Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more…

Apache incubation under way


Architecture layers so far: data At Rest and In Motion, Processing Frameworks, Query Engines.

Kafka -> Samza jobs -> Pinot (minutes); Hadoop -> Pinot (hour+).

Pinot: distributed multi-dimensional OLAP; columnar + indexes; no joins; latency from low milliseconds to sub-second.
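To make Pinot's workload concrete, here is a hedged sketch (not from the talk) of the kind of filtered group-by aggregation it answers, issued through the open-source pinotdb Python client; the table name pageViewEvents, the columns, and the broker address are all assumptions.

# Hypothetical Pinot query: a filtered group-by aggregation with no joins,
# the shape of workload Pinot serves at millisecond latency.
# Uses the third-party `pinotdb` DB-API client; connection details are assumed.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
curs = conn.cursor()
curs.execute("""
    SELECT pageKey, COUNT(*) AS views
    FROM pageViewEvents
    WHERE appName = 'LinkedIn'
    GROUP BY pageKey
    ORDER BY views DESC
    LIMIT 10
""")
for page_key, views in curs:
    print(page_key, views)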

Pinot powers site-facing apps, reporting dashboards, and monitoring. Open source; in production @ LinkedIn and Uber.

Data Infra 1.0 for Data Democracy (2010 - now): data At Rest and In Motion, Processing Frameworks, Query Engines.

Back to the three (naïve) steps: Achieve Data Democracy; Data Scientists write code; Unleash Insights; Share Learnings at Strata.

How does LinkedIn build data-driven products?

Data Scientist, PM, Designer, Engineer:

"We should enable users to filter connection suggestions by company."

"How much do people utilize existing filter capabilities?"

"Let's see how users send connection invitations today."

Tracking data records user activity, e.g. InvitationClickEvent(); it powers metrics and data products.

Scale fact: ~1,000 tracking event types, hundreds of metrics & data products.

User engagement tracking data feeds both metric scripts and production code.

Tracking Data Lifecycle: Produce -> Transport -> Consume, powering member-facing data products and business-facing decision making.

Tracking Data Lifecycle & Teams: Produce (product or app teams: PMs, developers, TestEng) -> Transport (infra teams: Hadoop, Kafka, DWH, ...) -> Consume (data science teams: analytics, ML engineers, ...). User engagement tracking data feeds metric scripts and production code, which power member-facing data products for members and business-facing decision making for execs.

How do we calculate a metric? Example: ProfileViews.

PageViewEvent, Record 1:
{"memberId": 12345, "time": 1454745292951, "appName": "LinkedIn", "pageKey": "profile_page", "trackingInfo": "Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

Metric: ProfileViews = sum(PageViewEvent where pageKey = profile_page)

PageViewEvent, Record 101:
{"memberId": 12345, "time": 1454745292951, "appName": "LinkedIn", "pageKey": "new_profile_page", "trackingInfo": "viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

So the filter has to become pageKey = profile_page or new_profile_page. OK, but the producers forgot to notify the metric owners: undesirable.
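To make the fragility concrete, here is a minimal sketch (illustrative Python, not LinkedIn's actual metric scripts) of the ProfileViews computation; a filter written against only 'profile_page' silently undercounts once the rewritten app starts emitting 'new_profile_page'.

# Minimal sketch of the ProfileViews metric over PageViewEvent records.
# When producers rename the pageKey without notifying consumers, the original
# single-key filter silently undercounts.
def profile_views(events):
    """Count events whose pageKey marks a profile view."""
    # Originally only 'profile_page'; after the app rewrite, 'new_profile_page'
    # must be added as well.
    profile_page_keys = {"profile_page", "new_profile_page"}
    return sum(1 for e in events if e.get("pageKey") in profile_page_keys)

events = [
    {"memberId": 12345, "time": 1454745292951, "pageKey": "profile_page"},
    {"memberId": 12345, "time": 1454745292951, "pageKey": "new_profile_page"},
]
print(profile_views(events))  # 2, but only 1 with the original single-key filter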

Metrics ecosystem at LinkedIn, 3 years ago: operational challenges for infra teams, and diminished trust due to multiple sources of truth.

What was causing unhappiness?

1. No contracts: downstream scripts broke when upstream changed.

2. "Naming things is hard": different semantics & conventions in the various event types (per team), so consumers had to email around to figure out the correct and complete logic to use; inefficient and potentially wrong.

3. Discrepant metric logic: duplicate tech allowed duplicate logic, which allowed discrepant metric logic.

So how did we solve this?

Data Modeling Tip: say no to fragile formats or schema-free data.

Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing the messages going into your persistent stores: Kafka, Hadoop, DBs, etc.

1. No contracts 2. Naming things is hard 3. Discrepant metric logic

Chose Avro as our format
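As a hedged illustration of what choosing Avro buys (not LinkedIn's actual schema; the namespace and fields are invented for the example), a schema makes the event's structure explicit, and serialization fails when a producer violates it. The sketch uses the third-party fastavro package.

# Illustrative Avro schema for a tracking event (hypothetical fields/namespace).
from io import BytesIO
from fastavro import parse_schema, writer, reader

page_view_schema = parse_schema({
    "type": "record",
    "name": "PageViewEvent",
    "namespace": "com.example.tracking",  # assumption, not LinkedIn's namespace
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "time", "type": "long"},
        {"name": "appName", "type": "string"},
        {"name": "pageKey", "type": "string"},
    ],
})

record = {"memberId": 12345, "time": 1454745292951,
          "appName": "LinkedIn", "pageKey": "profile_page"}

buf = BytesIO()
writer(buf, page_view_schema, [record])  # raises if the record violates the schema
buf.seek(0)
print(next(reader(buf)))  # round-trips the structured record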

Sometimes you need a committee: the Data Model Review Committee (DMRC).

1. No contracts 2. Naming things is hard 3. Discrepant metric logic

Leads from product and infra teams review each new data model and ensure that it follows our conventions, patterns, and best practices across the entire data lifecycle: who and what, and how it evolves. Tooling to codify conventions; "Always be reviewing."

Unified Metrics Platform: a single source of truth for all business metrics at LinkedIn.

1. No contracts 2. Naming things is hard 3. Discrepant metric logic

- a metrics processing platform as a service
- a metrics computation template
- a set of tools and processes to facilitate the metrics lifecycle

Metric lifecycle: the metric owner creates a metric definition in a sandbox and iterates on it, the central team and relevant stakeholders review it, and it is checked into the code repo; from there it is built and deployed alongside the system jobs (core metrics job). ~5,000 metrics daily.

Unified Metrics Platform: Pipeline. Raw data on HDFS plus metrics logic run through the UMP harness (incremental aggregation, backfill, auto-join) to produce aggregated data on HDFS and in Pinot, feeding Raptor dashboards, experiment analysis, machine learning, anomaly detection, and ad-hoc analysis.

1. No contracts 2. Naming things is hard 3. Discrepant metric logic

Tracking + Database + Other data

Tracking Platform: standardizing production. Client-side tracking from the tracking frontend, services, and tools flows into Kafka, with schema compatibility, time, and audit handled by the platform.

Data Infra + Platforms 2.0 (circa 2015): data At Rest and In Motion, Processing Frameworks, Query Engines (Pinot), plus the Tracking Platform on the production side and the Unified Metrics Platform (UMP) on the consumption side.

What was still causing unhappiness?

1. Old bad data sticks around (e.g. old mobile app versions).
2. No clear contract for data production: producers unaware of consumers' concerns.
3. Never a good time to pay down this tech debt.

We started from the bottom.


3. Never a good time to pay down this "data" debt

#victimsOfTheData —> #DataScienceHappiness via proactively forging our own data destiny.

Features are waiting to ship to members... and some of this work is invisible. But what is the cost of not doing it?

The Big Problem (Opportunity) in 2015: launch a completely rewritten LinkedIn mobile app.

PageViewEvent {"header":{"memberId":12345,"time":1454745292951,"appName":{"string":"LinkedIn""pageKey":"profile_page"},},"trackingInfo":{["Viewee":"23456"], ...}}

We already wanted to move to better data models

ProfileViewEvent

{"header":{"memberId":12345,"time":4745292951145,"appName":{"string":"LinkedIn""pageKey":"profile_page"},},"entityView":{"viewType":"profile-view","viewerId":“12345”,

"vieweeId":“23456”,},}

viewee_ID
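A minimal sketch of why the new model is less fragile (hypothetical helper functions, not LinkedIn code): consumers of the old event have to dig the viewee out of trackingInfo, whose key names have drifted, while the new event exposes it as a schema-enforced field.

# Hypothetical helpers contrasting the two models (illustrative only).
def viewee_from_old(event):
    # Old PageViewEvent: the viewee id hides inside trackingInfo, and the key
    # has drifted over time (Viewee, viewee_ID, viewee_id, ...), so every
    # consumer needs its own lookup logic.
    tracking_info = event["trackingInfo"]
    for key in ("Viewee", "viewee_ID", "viewee_id"):
        if key in tracking_info:
            return tracking_info[key]
    raise KeyError("no viewee field found in trackingInfo")

def viewee_from_new(event):
    # New ProfileViewEvent: vieweeId is a first-class, schema-enforced field.
    return event["entityView"]["vieweeId"]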

There were two options:

1. Keep the old tracking. Cost: producers (try to) replicate it (write bad old code from scratch). Save: consumers avoid migrating.

2. Evolve. Cost: time on clean data modeling, and on consumer migration to new tracking events. Save: pays down data modeling tech debt.

How much work would it be?

#DataScienceHappiness

We pitched it to our leadership team. CTO, CEO: "Do it!"

2. No clear contract existed for data production. Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence.


#victimsOfTheData —> #DataScienceHappiness, via proactive joint requirements definition

Own the artifact that feeds the data ecosystem (and data scientists!)

Data producers (PM, app developers)

Data consumers (DS)

2a. Ensure dialogue between Producers & Consumers
• Awareness: train everyone about the end-to-end data pipeline and data modeling
• Instill a communication & collaborative ownership process between all: a step-by-step playbook for who develops and owns tracking, and how

2b. Standardized core data entities
• Event types and names: Page, Action, Impression (e.g. Navigation, Page View, Control Interaction)
• Framework-level client-side tracking: views, clicks, flows
• For all else (custom): a guide for when to create a new Event

2c. Created clear, maintainable data production contracts: a tracking specification with monitoring and alerting for adherence; a clear, visual, consistent contract.

Tooling is needed to support the culture and process shift ("Always be tooling"): the tracking specification tool. A sketch of what an adherence check can look like follows below.
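Here is a minimal, hypothetical sketch (not LinkedIn's tracking-spec tooling) of checking incoming events against a declared specification, so producers who drift from the contract can be flagged and alerted on.

# Hypothetical contract check for tracking events (illustrative only):
# required fields and allowed pageKey values stand in for a real spec.
REQUIRED_FIELDS = {"memberId", "time", "appName", "pageKey"}
ALLOWED_PAGE_KEYS = {"profile_page", "new_profile_page"}  # example values

def violations(event):
    """Return a list of contract violations for one event."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS - event.keys()]
    if "pageKey" in event and event["pageKey"] not in ALLOWED_PAGE_KEYS:
        problems.append(f"unknown pageKey: {event['pageKey']!r}")
    return problems

event = {"memberId": 12345, "time": 1454745292951, "appName": "LinkedIn"}
print(violations(event))  # ['missing field: pageKey'] -> alert the producer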

1. Old bad data sticks around

PageViewEvent:
{"header": {"memberId": 12345, "time": 1454745292951, "appName": {"string": "LinkedIn"}, "pageKey": "profile_page"}, "trackingInfo": {"vieweeID": "23456", ...}}

ProfileViewEvent:
{"header": {"memberId": 12345, "time": 1454745292951, "appName": {"string": "LinkedIn"}, "pageKey": "profile_page"}, "entityView": {"viewType": "profile-view", "viewerId": "12345", "vieweeId": "23456"}}

How do we handle old and new? Producers keep emitting both the old PageViewEvent and the new ProfileViewEvent, and consumers (Relevance, Analytics) have to read both.

The Big Challenge: our Pig scripts were doing

load "/data/tracking/PageViewEvent" using AvroStorage()

that is, reading my raw data directly. We need to move from "my raw data" to "my data API": we need "microservices" for data.

The database community solved this decades ago... Views!

We built Dali to solve this: a data access layer for LinkedIn.

Abstract away the underlying physical details to allow users to focus solely on the logical concerns.

Logical Tables + Views

Logical FileSystem

Solving with views: producers keep writing the old PageViewEvent and the new ProfileViewEvent; a LinkedInProfileView view maps the old event (filtered on pagekey == profile) and the new event (1:1) into a single logical dataset that consumers (Relevance, Analytics) read.
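Conceptually (a Python sketch under assumptions; Dali actually expresses this as view definitions, so the function below is only an illustration), the view normalizes both physical events into one logical shape so consumer scripts never see the old/new split.

# Conceptual sketch of the unifying view (not Dali's implementation).
def linkedin_profile_view(page_view_events, profile_view_events):
    """Yield normalized profile-view records from the old and new event streams."""
    # Old PageViewEvent: keep only profile pages, pull the viewee out of trackingInfo.
    for e in page_view_events:
        if e["header"]["pageKey"] == "profile_page":
            yield {"viewerId": e["header"]["memberId"],
                   "vieweeId": e["trackingInfo"]["Viewee"]}
    # New ProfileViewEvent: already modeled, map 1:1.
    for e in profile_view_events:
        yield {"viewerId": e["entityView"]["viewerId"],
               "vieweeId": e["entityView"]["vieweeId"]}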

Views ecosystem: the LinkedIn app feeds LinkedInProfileView, the Job Seeker App (JSA) feeds JSAProfileView, and a UnifiedProfileView sits on top of both for consumers.

Dali: Implementation Details in Context. Dali Datasets (tables + views) and the Dali FileSystem sit between data sources/sinks and the processing engines (MR, Spark), and are accessed through dataflow APIs (MR, Spark, Scalding) and query layers (Pig, Hive, Spark). Dataset owners check view definitions + UDFs into Git (+ Artifactory), work with the Dali CLI, and datasets are surfaced in the data catalog.

From

load ‘/data/tracking/PageViewEvent’ using AvroStorage();

To

load ‘tracking.UnifiedProfileView’ using DaliStorage();

One small step for a script

A Few Hard Problems: versioning views and UDFs; mapping to Hive metastore entities; the development lifecycle (Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment).

State of the world today: ~300 views; pretty much all new UMP metrics use Dali data sources (ProfileViews, MessagesSent, Searches, InvitationsSent, ArticlesRead, JobApplications, ...).

Now brewing: Dali on Kafka. Can we take the same views and run them seamlessly on Kafka as well? Stream data is read through standard streaming APIs: the Samza System Consumer and the Kafka Consumer.
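To ground the idea, here is a hedged sketch (not LinkedIn's Dali-on-Kafka implementation): the same logical filter a batch view applies at rest can be applied to events as they arrive on a Kafka topic. The topic name, broker address, and JSON encoding are assumptions, and the open-source kafka-python client stands in for LinkedIn's consumers.

# Applying view-style filtering to in-motion data (illustrative only).
import json
from kafka import KafkaConsumer  # third-party kafka-python package

consumer = KafkaConsumer(
    "PageViewEvent",                     # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Same logical predicate the batch view applies at rest: keep profile pages.
    if event.get("header", {}).get("pageKey") == "profile_page":
        print(event["header"]["memberId"], "viewed a profile")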

What’s next for Dali?

Selective materialization

Open source

Hive is an implementation detail, not a long term bet

Dali: When are we done dreaming?

Data Infra + Platforms 3.0 (circa 2017): data At Rest and In Motion, Processing Frameworks, Query Engines (Pinot), the Tracking Platform, the Unified Metrics Platform (UMP), Dali, Dr. Elephant, and WhereHows.

Did we succeed? We just handled another huge rewrite!

#DataScienceHappiness

The three (naïve) steps to #DataScienceHappiness, revisited: Achieve Data Democracy; Data Scientists write code; Unleash Insights; Share Learnings at Strata.

Our Journey towards #DataScienceHappiness:

1. Basic data infrastructure for data democracy
2. Platforms and process to standardize produce + consume
3. Evangelize investing in #DataScienceHappiness
4. Tech + process to sustain a healthy data ecosystem

Kafka, Hadoop, Gobblin, Pinot: 2010 ->
Tracking, UMP, DMRC: 2013 ->
Dali, Dialogue: 2015 ->

The Pursuit of #DataScienceHappiness

@yaelgarten @shirshanka

Thank You!

to be continued…