Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn


TRANSCRIPT

  • Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

    Mar 16, 2017

    Shirshanka Das, Principal Staff Engineer, LinkedIn · Yael Garten, Director of Data Science, LinkedIn

    @shirshanka, @yaelgarten

  • The Pursuit of #DataScienceHappiness

    @yaelgarten @shirshanka

  • Achieve Data Democracy

    Data Scientists write code

    Unleash Insights

    Share Learnings at Strata!

    Three (Naïve) Steps to #DataScienceHappiness, circa 2010

  • Achieving Data Democracy

    Helping everybody to access and understand data: breaking down silos, providing access to data when and where it is needed.

    Collect, flow, and store as much data as you can. Provide efficient access to data in all its stages of evolution.

  • The forms of data

    Key-Value++ · Message Bus · Fast-OLAP · Search · Graph · Crunchable

    Espresso, Venice · Pinot · Galene · Graph DB · Document DB · DynamoDB · Azure Blob, Data Lake Storage

  • The forms of data

    At Rest | In Motion

    Espresso · Venice · Pinot · Galene · Graph DB · Document DB · DynamoDB · Azure Blob, Data Lake Storage

  • The forms of data

    At Rest | In Motion

    In Motion (Kafka): O(10) clusters, ~1.7 trillion messages, ~450 TB

    At Rest (Hadoop): O(10) clusters, ~10K machines, ~100 PB

  • At Rest | In Motion

    SFTP · JDBC · REST

    Data Integration

    Azure Blob, Data Lake Storage

  • Data Integration: key requirements

    Source, Sink Diversity

    Batch + Streaming

    Data Quality

    So, we built Gobblin.

  • SFTP · JDBC · REST

    Simplifying Data Integration

    @LinkedIn

    Hundreds of TB per day

    Thousands of datasets

    ~30 different source systems

    80%+ of data ingest

    Open source @ github.com/linkedin/gobblin

    Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more

    Apache incubation under way

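    To make the ingestion model concrete, here is a minimal sketch of a Gobblin job definition in its .pull properties format, following the shape of Gobblin's published Kafka-to-HDFS example; the class names and values are illustrative, not LinkedIn's production config.

    # Hypothetical Gobblin job: pull tracking topics from Kafka into HDFS as Avro.
    job.name=TrackingEventIngest
    job.group=tracking
    source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
    topic.whitelist=PageViewEvent,ProfileViewEvent
    kafka.brokers=kafka.example.com:9092

    # Sink side: write Avro files and publish them atomically when the job succeeds.
    writer.builder.class=gobblin.writer.AvroDataWriterBuilder
    writer.destination.type=HDFS
    writer.output.format=AVRO
    data.publisher.type=gobblin.publisher.BaseDataPublisher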

  • Query Engines

    At Rest | In Motion

    Processing Frameworks

  • Processing Frameworks

  • Query Engines

    At Rest | In Motion

    Processing Frameworks

  • Kafka Hadoop

    Samza Jobs

    Pinot

    minutes | hour+

    Distributed multi-dimensional OLAP · Columnar + indexes · No joins · Latency: low ms to sub-second

    Query Engines

  • Site-facing Apps · Reporting dashboards · Monitoring

    Open source. In production @ LinkedIn, Uber
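    For a flavor of the query side, a hypothetical query in PQL, Pinot's SQL-like dialect of the time; the table and column names are invented for illustration.

    -- Profile views per country for one day, top 10 countries (hypothetical table).
    SELECT COUNT(*)
    FROM profileViewEvents
    WHERE daysSinceEpoch = 17167
    GROUP BY country
    TOP 10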

  • At Rest | In Motion

    Processing Frameworks

    Data Infra 1.0 for Data Democracy

    Query Engines

    2010 - now

  • Achieve Data Democracy

    Data Scientists write code

    Unleash Insights

    Share Learnings at Strata

  • How does LinkedIn build data-driven products?

    Data Scientist · PM · Designer

    Engineer

    We should enable users to filter connection suggestions by company

    How much do people utilize existing filter capabilities?

    Let's see how users send connection invitations today.

  • Tracking data records user activity

    InvitationClickEvent()

  • (powers metrics and data products)

    InvitationClickEvent()

    Scale fact: ~1,000 tracking event types; hundreds of metrics & data products

    Tracking data records user activity

  • user engagement tracking data

    metric scripts

    production code

    Tracking Data Lifecycle

    Produce → Transport → Consume

    Member facing data products

    Business facing decision making

  • Tracking Data Lifecycle & Teams

    Product or App teams: PMs, Developers, TestEng

    Infra teams: Hadoop, Kafka, DWH, ...

    Data science teams: Analytics, ML Engineers,...

    user engagement tracking data

    metric scripts

    production code

    Member facing data products

    Business facing decision making

    Produce → Transport → Consume

    Members Execs

  • How do we calculate a metric: ProfileViews

    PageViewEvent

    Record1:{"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"profile_page","trackingInfo":Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

    Metric: ProfileViews = sum(PageViewEvent where pagekey = profile_page )

    PageViewEvent

    Record101:{"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"new_profile_page","trackingInfo":"viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

    ... or new_profile_page

    OK, but the producer forgot to notify metric owners: undesirable.
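    In SQL-like pseudocode (column names taken from the records above), the failure mode looks like this: the metric hard-codes the page key, so an unannounced rename silently drops the new traffic until someone patches the filter.

    -- Original metric logic: counts only the old page key.
    SELECT COUNT(*) AS ProfileViews
    FROM PageViewEvent
    WHERE pageKey = 'profile_page';

    -- After the silent rename, every consumer must discover and patch this:
    SELECT COUNT(*) AS ProfileViews
    FROM PageViewEvent
    WHERE pageKey IN ('profile_page', 'new_profile_page');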

  • Metrics ecosystem at LinkedIn: 3 yrs ago

    Operational challenges for infra teams. Diminished trust due to multiple sources of truth.

  • What was causing unhappiness?

    1. No contracts: downstream scripts broke when upstream changed.

    2. "Naming things is hard": different semantics & conventions across teams' events --> need to email around to figure out the correct and complete logic to use --> inefficient and potentially wrong.

    3. Discrepant metric logic: duplicate tech allowed duplicate logic, which allowed discrepant metric logic.

    So how did we solve this?

  • Data Modeling Tip: Say no to fragile formats or schema-free

    Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing your messages to your persistent stores: Kafka, Hadoop, DBs, etc.

    1. No contracts 2. Naming things is hard 3. Discrepant metric logic

    Chose Avro as our format
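    As an illustration of the contract this buys (a hypothetical schema, not LinkedIn's actual one), a profile-view event might be declared in Avro like so; readers and writers are then checked against it instead of parsing free-form strings:

    {
      "type": "record",
      "name": "ProfileViewEvent",
      "namespace": "com.example.tracking",
      "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "time", "type": "long", "doc": "epoch millis"},
        {"name": "viewerId", "type": "long"},
        {"name": "vieweeId", "type": "long"},
        {"name": "viewType", "type": ["null", "string"], "default": null}
      ]
    }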

  • Sometimes you need a committee

    Leads from product and infra teams review each new data model and ensure that it follows our conventions, patterns, and best practices across the entire data lifecycle.

    1. No contracts 2. Naming things is hard 3. Discrepant metric logic

    Data Model Review Committee (DMRC)

    Tooling to codify conventions. Always be reviewing.

    Who and What Evolution

  • Unified Metrics Platform: a single source of truth for all business metrics at LinkedIn

    1. No contracts 2. Naming things is hard 3. Discrepant metric logic

    - metrics processing platform as a service
    - a metrics computation template
    - a set of tools and processes to facilitate the metrics lifecycle

    Central Team, Relevant Stakeholders

    Sandbox

    Metric Definition

    Code Repo

    Build & Deploy

    System Jobs · Core Metrics Job

    Metric Owner: 1. iterate → 2. create → 3. review → 4. check in

    5,000 metrics daily
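    UMP's template is internal, but conceptually a metric definition couples ownership metadata with declarative logic that the harness compiles into system jobs; a hypothetical sketch in HiveQL-style SQL, with all names invented:

    -- name: ProfileViews   owner: profile-team   grain: daily
    SELECT to_date(from_unixtime(CAST(time / 1000 AS BIGINT))) AS day,
           COUNT(*) AS profile_views
    FROM tracking.LinkedInProfileView   -- reads a logical view, not a raw event path
    GROUP BY to_date(from_unixtime(CAST(time / 1000 AS BIGINT)));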

  • Unified Metrics Platform: Pipeline

    Metrics Logic

    Raw Data

    Pinot

    UMP Harness: Incremental Aggregate · Backfill · Auto-join

    Raptor dashboards

    HDFS

    Aggregated Data

    Experiment

    Analysis

    Machine Learning

    Anomaly Detection

    HDFS

    Ad-hoc

    1. No contracts 2. Naming things is hard 3. Discrepant metric logic

    Tracking + Database + Other data

  • Tracking Platform: standardizing production

    Schema compatibility · Time · Audit

    Kafka · Client-side Tracking

    Tracking Frontend

    Services

    Tools
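    As a generic Avro example of the schema-compatibility rule (not LinkedIn's actual tooling): a new field may be added to an event only with a default, so readers holding the new schema can still decode records written with the old one, and a registration-time check rejects anything else.

    {"name": "sessionId", "type": ["null", "string"], "default": null}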

  • Query Engines

    At Rest | In Motion

    Processing Frameworks

    Data Infra + Platforms 2.0

    Pinot

    Tracking Platform · Unified Metrics Platform (UMP) · Production → Consumption

    circa 2015

  • What was still causing unhappiness?

    1. Old bad data sticks around (e.g., old mobile app versions).
    2. No clear contract for data production: producers unaware of consumers' concerns.
    3. Never a good time to pay down this tech debt.

    We started from the bottom.

  • Product or App teams: PMs, Developers, TestEng

    Infra teams: Hadoop, Kafka, DWH, ...

    Data science teams: Analytics, ML Engineers,...

    user engagement tracking data

    metric scripts

    production code

    Member facing data products

    Business facing decision making

    Members Execs

    3. Never a good time to pay down this "data" debt

    #victimsOfTheData → #DataScienceHappiness, via proactively forging our own data destiny.

    Features are waiting to ship to members... some of this work is invisible. But what is the cost of not doing it?

  • The Big Problem (Opportunity) in 2015: launch a completely rewritten LinkedIn mobile app

  • We already wanted to move to better data models

    PageViewEvent
    {
      "header": {
        "memberId": 12345,
        "time": 1454745292951,
        "appName": {"string": "LinkedIn"},
        "pageKey": "profile_page"
      },
      "trackingInfo": { "viewee_ID": "23456", ... }
    }

    ProfileViewEvent
    {
      "header": {
        "memberId": 12345,
        "time": 1454745292951,
        "appName": {"string": "LinkedIn"},
        "pageKey": "profile_page"
      },
      "entityView": {
        "viewType": "profile-view",
        "viewerId": 12345,
        "vieweeId": 23456
      }
    }

  • There were two options:

    1. Keep the old tracking:
       a. Cost: producers (try to) replicate it (re-write bad old code from scratch).
       b. Save: consumers avoid migrating.

    2. Evolve:
       a. Cost: time on clean data modeling and on consumer migration to new tracking events.
       b. Save: pays down data modeling tech debt.

  • How much work would it be?

    #DataScienceHappiness

  • We pitched it to our Leadership team

    Do it!

    CTO · CEO

  • 2. No clear contract existed for data production. Producers were unaware of consumers' needs and were "throwing data over the wall". Even with Avro, schema adherence != semantic equivalence.

    user engagement tracking data

    metric scripts

    production code

    Member facing data products

    Business facing decision making

    #victimsOfTheData → #DataScienceHappiness, via proactive joint requirements definition

    Own the artifact that feeds the data ecosystem (and data scientists!)

    Data producers (PM, app developers)

    Data consumers (DS)

  • 2a. Ensure dialogue between Producers & Consumers. Awareness: train about the end-to-end data pipeline and data modeling. Instill a communication & collaborative-ownership process between all: a step-by-step playbook for who develops and owns tracking, and how.

  • 2b. Standardized core data entities. Event types and names: Page, Action, Impression.

    Framework-level client-side tracking: views, clicks, flows.

    For all else (custom): a guide for when to create a new Event.

    Navigation

    Page View

    Control Interaction

  • 2c. Created clear maintainable data production contracts

    Tracking specification with monitoring and alerting for adherence: a clear, visual, consistent contract.

    Need tooling to support the culture and process shift: "Always be tooling."

    Tracking Specification Tool
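    What might such a contract look like? A purely hypothetical spec entry, just to show the shape (all field names invented):

    {
      "event": "ProfileViewEvent",
      "trigger": "member views another member's profile",
      "producer": "profile-team",
      "consumers": ["analytics", "relevance"],
      "requiredFields": ["viewerId", "vieweeId", "time"],
      "monitoring": {"alertOnMissingField": true}
    }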

  • 1. Old bad data sticks around

    PageViewEvent
    {
      "header": {
        "memberId": 12345,
        "time": 1454745292951,
        "appName": {"string": "LinkedIn"},
        "pageKey": "profile_page"
      },
      "trackingInfo": { "viewee_ID": "23456", ... }
    }

    ProfileViewEvent
    {
      "header": {
        "memberId": 12345,
        "time": 1454745292951,
        "appName": {"string": "LinkedIn"},
        "pageKey": "profile_page"
      },
      "entityView": {
        "viewType": "profile-view",
        "viewerId": 12345,
        "vieweeId": 23456
      }
    }

  • How do we handle old and new?

    PageViewEvent

    ProfileViewEvent

    Producers Consumers

    old

    new

    Relevance

    Analytics

  • The Big Challenge

    load '/data/tracking/PageViewEvent' using AvroStorage()

    (Pig scripts)

    My Raw Data

    Our scripts were going straight at the raw, physical data.

  • My Raw Data → My Data API

    We need "microservices" for data.

  • The Database community solved this decades ago...

    Views!

  • We built Dali to solve this

    A Data Access Layer for LinkedIn

    Abstract away underlying physical details to allow users to focus solely on the logical concerns.

    Logical Tables + Views

    Logical FileSystem

  • Solving With Views

    Producers:

      PageViewEvent (old): filtered on pagekey == profile
      ProfileViewEvent (new): mapped 1:1

    LinkedInProfileView

    Consumers: Relevance, Analytics
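    A minimal HiveQL sketch of such a view, with hypothetical column names (Dali maps views onto the Hive metastore, per the implementation notes later): the old event is filtered and remapped, while the new event passes through 1:1.

    CREATE VIEW LinkedInProfileView AS
    SELECT viewerId, vieweeId, ts FROM (
      -- Old world: profile views hide inside generic PageViewEvent records.
      SELECT header.memberId AS viewerId,
             trackingInfo['viewee_ID'] AS vieweeId,
             header.time AS ts
      FROM PageViewEvent
      WHERE header.pageKey IN ('profile_page', 'new_profile_page')
      UNION ALL
      -- New world: ProfileViewEvent already carries exactly these fields.
      SELECT entityView.viewerId, entityView.vieweeId, header.time AS ts
      FROM ProfileViewEvent
    ) unified;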

  • Views ecosystem


    Producers Consumers

    LinkedIn App → LinkedInProfileView

    Job Seeker App (JSA) → JSAProfileView

    UnifiedProfileView (combining both)

  • Dali: Implementation Details in Context

    Dali FileSystem

    Processing Engine (MR, Spark)

    Dali Datasets (Tables+Views)

    Dataflow APIs (MR, Spark, Scalding)

    Query Layers (Pig, Hive, Spark)

    Dali CLI

    Data Catalog

    Git + Artifactory

    View Def + UDFs

    Dataset Owner

    Data Source · Data Sink

  • From

    load '/data/tracking/PageViewEvent' using AvroStorage();

    To

    load 'tracking.UnifiedProfileView' using DaliStorage();

    One small step for a script

  • A Few Hard Problems

    Versioning: views and UDFs

    Mapping to Hive metastore entities

    Development lifecycle: Git as source of truth, Gradle for build, LinkedIn tooling integration for deployment

  • State of the world today

    ~300 views

    Pretty much all new UMP metrics use Dali data sources

    ProfileViews MessagesSent Searches InvitationsSent ArticlesRead JobApplications ...

    At Rest

    Data Processing Frameworks

  • Now brewing: Dali on Kafka

    Can we take the same views and run them seamlessly on Kafka as well?

    Stream Data

    Standard streaming APIs: Samza System Consumer · Kafka Consumer

  • What's next for Dali?

    Selective materialization

    Open source

    Hive is an implementation detail, not a long-term bet.

  • Dali: When are we done dreaming?

    At Rest | In Motion

    Data Processing Frameworks

    Dali

  • Query Engines

    At Rest | In Motion

    Processing Frameworks

    Data Infra + Platforms 3.0

    Pinot

    Tracking Platform Unified Metrics Platform (UMP)

    Dali · Dr. Elephant · WhereHows

    circa 2017

  • Did we succeed? We just handled another huge rewrite!

    #DataScienceHappiness

  • Achieve Data Democracy

    Data Scientists write code

    Unleash Insights

    Share Learnings at Strata

    Three (Naïve) Steps to #DataScienceHappiness

  • Basic data infrastructure for data democracy

    Platforms, Process to standardize produce + consume

    Evangelize investing in #DataScienceHappiness

    Tech + process to sustain a healthy data ecosystem

    Our Journey towards #DataScienceHappiness

    Kafka, Hadoop, Gobblin, Pinot: 2010 →

    Tracking, UMP, DMRC: 2013 →

    Dali, Dialogue: 2015 →

  • The Pursuit of #DataScienceHappiness

    @yaelgarten @shirshanka

    Thank You!

    to be continued