Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn


TRANSCRIPT

  • Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at LinkedIn

    Mar 16, 2017

    Shirshanka Das, Principal Staff Engineer, LinkedIn

    Yael Garten, Director of Data Science, LinkedIn

    @shirshanka, @yaelgarten

  • The Pursuit of #DataScienceHappiness

    @yaelgarten @shirshanka

  • Achieve Data Democracy

    Data Scientists write code

    Unleash Insights

    Share Learnings at Strata!

    Three (Naïve) Steps to #DataScienceHappiness, circa 2010


  • Achieving Data Democracy

    Helping everybody to access and understand data: breaking down silos, providing access to data when and where it is needed, at any given moment.

    Collect, flow, and store as much data as you can. Provide efficient access to data in all its stages of evolution.

  • The forms of data

    Key-Value++ · Message Bus · Fast-OLAP · Search · Graph · Crunchable

    Espresso · Venice · Pinot · Galene · Graph DB · Document DB · DynamoDB · Azure Blob, Data Lake Storage

  • The forms of data

    At Rest / In Motion

    Espresso · Venice · Pinot · Galene · Graph DB · Document DB · DynamoDB · Azure Blob, Data Lake Storage

  • The forms of data

    At Rest / In Motion

    In Motion scale: O(10) clusters, ~1.7 trillion messages, ~450 TB

    At Rest scale: O(10) clusters, ~10K machines, ~100 PB

  • At Rest / In Motion

    SFTP, JDBC, REST

    Data Integration

    Azure Blob, Data Lake Storage

  • Data Integration: key requirements

    Source, Sink Diversity

    Batch + Streaming

    Data Quality

    So, we built ...

  • Gobblin: Simplifying Data Integration @LinkedIn

    SFTP, JDBC, REST

    Hundreds of TB per day

    Thousands of datasets

    ~30 different source systems

    80%+ of data ingest

    Open source @ github.com/linkedin/gobblin

    Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, NerdWallet and many more

    Apache incubation under way

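    To make this concrete, here is a minimal sketch of a Gobblin job configuration (a .pull properties file). The class names come from Gobblin's bundled getting-started example and may differ by version; everything else is illustrative, not taken from the talk.

        # Sketch of a Gobblin .pull file, using the bundled Wikipedia example classes
        job.name=PullFromWikipedia
        job.group=Examples
        job.description=Minimal Gobblin ingestion job sketch

        # Source and converter: pluggable classes implement the extract step
        source.class=gobblin.example.wikipedia.WikipediaSource
        converter.classes=gobblin.example.wikipedia.WikipediaConverter

        # Sink: write Avro files to HDFS and publish them atomically
        writer.destination.type=HDFS
        writer.output.format=AVRO
        writer.builder.class=gobblin.writer.AvroDataWriterBuilder
        data.publisher.type=gobblin.publisher.BaseDataPublisher

    Swapping the source class (JDBC, SFTP, Kafka, REST) is what gives Gobblin its source/sink diversity without rewriting the rest of the pipeline.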

  • Query Engines

    At Rest / In Motion

    Processing Frameworks

  • Processing Frameworks


  • Kafka, Hadoop

    Samza Jobs

    Pinot

    minutes / hour+

    Distributed multi-dimensional OLAP. Columnar + indexes. No joins. Latency: low ms to sub-second.

    Query Engines

  • Site-facing Apps · Reporting dashboards · Monitoring

    Open source. In production @ LinkedIn, Uber
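    As an illustration, a query against Pinot circa this talk would be written in PQL, its SQL-like dialect. The table and column names below are hypothetical; note the aggregation-plus-GROUP BY shape and the absence of joins, which is what keeps latencies in the low-millisecond range.

        -- PQL-style sketch over a hypothetical tracking table
        SELECT pageKey, COUNT(*)
        FROM pageViewEvents
        WHERE daysSinceEpoch BETWEEN 17160 AND 17166
        GROUP BY pageKey
        TOP 10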

  • Data Infra 1.0 for Data Democracy (2010 - now)

    At Rest / In Motion

    Processing Frameworks

    Query Engines

  • Achieve Data Democracy

    Data Scientists write code

    Unleash Insights

    Share Learnings at Strata

  • How does LinkedIn build data-driven products?

    Data Scientist · PM · Designer · Engineer

    "We should enable users to filter connection suggestions by company."

    "How much do people utilize existing filter capabilities?"

    "Let's see how users send connection invitations today."

  • Tracking data records user activity

    InvitationClickEvent()

  • (powers metrics and data products)

    InvitationClickEvent()

    Scale fact: ~1,000 tracking event types; hundreds of metrics & data products

    Tracking data records user activity
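    The slides do not show producer code, but a minimal sketch of emitting a tracking event with the standard Kafka Java client might look like the following. The topic name, payload shape, and JSON encoding are assumptions for illustration; LinkedIn's real clients emit Avro-encoded events through internal tracking libraries.

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class TrackingSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // Hypothetical InvitationClickEvent payload, keyed by member id
                    String event = "{\"memberId\":12345,\"time\":1454745292951,"
                        + "\"eventName\":\"InvitationClickEvent\"}";
                    producer.send(new ProducerRecord<>("InvitationClickEvent", "12345", event));
                }
            }
        }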

  • Tracking Data Lifecycle

    Produce → Transport → Consume

    production code → user engagement tracking data → metric scripts

    Member-facing data products

    Business-facing decision making

  • Tracking Data Lifecycle & Teams

    Produce → Transport → Consume

    Product or App teams: PMs, Developers, TestEng

    Infra teams: Hadoop, Kafka, DWH, ...

    Data science teams: Analytics, ML Engineers, ...

    production code → user engagement tracking data → metric scripts

    Member-facing data products (Members)

    Business-facing decision making (Execs)

  • How do we calculate a metric: ProfileViews

    PageViewEvent, Record1:
    {"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"profile_page","trackingInfo":"Viewee=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

    Metric: ProfileViews = sum(PageViewEvent where pageKey = 'profile_page')

    PageViewEvent, Record101:
    {"memberId":12345,"time":1454745292951,"appName":"LinkedIn","pageKey":"new_profile_page","trackingInfo":"viewee_id=1214,lnl=f,nd=1,o=1214,^SP=pId-'pro_stars',rslvd=t,vs=v,vid=1214,ps=EDU|EXP|SKIL|..."}

    ... or should it be pageKey = 'new_profile_page'? The producing team renamed the page key but forgot to notify the metric owners. Undesirable.
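    A HiveQL-style sketch (hypothetical table name) makes the fragility concrete: the hard-coded page-key filter silently drops Record101 the moment the producer starts emitting new_profile_page.

        -- ProfileViews as originally defined: counts only the old page key
        SELECT COUNT(*) AS profile_views
        FROM page_view_event
        WHERE pagekey = 'profile_page';

        -- After the rename, correctness requires knowing every variant
        SELECT COUNT(*) AS profile_views
        FROM page_view_event
        WHERE pagekey IN ('profile_page', 'new_profile_page');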

  • Metrics ecosystem at LinkedIn: 3 years ago

    Operational challenges for infra teams. Diminished trust due to multiple sources of truth.

  • What was causing unhappiness?

    1. No contracts: downstream scripts broke when upstream changed.

    2. "Naming things is hard": different semantics & conventions in each team's data events meant emailing around to figure out the correct and complete logic to use: inefficient and potentially wrong.

    3. Discrepant metric logic: duplicate technology allowed duplicate logic, which allowed discrepant metric logic.

    So how did we solve this?

  • Data Modeling Tip: Say no to fragile formats or schema-free

    Invest in a mature serialization protocol like Avro, Protobuf, or Thrift for serializing the messages going to your persistent stores: Kafka, Hadoop, DBs, etc.

    1. No contracts  2. Naming things is hard  3. Discrepant metric logic

    We chose Avro as our format.
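    For example, a trimmed-down Avro schema (.avsc) for a PageViewEvent-like record could look as follows. The field set mirrors the records shown earlier; the exact schema is a sketch, not LinkedIn's production definition.

        {
          "type": "record",
          "name": "PageViewEvent",
          "namespace": "com.example.tracking",
          "fields": [
            {"name": "memberId", "type": "long"},
            {"name": "time", "type": "long"},
            {"name": "appName", "type": "string"},
            {"name": "pageKey", "type": "string"},
            {"name": "trackingInfo", "type": {"type": "map", "values": "string"}}
          ]
        }

    The win over schema-free JSON is that adding, removing, or retyping a field becomes an explicit, machine-checkable schema change instead of a silent breakage downstream.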

  • Sometimes you need a committee

    Data Model Review Committee (DMRC): leads from product and infra teams review each new data model and ensure that it follows our conventions, patterns, and best practices across the entire data lifecycle.

    1. No contracts  2. Naming things is hard  3. Discrepant metric logic

    Who and What · Evolution

    Tooling to codify conventions. Always be reviewing.

  • Unified Metrics Platform: a single source of truth for all business metrics at LinkedIn

    - a metrics processing platform as a service
    - a metrics computation template
    - a set of tools and processes to facilitate the metrics lifecycle

    1. No contracts  2. Naming things is hard  3. Discrepant metric logic

    Metric lifecycle: the Metric Owner (1) iterates in a Sandbox, (2) creates a Metric Definition, (3) gets it reviewed by the central team and relevant stakeholders, and (4) checks it into the Code Repo; Build & Deploy then runs it alongside the System Jobs and the Core Metrics Job.

    5,000 metrics daily

  • Unified Metrics Platform: Pipeline

    Raw Data (Tracking + Database + Other data, on HDFS) → Metrics Logic → UMP Harness (Incremental, Aggregate, Backfill, Auto-join) → Aggregated Data (HDFS, Pinot)

    Downstream: Raptor dashboards, Experiment Analysis, Machine Learning, Anomaly Detection, Ad-hoc

    1. No contracts  2. Naming things is hard  3. Discrepant metric logic
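    The talk does not show the computation template itself, but conceptually a UMP metric pairs owned metadata with a single reviewed piece of logic that every dashboard and experiment report reuses. A hypothetical HiveQL-style definition for ProfileViews:

        -- Hypothetical UMP-style metric logic: one reviewed definition,
        -- aggregated daily, reused by all downstream consumers
        SELECT to_date(from_unixtime(header.time DIV 1000)) AS metric_date,
               COUNT(*) AS profile_views
        FROM profile_view_event
        GROUP BY to_date(from_unixtime(header.time DIV 1000));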

  • Tracking Platform: standardizing production

    Schema compatibility · Time · Audit

    Client-side Tracking → Kafka

    Tracking Frontend · Services · Tools
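    One concrete piece of the schema-compatibility check can be expressed with Avro's own API. The sketch below (schemas inlined for brevity) shows how a platform can reject a producer schema change that would break existing readers.

        import org.apache.avro.Schema;
        import org.apache.avro.SchemaCompatibility;

        public class CompatCheck {
            public static void main(String[] args) {
                // Existing consumers read with this schema
                Schema reader = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"E\",\"fields\":"
                    + "[{\"name\":\"pageKey\",\"type\":\"string\"}]}");
                // A producer proposes this new schema (field removed, no default)
                Schema writer = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"E\",\"fields\":[]}");

                // Prints INCOMPATIBLE here: readers still expect pageKey
                System.out.println(SchemaCompatibility
                    .checkReaderWriterCompatibility(reader, writer)
                    .getType());
            }
        }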

  • Data Infra + Platforms 2.0 (circa 2015)

    At Rest / In Motion

    Processing Frameworks

    Query Engines

    Pinot

    Tracking Platform · Unified Metrics Platform (UMP)

    Production → Consumption

  • What was still causing unhappiness?

    1. Old bad data sticks around (e.g. old mobile app versions).
    2. No clear contract for data production: producers unaware of consumers' concerns.
    3. Never a good time to pay down this tech debt.

    We started from the bottom.

  • Product or App teams: PMs, Developers, TestEng

    Infra teams: Hadoop, Kafka, DWH, ...

    Data science teams: Analytics, ML Engineers, ...

    production code → user engagement tracking data → metric scripts

    Member-facing data products (Members)

    Business-facing decision making (Execs)

    3. Never a good time to pay down this "data" debt: features are waiting to ship to members, and some of this work is invisible. But what is the cost of not doing it?

    From #victimsOfTheData to #DataScienceHappiness, by proactively forging our own data destiny.

  • The Big Problem (Opportunity) in 2015: launch a completely rewritten LinkedIn mobile app

  • We already wanted to move to better data models

    PageViewEvent (old):
    {"header":{"memberId":12345,"time":1454745292951,"appName":{"string":"LinkedIn"},"pageKey":"profile_page"},"trackingInfo":{"Viewee":"23456", ...}}

    ProfileViewEvent (new):
    {"header":{"memberId":12345,"time":1454745292951,"appName":{"string":"LinkedIn"},"pageKey":"profile_page"},"entityView":{"viewType":"profile-view","viewerId":12345,"vieweeId":23456}}

    The viewee ID is promoted from a string buried in trackingInfo to a first-class vieweeId field.

  • There were two options:

    1. Keep the old tracking.
       a. Cost: producers (try to) replicate it (write bad old code from scratch).
       b. Save: consumers avoid migrating.

    2. Evolve.
       a. Cost: time on clean data modeling, and on consumer migration to new tracking events.
       b. Save: pays down data modeling tech debt.
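    If option 2 is chosen, one way to soften the consumer-migration cost (a sketch, not necessarily what LinkedIn did) is a transitional metric definition that unions old and new events while old app versions are still in the wild:

        -- Transitional ProfileViews during the migration window:
        -- old app versions still emit PageViewEvent, new ones emit ProfileViewEvent
        SELECT metric_date, SUM(views) AS profile_views
        FROM (
          SELECT to_date(from_unixtime(header.time DIV 1000)) AS metric_date,
                 COUNT(*) AS views
          FROM page_view_event
          WHERE pagekey IN ('profile_page', 'new_profile_page')
          GROUP BY to_date(from_unixtime(header.time DIV 1000))
          UNION ALL
          SELECT to_date(from_unixtime(header.time DIV 1000)) AS metric_date,
                 COUNT(*) AS views
          FROM profile_view_event
          GROUP BY to_date(from_unixtime(header.time DIV 1000))
        ) unioned
        GROUP BY metric_date;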

  • How much work would it be?

    #D