Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka


TRANSCRIPT

  • Real-time Data Integration with Apache Flink & Kafka @Bouygues Telecom
    Mohamed Amine ABDESSEMED
    Flink Forward, Berlin, October 2015

  • About Me
    Software engineer & solution architect @ Bouygues Telecom
    My daily toolbox:
    Hadoop ecosystem
    Apache Flink
    Apache Kafka
    Elasticsearch
    Apache Camel
    And more.
    If you don't see me coding, I'm probably outside running.

  • Outline
    Who we are
    Logged User eXperience
    Challenges
    Typical Data Flow pipeline on Hadoop
    Real-time Data Integration
    Apache Flink: The Elegant
    Data Integration use case
    What we loved using Flink
    What's Next?

  • BOUYGUES TELECOM
    14M clients
    11.4M mobile subscribers
    2.6M fixed customers
    A very innovative company:
    Leader in 4G/4G+/UHMD
    First Android-based TV box
    Mobile . Fixed . TV . Internet . Cloud

  • BOUYGUES TELECOM
    [Chart: nPerf mobile data networks global score 2G/3G/4G (Q3 2015), comparing Bouygues Telecom, Orange, Free, and SFR]
    [Chart: population covered in 4G (71%, 72%, 33%, 53%), for Bouygues Telecom, Orange, Free, and SFR]

  • AT THE HUB OF OUR 14 MILLION CUSTOMERS' DIGITAL LIVES
    AND WE GIVE THEM GENUINE REASONS TO STAY LOYAL

  • LUX: Logged User eXperience (Mobile QoE)
    Produce mobile QoE indicators from massive network equipment event logs (4 billion/day).
    Goals:
    QoE (user) instead of QoS (machine).
    Real-time diagnostics (…)

  • LUX: Logged User eXperience (Mobile QoE)
    [Diagram: many "logs" sources streaming into the LUX pipeline]
    Challenges

  • Challenges
    1. Data movement
    [Diagram: many "logs" sources that must be moved into the platform]

  • Challenges
    1. Data movement
    2. Data processing
    Data is generally too raw to be used directly.
    How can we transform it?
    How can we make the results available as soon as possible?

  • Typical Data Flow pipeline on Hadoop
    [Diagram: data sources 1..x feed clients 1..x, which load raw data onto the Hadoop DFS via FTP, HTTP, Sqoop, Flume, and FS PUT; successive batch jobs turn the raw data into enriched data V1..Vx on the DFS, served through Impala, Hive, and HBase to downstream systems 1..x]

  • THE WORLD MOVES FAST
    DATA MUST MOVE FASTER

  • Real-time Data Integration
    Inspired by LinkedIn's Kafka data integration design pattern:
    take all the data and put it into a central log repository for real-time subscription.
    What if the data is too raw to be used, even binary-encoded, with no visible business-logic information?

  • Real-time Data Integration
    Solution 01: Process data before pushing it to Kafka.
    Not viable:
    Data sources have limited computation resources dedicated to log collection.
    Not scalable.
    Too hard to maintain.
    We have to push it raw (a minimal producer sketch follows).
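    As an illustration of the "push it raw" side, here is a minimal sketch using the plain Kafka producer API. It is not code from the talk: the topic name "raw-logs", the broker addresses, and readNextBinaryRecord() are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RawLogCollector {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // illustrative brokers
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.ByteArraySerializer");

            try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
                byte[] rawRecord = readNextBinaryRecord(); // hypothetical equipment-log read
                // No decoding, no business logic: the record is published as-is.
                producer.send(new ProducerRecord<>("raw-logs", rawRecord));
            }
        }

        private static byte[] readNextBinaryRecord() {
            return new byte[0]; // placeholder for the real binary log reader
        }
    }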

  • Real-time Data Integration
    Solution 01: Process data before pushing it to Kafka.
    Solution 02: The consumers/data subscribers have to process data before using it.
    Drawbacks:
    All the consumers must implement the same business logic and run it against the same data.
    Any change/update in the processing logic requires an update of all the consumers.
    What we want: provide a usable data format to each consumer/data subscriber.

  • Real-time Data Integration
    Solution 01: Process data before pushing it to Kafka.
    Solution 02: The consumers/data subscribers have to process data before using it.
    Solution 03: Process Kafka's raw data and push it back in decoded/enriched format for subscribers.
    Benefits:
    The business logic is implemented in one place.
    Resource efficient.
    Data subscribers can focus on their own business logic.
    Simple handling of source/client evolution.
    Challenges:
    Keep the data moving in real time.
    The data processing pipeline must be very fast (see the sketch below).
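    A minimal sketch of Solution 03 with Flink's DataStream API: read the raw topic, apply the shared decode/enrich logic in exactly one place, and publish the result back to Kafka. This is not the talk's code; connector class names vary across Flink versions (this uses the Flink 1.x FlinkKafkaConsumer/FlinkKafkaProducer), and the topic names and decodeAndEnrich() are illustrative.

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    public class EnrichmentJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "broker1:9092"); // illustrative
            props.setProperty("group.id", "enrichment-job");

            env.addSource(new FlinkKafkaConsumer<>(
                    "raw-topic", new SimpleStringSchema(), props))
               // The shared business logic lives here, and only here.
               .map(EnrichmentJob::decodeAndEnrich)
               .addSink(new FlinkKafkaProducer<>(
                    "enriched-topic", new SimpleStringSchema(), props));

            env.execute("Real-time enrichment");
        }

        private static String decodeAndEnrich(String raw) {
            return raw; // placeholder for decoding + reference-data lookup
        }
    }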

  • Real-time Data Integration
    [Diagram: clients 1..x send raw data from data sources 1..x to a Collector App, whose Kafka producer writes to a raw Kafka topic; streaming jobs consume the raw topic and publish enriched Kafka topics V1..Vx; subscribers include an archive, systems 1..x, and Hadoop, where raw and enriched data V1..Vx land on the DFS and are served through Impala, Hive, and HBase]

  • Real-time Data Integration with Apache Kafka and Spark?
    Started a POC on Spark Streaming.
    It didn't answer our needs:
    Poor back-pressure handling: jobs kept failing with OOM during busy hours.
    Micro-batching & latency.
    Many configuration parameters.
    The tested version used an HDFS WAL as its fault-tolerance mechanism, but this should be handled by Kafka.

  • Apache Flink: The Elegant
    An open source platform for distributed stream and batch data processing.
    True streaming, no more micro-batching!
    Nice back-pressure handling.
    Fault tolerant, exactly-once processing.
    High throughput.
    Scalable.
    Rich functional APIs.
    (Almost) no constraints on serialization.
    Control of parallelism at all execution levels (see the sketch after this list).
    Flexibility and ease of extension.
    And more nice stuff.
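    A small sketch of two of the listed points, the local execution mode and per-operator parallelism control, using Flink's DataStream API. The job itself is a toy example, not from the talk.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismDemo {
        public static void main(String[] args) throws Exception {
            // Local execution mode: the same job runs inside a single JVM,
            // which is what makes testing and development cheap.
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment();

            env.setParallelism(4); // job-wide default parallelism

            env.fromElements("a", "b", "c")
               .map(String::toUpperCase).setParallelism(2) // per-operator override
               .print().setParallelism(1);                 // single sink instance

            env.execute("parallelism demo");
        }
    }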

  • IoT / Mobile Applications
    [Diagram: events occur on devices → events are stored in a log (queue/log) → events are analyzed in a data streaming system (stream analysis)]

  • LUX: Logged User eXperience

  • LUX: Logged User eXperience
    4 billion raw events/day
    700 GB/day (raw data, Snappy-compressed)
    100 data sources
    6 main data types
    26 Kafka topics: 6 raw, 20 enriched
    CDH5 cluster: a 2-broker Kafka cluster, 20 DataNodes, 750 TB

  • Mobile CDR use case
    [Diagram: each network equipment machine generates a binary file every 5 min or 2 MB; a Planck-Collector publishes it to the CDR_BIN topic; a Binary Decoder produces CDR_DECODED; a Common CDR Enricher, looking up live reference data, produces CDR_ENRICHED; an Elasticsearch formatter produces CDR_ENRICHED_ELASTIC, indexed into Elasticsearch via K2ES/Logstash with 2 weeks' retention; binary, decoded, and enriched CDRs plus reference data are archived on HDFS; 15-min window counters feed alarms/live counters into Zabbix; KPI views and historical data on HDFS serve other IT systems (commercial, ...); topics are replicated with Kafka mirroring]
    (A sketch of the window-counter stage follows.)
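    A minimal sketch, under assumed types, of a "15 min window counters" stage in Flink's DataStream API: counting enriched CDRs per cell over tumbling 15-minute windows. The Tuple2<cellId, count> shape and countPerCell are illustrative, not the talk's actual code.

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowCounters {
        // (cellId, 1L) tuples in, (cellId, count over 15 minutes) tuples out.
        public static DataStream<Tuple2<String, Long>> countPerCell(
                DataStream<Tuple2<String, Long>> enrichedCdrs) {
            return enrichedCdrs
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> cdr) {
                        return cdr.f0; // key by cell identifier
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.minutes(15)))
                .sum(1); // sum the count field within each window
        }
    }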

  • And it Rocks!
    We ran stress tests on our biggest raw Kafka topic:
    A day of data.
    2 billion events (480 GiB compressed).
    10 Kafka partitions.
    10 Flink TaskManagers (only 1 GB of memory each).
    [Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]

  • And it Rocks!
    Results of the same stress test:
    500,000 events/sec.
    1 day of data processed in 1 hour.
    Less than 200 ms processing time.
    [Charts: enrich rate (in tickets/second), total processing time (ms), Kafka I/O duration (ms)]
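    As a sanity check, these figures are mutually consistent: 2 billion events in about one hour is 2,000,000,000 / 3,600 s ≈ 555,000 events/sec, in line with the reported 500,000 events/sec.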

  • What we loved using Flink / Notable features
    Development cost.
    Ease of testing & development.
    Works exactly the way you expect it to work.
    Local execution mode.
    No more OOM.
    Efficient resource management.
    Excellent performance even with limited resources.

  • What we loved using Flink / Notable features
    True streaming from different sources, including Kafka.
    Exactly-once, low-latency, high-throughput stream processing.
    YARN mode features:
    yarn.maximum-failed-containers
    YARN detached mode
    Many thanks, data Artisans! Thank you very much!

  • What's Next?
    Connect LUX to new sources.
    Use JobManager high availability.
    Archive data on HDFS using the new filesystem sink.
    Index Elasticsearch data using the new Elasticsearch sink.
    Flink ML.
    Contributions to the Flink project.

  • Questions