Building a Real-Time Data Pipeline with Spark, Kafka, and Python


Post on 15-Apr-2017


TRANSCRIPT

  • Douglas Butler, Product Manager

  • A massively parallel, lock-free, FAST distributed SQL database

    in-memory and on-disk, ACID

    JSON and geospatial, transactions and analytics

  • 2 Minute Install

  • A Simple Pipeline

  • from pystreamliner.api import Extractor

    class CustomExtractor(Extractor):
        def initialize(self, streaming_context, sql_context, config, interval, logger):
            logger.info("Initialized Extractor")

        def next(self, streaming_context, time, sql_context, config, interval, logger):
            rdd = streaming_context._sc.parallelize([[x] for x in range(10)])
            return sql_context.createDataFrame(rdd, ["number"])
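To make the lifecycle of that extractor concrete without requiring a Spark cluster, here is a minimal sketch in plain Python that stubs out the Spark pieces: a stand-in `Extractor` base class (assumed to mirror the `pystreamliner.api.Extractor` interface on the slide), a logger replaced by a list, and `next()` returning plain row dicts instead of a DataFrame. Streamliner is assumed to call `initialize()` once and then `next()` on every batch interval.

```python
class Extractor:
    """Hypothetical stand-in for pystreamliner.api.Extractor."""

    def initialize(self, streaming_context, sql_context, config, interval, logger):
        pass  # called once, before the first batch

    def next(self, streaming_context, time, sql_context, config, interval, logger):
        raise NotImplementedError  # called once per batch interval


class NumberExtractor(Extractor):
    def initialize(self, streaming_context, sql_context, config, interval, logger):
        logger.append("Initialized Extractor")  # stands in for logger.info(...)

    def next(self, streaming_context, time, sql_context, config, interval, logger):
        # Emulates sql_context.createDataFrame(rdd, ["number"]):
        # one batch with a single "number" column holding 0..9.
        return [{"number": x} for x in range(10)]


log = []
ext = NumberExtractor()
ext.initialize(None, None, {}, 1, log)
batch = ext.next(None, 0, None, {}, 1, log)
```

The shape is the same as the slide's code: all work happens in `next()`, which must hand back one batch of rows each time it is invoked.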

  • > memsql-ops pip install [package]

    distributed cluster-wide

    any Python package

    bring your own

  • Real-time pipeline
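The real-time pipeline on the slide pairs a message bus (Kafka) with a micro-batch consumer (Spark Streaming). A toy sketch of that shape, with a plain `queue.Queue` standing in for the Kafka topic and fixed-size batches standing in for the streaming batch interval (names and sizes here are illustrative, not from the talk):

```python
import queue


def produce(topic, events):
    """Producer side: push events onto the topic (Kafka stand-in)."""
    for event in events:
        topic.put(event)


def consume_batches(topic, batch_size):
    """Consumer side: drain the topic in fixed-size micro-batches."""
    batches, batch = [], []
    while not topic.empty():
        batch.append(topic.get())
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    if batch:  # flush the final partial batch
        batches.append(batch)
    return batches


topic = queue.Queue()
produce(topic, range(10))
batches = consume_batches(topic, batch_size=4)
totals = [sum(b) for b in batches]  # one aggregate per micro-batch
```

Each batch is processed as a unit, which is the same micro-batch model Spark Streaming applies to records pulled from a Kafka topic.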

  • Q & A time