Building a Real-Time Data Pipeline with Spark, Kafka, and Python


Upload: memsql

Posted on 15-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Building a Real-Time Data Pipeline with Spark, Kafka, and Python
Page 2: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

Douglas Butler, Product Manager

Page 4: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

massively parallel, lock free, FAST

distributed SQL database

in-memory, on-disk

ACID

JSON and geospatial

transactions and analytics

Page 9: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

2 Minute Install

Page 11: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

A Simple Pipeline

Page 12: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

from pystreamliner.api import Extractor

class CustomExtractor(Extractor):
    def initialize(self, streaming_context, sql_context, config, interval, logger):
        logger.info("Initialized Extractor")

    def next(self, streaming_context, time, sql_context, config, interval, logger):
        # Build a single-column DataFrame holding the numbers 0 through 9.
        rdd = streaming_context._sc.parallelize([[x] for x in range(10)])
        return sql_context.createDataFrame(rdd, ["number"])
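
The slide shows only the extractor itself, not how it gets called. The sketch below is one way to picture the calling pattern, assuming the framework calls initialize once and then next once per batch interval; the slide does not confirm this. RangeExtractor and run_batches are hypothetical names, not part of pystreamliner, and plain Python lists stand in for Spark DataFrames so the sketch runs without a cluster.

```python
import logging


class RangeExtractor:
    """Hypothetical stand-in that mirrors the method names on the slide.

    It returns plain lists of rows instead of Spark DataFrames.
    """

    def initialize(self, config, logger):
        logger.info("Initialized Extractor")

    def next(self, batch_time, config, logger):
        # One single-column row per number, like [[x] for x in range(10)].
        return [[x] for x in range(config.get("rows", 10))]


def run_batches(extractor, num_batches, interval_s, config=None, logger=None):
    """Hypothetical driver: initialize once, then call next() once per batch."""
    config = config or {}
    logger = logger or logging.getLogger("pipeline-sketch")
    extractor.initialize(config, logger)
    batches = []
    for i in range(num_batches):
        batch_time = i * interval_s  # simulated batch timestamp, in seconds
        rows = extractor.next(batch_time, config, logger)
        batches.append((batch_time, rows))
    return batches
```

Calling run_batches(RangeExtractor(), num_batches=3, interval_s=1) returns three (batch_time, rows) pairs, each holding the same ten single-column rows, which is the shape the slide's extractor would emit on every interval.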

Page 15: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

> memsql-ops pip install [package]

distributed cluster-wide

any Python package

bring your own

Page 16: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

Real-time pipeline

Page 17: Building a Real-Time Data Pipeline with Spark, Kafka, and Python

Q & A time