Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra Roberto G. Hashioka – 2016-10-04 – TIAD – Paris


TRANSCRIPT

Page 1: TIAD 2016 : Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra

Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra

Roberto G. Hashioka – 2016-10-04 – TIAD – Paris

Page 2:

Personal Information

• Roberto Gandolfo Hashioka

• @rogaha (GitHub) and @rhashioka (Twitter)

• Finance -> Software Engineer

• Growth & Data Engineer at Docker

Page 3:

Summary

• Background / Motivation

• Project Goals

• How to build it?

• DEMO

Page 4:

Background

• Gather data from multiple sources and process it in "real time"

• Transform raw data into meaningful, useful information that enables more effective decision-making

• Provide more visibility into trends in: 1) user behavior, 2) feature engagement, 3) opportunities for future investments

• Data transparency and standardization

Page 5:

Project Goals

• Create a data processing pipeline that can handle a large volume of events per second

• Automate the development environment — Docker Compose.

• Automate the remote machines management — Docker for AWS / Machine.

• Reduce the time to market / time to development — New hires / new features.

Page 6:

Project / Language Stack

Page 7:

How to build it?

• Step 1: Install Docker for Mac/Win and dockerize all the applications

link: https://www.docker.com/products/docker

Page 8:

Example Dockerfile
------------------------------------------------------------------------------------------------------------

FROM ubuntu:14.04

MAINTAINER Roberto Hashioka ([email protected])

RUN apt-get update && apt-get install -y nginx

RUN echo "Hello World! #TIAD" > /usr/share/nginx/html/index.html

EXPOSE 80

# Run nginx in the foreground so the detached container keeps running
CMD ["nginx", "-g", "daemon off;"]

------------------------------------------------------------------------------------------------------------

$ docker build -t rogaha/web_demotiad2016 .
$ docker run -d -p 80:80 --name web_demotiad2016 rogaha/web_demotiad2016

Page 9:

How to build it?

• Step 2: Define your services stack with a docker-compose file

Page 10:

Docker Compose

web:
  build: .
  command: python app.py
  ports:
    - "5000:5000"
  volumes:
    - .:/code
  links:
    - redis
  environment:
    - PYTHONUNBUFFERED=1

redis:
  image: redis:latest
  command: redis-server --appendonly yes
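The compose file starts the web service with `python app.py`, but the app itself is not shown in the deck. Below is a minimal stand-in sketch: it serves a hit counter over HTTP like the classic Compose demo, but uses only the Python standard library in place of Flask and the linked Redis container, so everything here (names, port handling, counter) is an illustrative assumption, not the project's actual code.

```python
# Minimal stand-in for the web service's app.py. The real service would
# run Flask on port 5000 and keep its counter in the linked Redis
# container; this sketch is stdlib-only so it runs anywhere.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

hits = 0  # stand-in for a Redis-backed counter

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global hits
        hits += 1
        body = f"Hello TIAD! Hits: {hits}\n".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo output quiet

# Bind to an ephemeral port and serve from a background thread
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_address[1]}/"
reply = urllib.request.urlopen(url).read().decode()
print(reply)  # -> Hello TIAD! Hits: 1
server.shutdown()
```

Inside Compose, the `links: - redis` entry would make the counter store reachable at hostname `redis`, which is what the real app.py would connect to.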

Page 11:

How to build it?

• Step 3: Test the applications locally from your laptop using containers

Page 12:

How to build it?

Page 13:

How to build it?

• Step 4: Provision your remote servers and deploy your containers

Page 14:

How to build it?

Page 15:

How to build it?

• Step 5: Scale your services with Docker swarm

Page 16:

DEMO

source code: https://github.com/rogaha/data-processing-pipeline

Page 17:

Open Source Projects Used

• Docker (https://github.com/docker/docker)

• An open platform for distributed applications for developers and sysadmins

• Apache Spark / Spark SQL (https://github.com/apache/spark)

• A fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD)

• Apache Kafka (https://github.com/apache/kafka)

• A fast and scalable pub-sub messaging service

• Apache Zookeeper (https://github.com/apache/zookeeper)

• A distributed configuration service, synchronization service, and naming registry for large distributed systems

• Apache Cassandra (https://github.com/apache/cassandra)

• A scalable, highly available, distributed columnar NoSQL database

• D3 (https://github.com/mbostock/d3)

• A JavaScript visualization library for HTML and SVG.
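The components above are wired as Kafka → Spark → Cassandra → D3: events are published to a topic, aggregated in micro-batches, and the aggregates are stored for the dashboard to query. The data flow can be sketched in plain Python with standard-library stand-ins (a queue for the Kafka topic, a dict for the Cassandra table); every name here is illustrative, not the project's actual code.

```python
# Toy sketch of the pipeline's data flow: publish events to a topic
# (Kafka), aggregate them in a micro-batch (as Spark Streaming would),
# and upsert the per-type counts into a table (Cassandra).
import json
import queue
from collections import Counter

topic = queue.Queue()  # stand-in for a Kafka topic
table = {}             # stand-in for a Cassandra table keyed by event type

def produce(event_type, payload):
    """Producer: publish one JSON-encoded event to the topic."""
    topic.put(json.dumps({"type": event_type, "payload": payload}))

def process_batch():
    """Consumer: drain the topic, count events per type, upsert counts."""
    counts = Counter()
    while not topic.empty():
        event = json.loads(topic.get())
        counts[event["type"]] += 1
    for event_type, n in counts.items():
        table[event_type] = table.get(event_type, 0) + n  # upsert

produce("page_view", {"path": "/"})
produce("page_view", {"path": "/pricing"})
produce("signup", {"plan": "free"})
process_batch()
print(table)  # -> {'page_view': 2, 'signup': 1}
```

In the real stack each stage is a separate container from the compose file, and D3 renders whatever the query layer reads back from Cassandra.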

Page 18:

Thanks! Questions?

@rhashioka