O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines Through In-Memory Architectures. Ben Lorica, Chief Data Scientist, O'Reilly Media (@bigdata); Eric Frenkiel, CEO & Co-Founder, MemSQL (@ericfrenkiel)

Upload: memsql

Post on 16-Jan-2017

Page 1: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines Through In-Memory Architectures

Ben Lorica, Chief Data Scientist, O'Reilly Media, @bigdata

Eric Frenkiel, CEO & Co-Founder, MemSQL, @ericfrenkiel

Page 2: O'Reilly Media Webcast: Building Real-Time Data Pipelines

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Page 3: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Going Real-Time is the Next Phase for Big Data

More Sensors

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

Page 4: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Expensive

Not scalable

Batch only

SAN-burdened

1%

Page 5: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Success will be driven by real-time analytic applications

Page 6: O'Reilly Media Webcast: Building Real-Time Data Pipelines

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Page 7: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Speed

Serving

Batch

Fast Updates

Unified queries, full SQL

Fast Appends

A Fresh Look at Lambda Architectures

Page 8: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Transactions

Page 9: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Transactions

Page 10: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Analytics

Transactions

Page 11: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Page 12: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Execution engine that spans the data spectrum

Page 13: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Comprehensive Architecture

Real Time
Speed/Streaming Layer
Fast Updates
Rowstore

Historical
Batch Layer
Fast Appends
Columnstore

Analytics

Transactions

Page 14: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Simplified Lambda Architectures with MemSQL

Layer     Traditional Lambda    MemSQL Lambda
Batch     Hadoop                MemSQL Column Store
Speed     Storm, Spark          Kafka > Spark > MemSQL
Serving   Cassandra, HBase      MemSQL
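
The simplification in the table rests on one SQL engine answering queries over both recent and historical data. A minimal sketch of that idea follows, using Python's built-in sqlite3 purely as a stand-in for MemSQL. The table names `recent_events` and `historical_events`, the column names, and the sample rows are all hypothetical.

```python
import sqlite3

# sqlite3 is only a stand-in here; tables and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_events (sensor_id INTEGER, reading INTEGER)")      # plays the speed layer
conn.execute("CREATE TABLE historical_events (sensor_id INTEGER, reading INTEGER)")  # plays the batch layer
conn.executemany("INSERT INTO recent_events VALUES (?, ?)", [(1, 60), (2, 40)])
conn.executemany("INSERT INTO historical_events VALUES (?, ?)", [(1, 100), (2, 200), (1, 50)])

def total_reading_by_sensor(conn):
    """Answer one question with one SQL statement that spans both layers."""
    return conn.execute("""
        SELECT sensor_id, SUM(reading)
        FROM (SELECT sensor_id, reading FROM recent_events
              UNION ALL
              SELECT sensor_id, reading FROM historical_events)
        GROUP BY sensor_id
        ORDER BY sensor_id
    """).fetchall()

totals = total_reading_by_sensor(conn)
```

In a traditional Lambda stack the same question would typically need separate queries against the batch and speed systems plus application code to merge them.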

Page 15: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Page 16: O'Reilly Media Webcast: Building Real-Time Data Pipelines

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Kafka
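
To make the publish/subscribe idea concrete, here is a toy in-memory broker. It is not the Kafka client API: the class `ToyBroker`, its methods, and the topic name `sensor-readings` are hypothetical. It only shows the shape of the model, where each topic is an append-only log and each consumer group tracks its own read position.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a message broker; not the Kafka client API."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (topic, group) -> next offset to read

    def publish(self, topic, message):
        self._logs[topic].append(message)

    def poll(self, topic, group):
        """Return messages this group has not yet seen and advance its offset."""
        start = self._offsets[(topic, group)]
        batch = self._logs[topic][start:]
        self._offsets[(topic, group)] = len(self._logs[topic])
        return batch

broker = ToyBroker()
broker.publish("sensor-readings", b"event-1")
broker.publish("sensor-readings", b"event-2")

first = broker.poll("sensor-readings", group="spark")      # both events
second = broker.poll("sensor-readings", group="spark")     # nothing new
other = broker.poll("sensor-readings", group="dashboard")  # independent offset
```

Because offsets are kept per group, several downstream systems can read the same topic without interfering with one another, which is what lets a message queue act as centralized data transport.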

Page 17: O'Reilly Media Webcast: Building Real-Time Data Pipelines

In-memory execution engine

High level operators for procedural and programmatic analytics

Faster than MapReduce

Spark

Page 18: O'Reilly Media Webcast: Building Real-Time Data Pipelines

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

MemSQL

Page 19: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Lambda Applies to Real-Time Data Pipelines

Message Queue

Batch

Inputs

Database

Transformation

Application

Page 20: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Kafka, Spark, and MemSQL Make it Simple

Batch

Inputs Application

Page 21: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Put Apache Spark in the fast lane with MemSQL Streamliner

Page 22: O'Reilly Media Webcast: Building Real-Time Data Pipelines

One click deployment of integrated Apache Spark

Put Spark in the Fast Lane
• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation

Eliminates batch ETL

Open source on GitHub

Introducing the MemSQL Streamliner

Page 23: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Simple Deployment Process

Application

Page 24: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Cluster

1. Deploy MemSQL

In-Memory | Distributed | Relational

Application

Page 25: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Cluster

2. Deploy Spark

Application

Page 26: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Cluster

Kafka Connects to Each Node

Application

Page 27: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Streamliner Architecture

First of many integrated Apache Spark solutions

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Page 28: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Streamliner ETL Detail

Other Real-Time Data Sources

Application

Apache Spark

Future Solution

Future Machine Learning Solution

STREAMLINER

Custom

Future Extractor

JSON

Custom

Future Transformer

STREAMLINER

Extract Transform Load
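
The extract, transform, and load stages can be sketched as three small functions composed into one pipeline. This is an illustration in Python, not Streamliner's actual interface: every function name here is hypothetical, and a plain list stands in for the database table.

```python
import json

def extract_json(raw_messages):
    """Extract: decode each raw message (bytes) as a JSON object."""
    return [json.loads(m.decode("utf-8")) for m in raw_messages]

def transform_to_rows(records, columns):
    """Transform: project each record onto an ordered tuple of columns."""
    return [tuple(r.get(c) for c in columns) for r in records]

def load_rows(rows, table):
    """Load: append rows to a list standing in for a database table."""
    table.extend(rows)
    return len(rows)

def run_pipeline(raw_messages, columns, table):
    """Chain the three stages; returns the number of rows loaded."""
    return load_rows(transform_to_rows(extract_json(raw_messages), columns), table)

table = []
raw = [
    b'{"house_id": 329280, "watts": 60}',
    b'{"house_id": 7, "watts": 15, "extra": true}',  # invented sample record
]
loaded = run_pipeline(raw, ["house_id", "watts"], table)
```

Keeping each stage a separate function is what makes a custom extractor or transformer swappable without touching the other two stages.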

Page 29: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Streamliner

Page 30: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Extract

Page 31: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Transform

Page 32: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Load

Page 33: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Streamliner: Dynamic Resource Management

Without Streamliner

Pipeline 1
Pipeline 2
Driver (P1 only)
Driver (P2 only)
Spark Worker
Spark Worker
Executor (P1 only)
Executor (P1 only)
Executor (P2 only)
Executor (P2 only)

With Streamliner

All Pipelines
Streamliner Driver…
Spark Worker
Spark Worker
Executor (P1 or P2)
Executor (P1 or P2)
Executor (P1 or P2)
Executor (P1 or P2)

Page 34: O'Reilly Media Webcast: Building Real-Time Data Pipelines

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Page 35: O'Reilly Media Webcast: Building Real-Time Data Pipelines

One Architecture for Many Applications

Page 36: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Monitoring real-time Xfinity programming and video health

Page 37: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Collect streaming data at scale (hundreds of MemSQL machines)

Proactively diagnose issues

Query ad hoc and in real time with full SQL

From 30 minutes to less than 1 second

Real-time Analytics

Page 38: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Real-Time Trend Analytics

Page 39: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Massive Ingest and Concurrent Analytics

Instant accuracy to the latest repin

Build real-time analytic applications

Real-time analytics

Page 40: O'Reilly Media Webcast: Building Real-Time Data Pipelines
Page 41: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Real-Time Segmentation

Page 42: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Using Real-Time for Personalization

Ad Servers EC2

Real-time analytics

PostgreSQL

Legacy reports

Monitoring S3 (replay)

HDFS

Data Science

Vertica

Operational Data Store (ODS)

Star Schema MicroStrategy

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

Page 43: O'Reilly Media Webcast: Building Real-Time Data Pipelines

MemCity

Capturing data from 1.4 million households

Total AWS hardware costs at $2.35 per hour

Page 44: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
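
The slide shows a reading turned into bytes before it is published. The sketch below shows one way that step might look. The wire layout (a big-endian 64-bit epoch-millisecond timestamp followed by three unsigned 32-bit integers) is an assumption made for illustration, not the format used in the demo. The field order (time, house_id, device_id, watts) is inferred from the table columns shown later in the deck.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Assumed layout: int64 epoch milliseconds, then house_id, device_id, watts as uint32.
EVENT_FORMAT = ">qIII"

def serialize(event):
    """Pack a (time, house_id, device_id, watts) tuple into a fixed-size payload."""
    ts, house_id, device_id, watts = event
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    millis = (dt - EPOCH) // timedelta(milliseconds=1)  # exact integer division
    return struct.pack(EVENT_FORMAT, millis, house_id, device_id, watts)

event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)
payload = serialize(event)  # 20 bytes, ready to publish to a topic
```

A fixed-size binary layout keeps messages small and cheap to parse. A self-describing format such as JSON would trade some of that for flexibility.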

Page 45: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Enrich and Transform the Data

Spark polling Kafka for new messages

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)

Deserialization

Enrichment

0111001010101111101111100000001010111100001110101100000010010010111…
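
Deserialization and enrichment can be sketched as two small functions. The binary layout is the same illustrative assumption as before (int64 epoch milliseconds plus three uint32 fields), and the lookup tables `HOUSE_ZIP` and `DEVICE_TYPE` are hypothetical reference data. The slide does not say where the zip code and device type come from.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
EVENT_FORMAT = ">qIII"  # assumed layout: epoch millis, house_id, device_id, watts

# Hypothetical reference data used for enrichment.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}

def deserialize(payload):
    """Unpack bytes into a (time, house_id, device_id, watts) tuple."""
    millis, house_id, device_id, watts = struct.unpack(EVENT_FORMAT, payload)
    dt = EPOCH + timedelta(milliseconds=millis)
    ts = dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 10000:02d}Z"
    return (ts, house_id, device_id, watts)

def enrich(event):
    """Add zip and device_type; unknown keys yield None."""
    ts, house_id, device_id, watts = event
    return (ts, house_id, HOUSE_ZIP.get(house_id),
            device_id, DEVICE_TYPE.get(device_id), watts)

when = datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc)
millis = (when - EPOCH) // timedelta(milliseconds=1)
payload = struct.pack(EVENT_FORMAT, millis, 329280, 23, 60)
enriched = enrich(deserialize(payload))
```

In a Spark job, functions like these would typically be applied to each message in a micro-batch, for example through a map over the stream.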

Page 46: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                     house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z  329280    94110  23         ‘kitchen_appliance’  60
…                        …         …      …          …                    …

Page 47: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
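
Once rows are persisted, the application queries them with ordinary SQL. The sketch below uses Python's built-in sqlite3 as a stand-in for a MemSQL connection. The column types, the extra sample rows, and the aggregate query are illustrative assumptions and are not the statements elided on the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute("""
    CREATE TABLE memcity_table (
        time TEXT, house_id INTEGER, zip INTEGER,
        device_id INTEGER, device_type TEXT, watts INTEGER
    )
""")
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    # The two rows below are invented sample data.
    ("2015-07-06T16:43:41.10Z", 329281, 94110, 7, "lighting", 15),
    ("2015-07-06T16:43:41.52Z", 329282, 94107, 23, "kitchen_appliance", 80),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)

# Example analytic query: total power draw per zip code.
watts_by_zip = conn.execute(
    "SELECT zip, SUM(watts) FROM memcity_table GROUP BY zip ORDER BY zip"
).fetchall()
```

Because the freshly inserted rows are immediately visible to the query, the same table serves both ingest and analytics without a separate batch load step.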

Page 48: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines and Predictive Applications

Page 49: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Adding Real-Time Scoring to Predictive Applications

Streamliner

Input

User Jar

SAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1
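
The pairing of each sensor with its own predictive model can be sketched as a lookup followed by a function call. The models below are hypothetical linear functions chosen for illustration. A real deployment would load the scoring logic from the exported PMML rather than hard-code it.

```python
# Hypothetical per-sensor models; in practice these would come from PMML.
MODELS = {
    "S1": lambda x: 0.5 * x + 1.0,
    "S2": lambda x: 2.0 * x,
    "S3": lambda x: x - 3.0,
}

def score(readings):
    """Score (sensor_id, value) pairs with the model registered for each sensor.

    Readings from sensors without a registered model are skipped.
    """
    return [
        (sensor, value, MODELS[sensor](value))
        for sensor, value in readings
        if sensor in MODELS
    ]

scored = score([("S1", 4.0), ("S2", 1.5), ("S9", 7.0)])
```

Scoring inside the streaming transform means each reading arrives in the database already annotated with its prediction.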

Page 50: O'Reilly Media Webcast: Building Real-Time Data Pipelines

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Page 51: O'Reilly Media Webcast: Building Real-Time Data Pipelines

MemSQL at a Glance

• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture

Enterprise Focus

Page 52: O'Reilly Media Webcast: Building Real-Time Data Pipelines

The Real-Time Database for Transactions and Analytics 

In-Memory | Distributed | Relational

Data Center | Software | Cloud

Page 53: O'Reilly Media Webcast: Building Real-Time Data Pipelines

MemSQL for the Spectrum of Transactions

Each Transaction Paramount

Guarantee that every individual transaction is persisted

No individual transaction can be lost
• Financial credits and debits
• Inventory movement
• Employee status

Transactional Aggregates Paramount

Capture massive event streams for immediate analysis

Transaction repetition/redundancy at the device level
• Event data and clickstreams
• Sensor data, Internet of Things
• Mobile applications
• Real-time streams

Page 54: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Gartner Magic Quadrant for ODBMS

Leading Relational Database in Visionaries Quadrant

Page 55: O'Reilly Media Webcast: Building Real-Time Data Pipelines

Forrester Wave: In-Memory Database Platforms

MemSQL Named Strong Performer

Page 56: O'Reilly Media Webcast: Building Real-Time Data Pipelines
Page 57: O'Reilly Media Webcast: Building Real-Time Data Pipelines

GET YOUR FREE COPY: memsql.com/oreilly