O'Reilly Media Webcast: Building Real-Time Data Pipelines

Building Real-Time Data Pipelines Through In-Memory Architectures

Ben Lorica, Chief Data Scientist, O'Reilly Media, @bigdata

Eric Frenkiel, CEO & Co-Founder, MemSQL, @ericfrenkiel

What’s In Store

Why In-Memory for Real Time

Using an In-Memory Database with Spark and Kafka

Real-Time Use Cases and Demonstrations

About MemSQL

Going Real-Time is the Next Phase for Big Data

More Sensors

More Interconnectivity

More User Demand

…and companies are at risk of being left behind

Expensive | Not scalable | Batch only | SAN-burdened

1%

Success will be driven by real-time analytic applications


A Fresh Look at Lambda Architectures

Speed layer: fast updates

Batch layer: fast appends

Serving layer: unified queries, full SQL

Comprehensive Architecture

Real time (speed/streaming layer): fast updates, rowstore

Historical (batch layer): fast appends, columnstore

Execution engine that spans the data spectrum

[Diagram: rowstore and columnstore spanning transactions and analytics]

Simplified Lambda Architectures with MemSQL

Layer   | Traditional Lambda | MemSQL Lambda
Batch   | Hadoop             | MemSQL Column Store
Speed   | Storm, Spark       | Kafka > Spark > MemSQL
Serving | Cassandra, HBase   | MemSQL

Designing the Ideal Real-Time Pipeline

Message Queue > Transformation > Speed/Serving Layer

End-to-End Data Pipeline Under One Second

Kafka

A high-throughput distributed messaging system

Publish and subscribe to Kafka “topics”

Centralized data transport for the organization

Spark

In-memory execution engine

High-level operators for procedural and programmatic analytics

Faster than MapReduce

MemSQL

In-memory, distributed database

Full transactions and complete durability

Enable real-time, performant applications

Lambda Applies to Real-Time Data Pipelines

[Diagram labels: Inputs, Message Queue, Transformation, Database, Application, Batch]

Kafka, Spark, and MemSQL Make it Simple

Batch

Inputs Application

Introducing the MemSQL Streamliner

Put Apache Spark in the fast lane with MemSQL Streamliner

One-click deployment of integrated Apache Spark

• GUI pipeline setup
• Multiple data pipelines
• Real-time transformation

Eliminates batch ETL

Open source on GitHub

Simple Deployment Process

1. Deploy MemSQL

In-Memory | Distributed | Relational

2. Deploy Spark

Kafka Connects to Each Node

[Diagram: application and cluster at each deployment step]

Streamliner Architecture

First of many integrated Apache Spark solutions

Other Real-Time Data Sources

Application

Apache Spark

STREAMLINER | Future Solution | Future Machine Learning Solution

Streamliner ETL Detail

Extract | Transform | Load

Extract: Custom, Future Extractor

Transform: JSON, Custom, Future Transformer

Streamliner: Dynamic Resource Management

Without Streamliner

Pipeline 1: Driver (P1 only), Spark Worker with Executors (P1 only)

Pipeline 2: Driver (P2 only), Spark Worker with Executors (P2 only)

With Streamliner

All Pipelines: one Streamliner Driver, Spark Workers with Executors (P1 or P2)


One Architecture for Many Applications

Monitoring real-time Xfinity programming and video health

Collect streaming data at scale (hundreds of MemSQL machines)

Proactively diagnose issues

Query ad hoc and in real time with full SQL

From 30 minutes to less than 1 second

Real-time Analytics

Real-Time Trend Analytics

Massive ingest and concurrent analytics

Instant accuracy to the latest repin

Build real-time analytic applications

Real-time analytics

Real-Time Segmentation

Using Real-Time for Personalization

Ad Servers | EC2

Real-time analytics

PostgreSQL | Legacy reports

Monitoring | S3 (replay) | HDFS

Data Science

Vertica | Operational Data Store (ODS)

Star Schema | MicroStrategy

Reach overlap and ad optimization

Over 60,000 queries per second

Millisecond response times

MemCity

Capturing data from 1.4 million households

Total AWS hardware costs at $2.35 per hour

Subscribing to Kafka

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

0111001010101111101111100000001010111100001110101100000010010010111…

Publish to Kafka Topic

0111001010101111101111100000001010111100001110101100000010010010111…

1110010101000101010001010100010111111010100011110101100011010101000…

0101111000011100101010111110001111011010111100000000101110101100000…

Event added to message queue
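The publish-and-poll flow on these slides can be sketched with a toy in-memory queue. This is not the Kafka client API; the class name ToyBroker, the topic name "memcity", and the subscriber names are hypothetical and exist only to show how an append-only topic and per-subscriber offsets fit together.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message queue with named topics (not Kafka)."""

    def __init__(self):
        self._logs = defaultdict(list)    # topic name -> append-only message log
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic, message):
        """Append a message to the topic's log and return its offset."""
        log = self._logs[topic]
        log.append(message)
        return len(log) - 1

    def poll(self, topic, subscriber):
        """Return messages this subscriber has not seen yet, then advance its offset."""
        log = self._logs[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]


broker = ToyBroker()
event = ("2015-07-06T16:43:40.33Z", 329280, 23, 60)  # sample event from the slides
broker.publish("memcity", event)                      # hypothetical topic name

first_poll = broker.poll("memcity", "spark")
second_poll = broker.poll("memcity", "spark")
print(first_poll)   # the event published above
print(second_poll)  # empty, because nothing new arrived since the last poll
```

Each subscriber keeps its own offset, so a second consumer polling the same topic would still receive the event.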

Enrich and Transform the Data

Spark polling Kafka for new messages

(2015-07-06T16:43:40.33Z, 329280, 23, 60)

(2015-07-06T16:43:40.33Z, 329280, 94110, 23, 'kitchen_appliance', 60)

Deserialization

Enrichment

0111001010101111101111100000001010111100001110101100000010010010111…
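The deserialization and enrichment steps above might look roughly like the following sketch. The slides do not specify the binary wire format, so the struct layout here is an assumption, and the lookup tables HOUSE_ZIP and DEVICE_TYPE are hypothetical reference data seeded with the values from the sample event.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)

# Assumed wire format (not from the slides): big-endian 64-bit epoch
# milliseconds, then house_id, device_id, watts as unsigned 32-bit integers.
RECORD = struct.Struct(">qIII")

# Hypothetical reference data used for enrichment.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def serialize(ts, house_id, device_id, watts):
    """Pack one event into bytes, as a producer might before publishing."""
    millis = (ts - EPOCH) // ONE_MS  # exact integer milliseconds
    return RECORD.pack(millis, house_id, device_id, watts)


def deserialize(payload):
    """Unpack bytes into (time, house_id, device_id, watts)."""
    millis, house_id, device_id, watts = RECORD.unpack(payload)
    return EPOCH + millis * ONE_MS, house_id, device_id, watts


def enrich(event):
    """Add zip and device_type, giving the six-column row shown on the slide."""
    ts, house_id, device_id, watts = event
    return (ts, house_id, HOUSE_ZIP.get(house_id),
            device_id, DEVICE_TYPE.get(device_id), watts)


raw_event = (datetime(2015, 7, 6, 16, 43, 40, 330000, tzinfo=timezone.utc),
             329280, 23, 60)
payload = serialize(*raw_event)
enriched = enrich(deserialize(payload))
print(enriched)
```

Unknown house or device ids yield None for the added fields rather than raising, which keeps a streaming job running when reference data lags behind.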

Persist and Prepare for Production

RDD.saveToMemSQL()

INSERT INTO memcity_table ...

time                    | house_id | zip   | device_id | device_type         | watts
2015-07-06T16:43:40.33Z | 329280   | 94110 | 23        | 'kitchen_appliance' | 60
…                       | …        | …     | …         | …                   | …
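As a rough illustration of the load step, the sketch below inserts the enriched row into a table named memcity_table. It does not use the connector call shown on the slide; Python's built-in sqlite3 module stands in for MemSQL so the example runs anywhere, and the column types are assumptions.

```python
import sqlite3

# sqlite3 is a stand-in for MemSQL here; the column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)

# The enriched row from the slides.
rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
]

# Parameterized batch insert: one statement, many rows.
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
print(stored)
```

Batching rows into a single executemany call mirrors the idea of saving a whole micro-batch at once instead of issuing one insert per event.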

Go to Production

Compress development timelines

SELECT ... FROM memcity_table ...
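A query an application might run against memcity_table could look like the sketch below. The slide elides the actual SELECT, so this aggregate is only one plausible example. sqlite3 again stands in for MemSQL, and all rows except the first are made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute(
    "CREATE TABLE memcity_table ("
    "time TEXT, house_id INTEGER, zip INTEGER, "
    "device_id INTEGER, device_type TEXT, watts INTEGER)"
)

sample_rows = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),  # from the slides
    ("2015-07-06T16:43:41.00Z", 329280, 94110, 24, "kitchen_appliance", 40),  # made up
    ("2015-07-06T16:43:41.50Z", 100001, 94110, 7, "lighting", 15),            # made up
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# Hypothetical dashboard query: total power draw per zip code and device type.
query = """
SELECT zip, device_type, SUM(watts) AS total_watts
FROM memcity_table
GROUP BY zip, device_type
ORDER BY total_watts DESC
"""
totals = conn.execute(query).fetchall()
for zip_code, device_type, total_watts in totals:
    print(zip_code, device_type, total_watts)
```

Because the serving layer speaks SQL, the same table that receives streaming inserts can answer ad hoc aggregates like this without a separate export step.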

Building Real-Time Data Pipelines and Predictive Applications

Adding Real-Time Scoring to Predictive Applications

Streamliner | Input

User Jar | SAS Generated PMML

Industrial Equipment

Sensor Data

S1 S2 S3 P1 P2 P3

Scoring Real-Time Data with Predictive Models

Sensor 1 Predictive Model 1
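Pairing each sensor with its own predictive model could be sketched as follows. The slides mention a SAS-generated PMML model but give no details, so the logistic-regression form, the coefficients, and the names MODELS and score are all hypothetical.

```python
import math

# Hypothetical per-sensor models (S1 -> P1, S2 -> P2, S3 -> P3). Each is a
# one-feature logistic regression with made-up coefficients.
MODELS = {
    "S1": {"intercept": -4.0, "weight": 0.05},
    "S2": {"intercept": -2.0, "weight": 0.10},
    "S3": {"intercept": -1.0, "weight": 0.02},
}


def score(sensor_id, reading):
    """Return a score in (0, 1) for one sensor reading using that sensor's model."""
    model = MODELS[sensor_id]
    z = model["intercept"] + model["weight"] * reading
    return 1.0 / (1.0 + math.exp(-z))


# Score a small stream of (sensor_id, reading) pairs as they arrive.
stream = [("S1", 80.0), ("S2", 35.0), ("S3", 10.0)]
scored = [(sensor_id, reading, score(sensor_id, reading))
          for sensor_id, reading in stream]
for sensor_id, reading, value in scored:
    print(sensor_id, reading, round(value, 3))
```

In a pipeline, a function like score would run inside the transform step so that each row is stored together with its prediction.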

About MemSQL

MemSQL at a Glance

• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture

Enterprise Focus

The Real-Time Database for Transactions and Analytics

In-Memory Distributed Relational

Data Center | Software | Cloud

MemSQL for the Spectrum of Transactions

Each Transaction Paramount

Guarantee that every individual transaction is persisted

No individual transaction can be lost
• Financial credits and debits
• Inventory movement
• Employee status

Transactional Aggregates Paramount

Capture massive event streams for immediate analysis

Transaction repetition/redundancy at the device level
• Event data and clickstreams
• Sensor data, Internet of Things
• Mobile applications
• Real-time streams

Gartner Magic Quadrant for ODBMS

Leading Relational Database in Visionaries Quadrant

Forrester Wave: In-Memory Database Platforms

MemSQL Named Strong Performer

GET YOUR FREE COPY: memsql.com/oreilly