splout sql - web latency sql views for hadoop

Splout SQLWhen Big Data Output is also Big

Data

Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt

Full SQL*

For Big Data

Web latency & throughput

Unlike NoSQL

Unlike RDBMS

Unlike Impala, Apache Drill, etc.

* Within each partition

How does it work?

Isolation between generation and serving

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

Tablespace CLIENTS_INFOTable CLIENTS

CID Name

U20 Doug

U21 Ted

U40 John

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

S223 U40 99

table CLIENTS

table SALES

Generate tablespace CLIENTS_INFO with partitioned by CID

partitioned by CID

Partition U10 – U35


Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

Generation

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60

SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U20’;

For key = ‘U20’, tablespace=‘CLIENTS_INFO’


Serving


Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U40’;

For key = ‘U40’, tablespace=‘CLIENTS_INFO’

Serving


Table CLIENTS

CID Name

U40 John

Table SALES

SID CID Amount

S223 U40 99

Table CLIENTS

CID Name

U20 Doug

U21 Ted

Table SALES

SID CID Amount

S100 U20 102

S101 U20 60


Why does it scale?

Data is partitioned

Partitions are distributed across nodes

Adding more nodes increases capacity

Queries restricted to a single partition

Generation does not impact serving

Ok, so what is Splout SQLuseful for?

Big Data Analytics

Manageable output

Big Data Analytics

Sometimes Big Data output is also Big Data

Splout SQL allowsto serve

Big Data results

Let’s see an example …

Building a Google Analytics

Imagine that one crazy day you decide to build some kind of Google Analytics…

Millions of domains

Zillions of events

Individual panel per domain

Time-based charts (day/hour aggregations)

Flexible dimension breakdown

Per page, per browserPer country, per language …

Requirements

With Splout SQL

Splout SQL provides SQL consolidated views for Hadoop

data

Let’s see more details about

Splout SQL

Splout SQL Architecture

Each partition is …

Backed by SQLite

Generated on Hadoop

Distributed on Splout SQL cluster

Including any indexes needed

With replication for failover

Data can be sorted before insertion to minimize disk seeks at query time

Pre-sampling for balancing partition size

Atomicity

A tablespace is a set of tables that share the same partitioning schema

Tablespaces are versioned

Several tablespaces can be deployedat once

Only one version served at a time

All-or-nothing semantics (atomicity)

Rollback support

Characteristics

Ensured ms latencies

Controlled by the developer selecting the proper:

Even when queries hit disk

- Cluster topology- Partitioning- Indexes- Data collocation (insertion order)

Characteristics (II)

100% SQL

But restricted to a single partition

Real-time aggregations

Joins

ScalabilityIn data capacity

In performance

Characteristics (III)

Atomicity

New data replaces old data all at once

High availability

Through the use of replication

Open Source

Characteristics (IV)

Easy to manage

Read only

Data is updated in batches

Changing the size of the cluster can be done without any downtime

Updates come from new tablespacedeployments

Characteristics (V)

Native connectorsHive

Pig

Cascading

API - Generation

Command line

Loading CSV files

Java API

Connectors

$ hadoop jar splout-*-hadoop.jar generate …

API - Service

Rest API

JSON response

API - Console

Benchmark

350 GB Wikipedia logs

2-machines cluster900 queries/second, 80 ms/query, 80 threads

Aggregation queries impacting 15 rows in average

Benchmark (II)

4-machines cluster3150 queries/second, 40 ms/query, 160 threads

More info:http://sploutsql.com/performance.html

http://sploutsql.com/performance.html




Web latency

SQL

Consolidated Views

For Hadoop

“A good candidate for the serving layer of a lambda architecture”

www.SploutCloud.com - Splout SQL as a service

http://www.SploutCloud.com

Future work

Growing the community

Do you want to collaborate?

Automatic rebalancing on failover

Almost done

Some read/write capabilities

Enabling Splout SQL to become the speed layer on lambda architectures

Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt

Questions?

splout sql - web latency sql views for hadoop

Technology