splout sql - web latency sql views for hadoop
DESCRIPTION
There are many Big Data problems whose output is also Big Data. In this presentation we will show Splout SQL, which allows serving an arbitrarily big dataset by partitioning it. Splout serves partitioned SQL views which are generated and indexed by Hadoop. Splout is to Hadoop + SQL what Voldemort or Elephant DB are to Hadoop + Key/Value. Hadoop is nowadays the de-facto open-source solution for Big Data batch-processing. When the output of a Hadoop process is big, there isn`t a satisfying solution for serving it. Think of pre-computed recommendations, for example, where the whole dataset may vary from one day to another. Splout decouples database creation from database serving and makes it efficient and safe to deploy Hadoop-generated datasets. There are many databases that allow serving Big Data such as NoSQL solutions, but they don`t have a rich query language like SQL. You generally can`t aggregate data in real-time like you would do with a GROUP BY clause. Because you can`t precompute everything, SQL is a very convenient feature to have in a Big Data serving solution. Splout is not a “fast analytics” engine. Splout is made for demanding web or mobile applications where query performance is critical. Arbitrary real-time aggregations should be done in less than 200 milliseconds under high traffic load. On top of that, Splout is scalable, flexible, RESTful & open-source.TRANSCRIPT
Splout SQLWhen Big Data Output is also Big
Data
Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt
Full SQL*
For Big Data
Web latency & throughput
Unlike NoSQL
Unlike RDBMS
Unlike Impala, Apache Drill, etc.
* Within each partition
How does it work?
Isolation between generation and serving
Table CLIENTS
CID Name
U20 Doug
U21 Ted
Table SALES
SID CID Amount
S100 U20 102
S101 U20 60
Tablespace CLIENTS_INFOTable CLIENTS
CID Name
U20 Doug
U21 Ted
U40 John
Table SALES
SID CID Amount
S100 U20 102
S101 U20 60
S223 U40 99
table CLIENTS
table SALES
Generate tablespace CLIENTS_INFO with partitioned by CID
partitioned by CID
Partition U10 – U35
Partition U36 – U60
Table CLIENTS
CID Name
U40 John
Table SALES
SID CID Amount
S223 U40 99
Generation
Table CLIENTS
CID Name
U20 Doug
U21 Ted
Table SALES
SID CID Amount
S100 U20 102
S101 U20 60
SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U20’;
For key = ‘U20’, tablespace=‘CLIENTS_INFO’
Partition U10 – U35
Serving
Partition U36 – U60
Table CLIENTS
CID Name
U40 John
Table SALES
SID CID Amount
S223 U40 99
SELECT Name, sum(Amount) FROM CLIENTS c, SALES s WHERE c.CID = s.CID AND CID = ‘U40’;
For key = ‘U40’, tablespace=‘CLIENTS_INFO’
Serving
Partition U36 – U60
Table CLIENTS
CID Name
U40 John
Table SALES
SID CID Amount
S223 U40 99
Table CLIENTS
CID Name
U20 Doug
U21 Ted
Table SALES
SID CID Amount
S100 U20 102
S101 U20 60
Partition U10 – U35
Why does it scale?
Data is partitioned
Partitions are distributed across nodes
Adding more nodes increases capacity
Queries restricted to a single partition
Generation does not impact serving
Ok, so what is Splout SQLuseful for?
Big Data Analytics
Manageable output
Big Data Analytics
Sometimes Big Data output is also Big Data
Splout SQL allowsto serve
Big Data results
Let’s see an example …
Building a Google Analytics
Imagine that one crazy day you decide to build some kind of Google Analytics…
Millions of domains
Zillions of events
Individual panel per domain
Time-based charts (day/hour aggregations)
Flexible dimension breakdown
Per page, per browserPer country, per language …
Requirements
With Splout SQL
Splout SQL provides SQL consolidated views for Hadoop
data
Let’s see more details about
Splout SQL
Splout SQL Architecture
Each partition is …
Backed by SQLite
Generated on Hadoop
Distributed on Splout SQL cluster
Including any indexes needed
With replication for failover
Data can be sorted before insertion to minimize disk seeks at query time
Pre-sampling for balancing partition size
Atomicity
A tablespace is a set of tables that share the same partitioning schema
Tablespaces are versioned
Several tablespaces can be deployedat once
Only one version served at a time
All-or-nothing semantics (atomicity)
Rollback support
Characteristics
Ensured ms latencies
Controlled by the developer selecting the proper:
Even when queries hit disk
- Cluster topology- Partitioning- Indexes- Data collocation (insertion order)
Characteristics (II)
100% SQL
But restricted to a single partition
Real-time aggregations
Joins
ScalabilityIn data capacity
In performance
Characteristics (III)
Atomicity
New data replaces old data all at once
High availability
Through the use of replication
Open Source
Characteristics (IV)
Easy to manage
Read only
Data is updated in batches
Changing the size of the cluster can be done without any downtime
Updates come from new tablespacedeployments
Characteristics (V)
Native connectorsHive
Pig
Cascading
API - Generation
Command line
Loading CSV files
Java API
Connectors
$ hadoop jar splout-*-hadoop.jar generate …
API - Service
Rest API
JSON response
API - Console
Benchmark
350 GB Wikipedia logs
2-machines cluster900 queries/second, 80 ms/query, 80 threads
Aggregation queries impacting 15 rows in average
Benchmark (II)
4-machines cluster3150 queries/second, 40 ms/query, 160 threads
More info:http://sploutsql.com/performance.html
Web latency
SQL
Consolidated Views
For Hadoop
“A good candidate for the serving layer of a lambda architecture”
www.SploutCloud.com - Splout SQL as a service
Future work
Growing the community
Do you want to collaborate?
Automatic rebalancing on failover
Almost done
Some read/write capabilities
Enabling Splout SQL to become the speed layer on lambda architectures
Iván de Prado Alonso – CEO of Datasaltwww.datasalt.es@ivanprado@datasalt
Questions?