splout sql: a web-latency sql view for hadoop

Splout SQL

A web-latency SQL view for Hadoop

http://sploutsql.com

Who am I?

● Pere Ferrera Bertran, Barcelona @ferrerabertran

● 8 years “backender” @ BCN startups.

● “The shy guy” (aka CTO) @ Datasalt
  ● Hadoop consulting: PeerIndex, Trovit, BBVA

● Open-source low-level API for Hadoop (Pangool)

– Accepted paper: ICDM 2012

● Jazz pianist in my free time

http://datasalt.com/

3.5 Big Data Challenges

Moving Big Data seamlessly is also a challenge!

Hadoop

● Mainstream Big Data Storage & Batch Processing
  ● Open-source.
  ● Large community.
  ● Many higher-level tools.
  ● Many companies around.
  ● It scales.

● Bad things people say about it:
  ● Slow
    – but we now have MapR!
  ● Hard to program
    – but we have Hive, Pig or Pangool!
    – and even things like Datameer!
  ● Buggy
    – but we have a stable 1.0 and supporting companies like Cloudera!

● Getting better and better! - YARN (2.0)

The Batch Revolution

● Batch is not the only kind of processing

● But it covers many cases very well.

– All our consulting clients use it.

● Hadoop makes it transparently scalable!

● I see this as a revolution.

● Advantages: simple, resistant to programming errors.

● Disadvantages: long-running processes, results take hours to update.

● My advice: can you cope with that? Then use batch processing.

● Ted Dunning & Nathan Marz are good “gurus” to listen to on this topic.

The problem (we want to solve)

● Big Data usually means having Big Data as input

● A lot of emphasis nowadays on “analytics”, where the output is usually small

● Small, targeted reports.

“I will eat all this so that almost nothing remains out of it...”

The problem (we want to solve)

● But the problem is that sometimes the output is also Big Data!
  ● Recommendations
  ● Aggregated stats
  ● Listings

● Recurring problem: take your “Big Data Output” and “put it” somewhere
  ● NoSQL
  ● Search engine

● The goal: answer real-time queries with low-latency lookups over it.
  ● Websites, mobile apps.
  ● A lot of people using the app concurrently.
  ● Read-only!

Current options

● Hadoop-generated files are not (directly) queryable

● They lack appropriate indexes (e.g. b-tree) for making queries really fast

● We can “send” the result of a Hadoop process directly to a database...

● Problems:

– Latency (random writing / rebalancing / index update)

– Affecting query service (database may slow down while updating and serving at the same time)

– Incrementality (may lead to inconsistency of results)
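
The point about missing indexes can be illustrated with SQLite's query planner. This is a minimal sketch and not part of Splout SQL: the table `recs`, the index `idx_user` and the helper `plan` are hypothetical names chosen for illustration. It plans the same lookup twice, first without an index (a full scan) and then with a b-tree index (a search).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recs (user_id TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO recs VALUES (?, ?)",
    [(f"u{i}", f"item{i}") for i in range(1000)],
)


def plan(sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT item FROM recs WHERE user_id = ?"

# No index yet: SQLite has to scan every row of the table.
plan_without_index = plan(lookup, ("u42",))

# With a b-tree index on the lookup column, the plan becomes an index search.
conn.execute("CREATE INDEX idx_user ON recs (user_id)")
plan_with_index = plan(lookup, ("u42",))

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should report a scan and the second a search using `idx_user`.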

Meet Splout SQL!

● Store generation decoupled from store serving
  ● Data is always optimally indexed.
  ● Zero fragmentation.

● “Atomic” deployment
  ● New versions replaced without affecting query serving.
  ● All data replaced at once.
  ● Flexible.

● 100% SQL
  ● Rich query language
  ● Real-time aggregations over data
  ● Not everything needs to be pre-computed!
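
The “atomic” deployment idea can be sketched on a single machine. This is only an illustration under stated assumptions, not Splout SQL's actual deployment mechanism: the directory layout, the `CURRENT` pointer file and the functions `build_version`, `deploy` and `lookup` are all hypothetical. Each version is generated completely off-line in its own directory, and serving switches to it in one rename.

```python
import os
import sqlite3
import tempfile


def build_version(root, version, rows):
    """Generate a complete, indexed store off-line in its own directory."""
    vdir = os.path.join(root, f"v{version}")
    os.makedirs(vdir)
    conn = sqlite3.connect(os.path.join(vdir, "data.db"))
    # PRIMARY KEY gives the table its b-tree index at generation time.
    conn.execute("CREATE TABLE stats (k TEXT PRIMARY KEY, v INTEGER)")
    conn.executemany("INSERT INTO stats VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    return vdir


def deploy(root, vdir):
    """Point serving at vdir in a single step."""
    tmp = os.path.join(root, "CURRENT.tmp")
    with open(tmp, "w") as f:
        f.write(vdir)
    # os.replace is an atomic rename on POSIX file systems, so readers see
    # either the old pointer or the new one, never a half-written file.
    os.replace(tmp, os.path.join(root, "CURRENT"))


def lookup(root, key):
    """Serve a read-only query from whichever version is current."""
    with open(os.path.join(root, "CURRENT")) as f:
        vdir = f.read()
    conn = sqlite3.connect(os.path.join(vdir, "data.db"))
    try:
        row = conn.execute("SELECT v FROM stats WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None
    finally:
        conn.close()


root = tempfile.mkdtemp()
deploy(root, build_version(root, 1, [("clicks", 10)]))
before = lookup(root, "clicks")
deploy(root, build_version(root, 2, [("clicks", 25)]))
after = lookup(root, "clicks")
print(before, after)  # prints "10 25"
```

The store being served is never written to: version 2 is built next to version 1, and all data is replaced at once when the pointer moves.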

Details

● A very old idea which everyone implemented by hand at some point.
  ● Horizontal partitioning.

● Generates many database files (partitions) and distributes them in a cluster.
  ● Replication, fail-over.

● Hadoop (Pangool) for generating the data structures.
  ● Including all b-trees needed!

● Database files: SQLite files.
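
Horizontal partitioning over SQLite files can be sketched as follows. This is a simplified, single-machine illustration, not Splout SQL's implementation: the names `partition_for`, `build_partitions` and `query` are hypothetical, hash-based routing is an assumption made for brevity (the slides do not say how keys are assigned to partitions), and replication and fail-over are left out.

```python
import os
import sqlite3
import tempfile
import zlib


def partition_for(key, n_partitions):
    """Map a partition key to a partition number, stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % n_partitions


def build_partitions(rows, n_partitions, out_dir):
    """Write (user_id, item, score) rows into one SQLite file per partition."""
    paths = [os.path.join(out_dir, f"partition-{i}.db") for i in range(n_partitions)]
    conns = [sqlite3.connect(p) for p in paths]
    for conn in conns:
        conn.execute("CREATE TABLE recs (user_id TEXT, item TEXT, score REAL)")
    for user_id, item, score in rows:
        conns[partition_for(user_id, n_partitions)].execute(
            "INSERT INTO recs VALUES (?, ?, ?)", (user_id, item, score)
        )
    for conn in conns:
        # The b-tree index is built at generation time, not at serving time.
        conn.execute("CREATE INDEX idx_user ON recs (user_id)")
        conn.commit()
        conn.close()
    return paths


def query(paths, key, sql, params=()):
    """Route a query to the single partition file that holds the key."""
    conn = sqlite3.connect(paths[partition_for(key, len(paths))])
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


rows = [("u1", "a", 0.9), ("u1", "b", 0.4), ("u2", "c", 0.7), ("u3", "d", 0.1)]
paths = build_partitions(rows, n_partitions=2, out_dir=tempfile.mkdtemp())
top = query(
    paths,
    "u1",
    "SELECT item FROM recs WHERE user_id = ? ORDER BY score DESC",
    ("u1",),
)
print(top)  # prints [('a',), ('b',)]
```

Because every row for a given key lives in exactly one file, a query touches a single small database, and the files themselves can be copied to any node in a cluster.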

Did you say SQLite?

SQLite

● Fast (10% slower than MySQL)

● Simple.

● Probably the best embedded SQL out there.

● Embedding it makes it easy to use it inside Hadoop.

● Still, it lacks some features.
  ● Not the database one would choose for an enterprise app.

● But Splout SQL is essentially read-only!
  ● So we don't need that many features.

Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.

Making Splout SQL fly

● Because the database is created off-line, things like insertion order can be controlled.
  ● Hadoop sorts the data for you.

● So you insert all your data in the appropriate order to make queries fast.
  ● Even if disk is used, only one seek will be needed (because of data locality).

Real-time GROUP BYs over an average of 2000 records of 50 bytes each take 40 milliseconds on average on an m1.small EC2 machine.
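
The effect of controlled insertion order can be sketched with plain SQLite. This is an illustration under assumptions, not the benchmark quoted above: the `sales` table and its data are made up, and Python's `sorted()` stands in for the sort that Hadoop would perform. Rows are inserted sorted by the query key, so all rows for one key end up next to each other, and a GROUP BY is then computed at query time.

```python
import sqlite3

# Unsorted input, as it might come out of an upstream process.
events = [
    ("shop-2", "mon", 5),
    ("shop-1", "tue", 7),
    ("shop-1", "mon", 3),
    ("shop-2", "tue", 1),
    ("shop-1", "mon", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (shop TEXT, day TEXT, amount INTEGER)")

# Insert in key order so that rows of the same shop are stored contiguously.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", sorted(events))
conn.execute("CREATE INDEX idx_shop ON sales (shop)")

# All rows for one shop now occupy consecutive rowids.
rowids = [
    r[0]
    for r in conn.execute(
        "SELECT rowid FROM sales WHERE shop = ? ORDER BY rowid", ("shop-1",)
    )
]

# The aggregation is computed at query time, not pre-computed.
totals = conn.execute(
    "SELECT day, SUM(amount) FROM sales WHERE shop = ? GROUP BY day ORDER BY day",
    ("shop-1",),
).fetchall()

print(rowids)  # prints [1, 2, 3]
print(totals)  # prints [('mon', 7), ('tue', 7)]
```

With the rows for a key stored together, the index lookup leads to one contiguous region of the file, which is the data-locality argument made in the bullets above.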

Recap

● We see a recurring problem when the output is also Big Data.
  ● Moving data between (batch) processing and serving.

● Splout SQL solves it and adds full SQL.

“A web-latency SQL view for Hadoop”

● Web-latency: unlike data warehousing / analytics

● SQL: unlike key/value and other NoSQL stores

● View: simply make files queryable → read-only

● For Hadoop: for Big Data output of batch processing

Check it out and play with it!

http://sploutsql.com
