Who am I?
● Pere Ferrera Bertran, Barcelona @ferrerabertran
● 8 years “backender” @ BCN startups.
● “The shy guy” (aka CTO) @ Datasalt
● Hadoop consulting: PeerIndex, Trovit, BBVA
● Open-source low-level API for Hadoop (Pangool)
– Accepted paper: ICDM 2012
● Jazz pianist in my free time
3.5 Big Data Challenges
Moving Big Data seamlessly is also a challenge!
Hadoop
● Mainstream Big Data Storage & Batch Processing
● Open-source.
● Large community.
● Many higher-level tools.
● Many companies around.
● It scales.
● Bad things people say about it:
● Slow
– but we now have MapR!
● Hard to program
– but we have Hive, Pig or Pangool!
– and even things like Datameer!
● Buggy
– but we have a stable 1.0 and supporting companies like Cloudera!
● Getting better and better! - YARN (2.0)
The Batch Revolution
● Batch is not the only kind of processing
● But it covers many cases very well.
– All our consulting clients use it.
● Hadoop makes it transparently scalable!
● I see this as a revolution.
● Advantages: Simple, resistant to programming errors.
● Disadvantages: Long-running processes, results are only updated every few hours.
● My advice: Can you cope with that? Then use batch processing.
● Ted Dunning & Nathan Marz are good “gurus” to listen to on this topic.
The problem (we want to solve)
● Big Data usually means having Big Data as input
● A lot of emphasis nowadays on “analytics”, where the output is usually small
● Small, targeted reports.
“I will eat all this so that almost nothing remains of it...”
The problem (we want to solve)
● But the problem is that sometimes the output is also Big Data!
● Recommendations
● Aggregated stats
● Listings
● Recurring problem: Take your “Big Data Output” and “put it” somewhere
● NoSQL
● Search engine
● The goal: answer real-time queries and low-latency lookups over it.
● Websites, mobile apps.
● A lot of people using the app concurrently.
● Read-only!
Current options
● Hadoop-generated files are not (directly) queryable
● They lack appropriate indexes (e.g. B-trees) for making queries really fast
● We can “send” the result of a Hadoop process directly to a database...
● Problems:
– Latency (random writing / rebalancing / index update)
– Affecting query service (database may slow down while updating and serving at the same time)
– Incrementality (may lead to inconsistency of results)
Meet Splout SQL!
● Store generation decoupled from store serving
● Data is always optimally indexed.
● Zero fragmentation.
● “Atomic” deployment
● New versions replaced without affecting query serving.
● All data replaced at once.
● Flexible.
● 100% SQL
● Rich query language
● Real-time aggregations over data
● Not everything needs to be pre-computed!
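To make the “atomic” deployment idea concrete, here is a minimal single-machine sketch in Python. It is not Splout SQL's actual mechanism: the pointer file named CURRENT, the table `stats`, and the helper names `build_version`, `publish` and `query` are all assumptions made up for illustration. Each version is built off-line as a complete, indexed SQLite file, and readers switch to it only when a small pointer file is replaced in one step.

```python
import os
import sqlite3
import tempfile


def build_version(root, version, rows):
    """Build a complete, indexed SQLite file for one version, off-line."""
    path = os.path.join(root, f"store-{version}.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE stats (key TEXT, value INTEGER)")
    conn.executemany("INSERT INTO stats VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_key ON stats (key)")
    conn.commit()
    conn.close()
    return path


def publish(root, path):
    """Point readers at a new file by replacing the pointer file in one step."""
    tmp = os.path.join(root, "CURRENT.tmp")
    with open(tmp, "w") as f:
        f.write(path)
    # os.replace swaps the pointer in a single rename, so a reader sees
    # either the old version or the new one, never a half-written store.
    os.replace(tmp, os.path.join(root, "CURRENT"))


def query(root, sql, params=()):
    """Serve a query from whichever version is currently published."""
    with open(os.path.join(root, "CURRENT")) as f:
        path = f.read()
    conn = sqlite3.connect(path)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.close()


root = tempfile.mkdtemp()
publish(root, build_version(root, 1, [("a", 1), ("b", 2)]))
before = query(root, "SELECT value FROM stats WHERE key = ?", ("a",))
publish(root, build_version(root, 2, [("a", 10), ("b", 20)]))
after = query(root, "SELECT value FROM stats WHERE key = ?", ("a",))
```

The old file is never modified while it is being served, which is why query serving is not affected by the generation of a new version.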
Details
● A very old idea which everyone implemented by hand at some point.
● Horizontal partitioning.
● Generates many database files (partitions) and distributes them in a cluster.
● Replication, fail-over.
● Hadoop (Pangool) for generating the data structures.
● Including all B-trees needed!
● Database files: SQLite files.
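The partitioning idea can be sketched on one machine as follows. This is only an illustration under stated assumptions: the real generation step runs on Hadoop (Pangool) and the files are distributed and replicated across a cluster, none of which is shown here. The hash-based `partition_for` function, the partition count, the `visits` table and the file names are hypothetical.

```python
import os
import sqlite3
import tempfile
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_for(key):
    """Map a partition key to a partition number (simple hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def generate_partitions(root, records):
    """Write every record to the SQLite file of its partition, then index it."""
    paths = [os.path.join(root, f"partition-{i}.db") for i in range(NUM_PARTITIONS)]
    conns = [sqlite3.connect(p) for p in paths]
    for conn in conns:
        conn.execute("CREATE TABLE visits (user_id TEXT, page TEXT, hits INTEGER)")
    for user_id, page, hits in records:
        conns[partition_for(user_id)].execute(
            "INSERT INTO visits VALUES (?, ?, ?)", (user_id, page, hits)
        )
    for conn in conns:
        # The index (a B-tree) is built at generation time, not at serving time.
        conn.execute("CREATE INDEX idx_user ON visits (user_id)")
        conn.commit()
        conn.close()
    return paths


def query_user(paths, user_id):
    """Route the query to the single partition that can contain this key."""
    conn = sqlite3.connect(paths[partition_for(user_id)])
    try:
        return conn.execute(
            "SELECT page, hits FROM visits WHERE user_id = ? ORDER BY page",
            (user_id,),
        ).fetchall()
    finally:
        conn.close()


records = [
    ("alice", "/home", 3),
    ("bob", "/home", 1),
    ("alice", "/cart", 1),
    ("carol", "/search", 7),
]
root = tempfile.mkdtemp()
paths = generate_partitions(root, records)
alice = query_user(paths, "alice")
```

Because a query carries its partition key, it only ever opens one small file, whatever the total size of the data set.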
Did you say SQLite?
SQLite
● Fast (10% slower than MySQL)
● Simple.
● Probably the best embedded SQL database out there.
● Embedding it makes it easy to use it inside Hadoop.
● Still, it lacks some features.
● Not the database one would choose for an enterprise app.
● But Splout SQL is essentially read-only!
● So we don't need that many features.
Splout != SQLite. In the future we might integrate it with PostgreSQL, for instance.
Making Splout SQL fly
● Because the database is created off-line, things like insertion order can be controlled.
● Hadoop sorts the data for you.
● So you insert all your data in the appropriate order for making queries fast.
● Even if disk is used, only one seek will be needed (because of data locality).
Real-time GROUP BYs over an average of 2000 records of 50 bytes each take 40 milliseconds on average on an m1.small EC2 machine.
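A small sketch of the insertion-order idea, using Python's built-in sqlite3 module. Here `sorted()` stands in for the sort that Hadoop would perform, and the `visits` table and its columns are made up for illustration. It shows only the technique (sorted insert, then index, then a GROUP BY at query time) and says nothing about performance.

```python
import sqlite3

# Unsorted output records: (user_id, page, hits).
records = [
    ("u2", "/b", 5),
    ("u1", "/a", 1),
    ("u2", "/a", 7),
    ("u1", "/b", 3),
    ("u2", "/a", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id TEXT, page TEXT, hits INTEGER)")

# Insert in key order so that all rows of one user get consecutive rowids
# and therefore sit next to each other in the table's B-tree.
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", sorted(records))
conn.execute("CREATE INDEX idx_user ON visits (user_id)")
conn.commit()

# The aggregation is computed at query time; nothing was pre-computed.
rows = conn.execute(
    "SELECT page, SUM(hits) FROM visits "
    "WHERE user_id = ? GROUP BY page ORDER BY page",
    ("u2",),
).fetchall()

# Physical insertion order, as seen through the rowid.
stored_order = [r[0] for r in conn.execute("SELECT user_id FROM visits ORDER BY rowid")]
```

With the rows of one key stored contiguously, a lookup followed by a GROUP BY reads one small, adjacent region of the file instead of jumping around it.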
Recap
● We see a recurring problem when the output is also Big Data.
● Moving data between (batch) processing and serving.
● Splout SQL solves it and adds full SQL.
“A web-latency SQL view for Hadoop”
● Web-latency: unlike data warehousing / analytics
● SQL: unlike key/value and other NoSQL stores
● View: simply make files queryable → read-only
● For Hadoop: for Big Data output of batch processing
Check it out and play with it!
http://sploutsql.com