Splice Machine: Architecture of an Open Source RDBMS Powered by HBase and Spark


Architecture of an Open Source RDBMS Powered by HBase and Spark

January 12, 2017

Spark Meetup Barcelona

Daniel Gómez Ferro

Agenda

▪ Introduction to Splice Machine
▪ 1.x: the need for Spark
▪ 2.0: Spark introduction, challenges and wins
▪ Future

Who are we?

▪ Splice Machine
  ▪ Distributed database company
  ▪ Open source
  ▪ VC-backed
  ▪ Offices in San Francisco and St. Louis, MO

What do we do?

The Open Source RDBMS Powered By Hadoop & Spark: SQL, Scale Out, Speed

▪ ANSI SQL: no retraining or rewrites for SQL-based analysts, reports, and applications
▪ Transactions: ensure reliable updates across multiple rows
▪ ¼ the cost: scales out on commodity hardware
▪ Elastic: increase scale in just a few minutes
▪ 10x faster: leverages Spark in-memory technology
▪ Mixed workloads: simultaneously support OLTP and OLAP workloads

Where are we?

Timeline
▪ Founded: 2012
▪ v1.0: Nov 2014
▪ v1.5: Oct 2015
▪ v2.0 (open source, Spark integration): Jul 2016
▪ v2.5: Feb(?) 2017

1.x: the need for Spark

RDBMS building blocks

▪ SQL Parser
▪ SQL Planner
▪ SQL Optimizer
▪ Data storage
▪ Execution Engine

In Splice Machine, Apache Derby provides the SQL parser, planner and optimizer, and Apache HBase provides the data storage.

HBase Introduction

▪ Distributed Sorted Map
  ▪ Keys and values are byte[]
  ▪ Lexicographically sorted (like a dictionary): [] < [0x00, 0x00] < [0x01] < [0x01, 0x00]
  ▪ The key range is partitioned/sharded into ‘regions’
▪ Operations (a client sketch follows below)
  ▪ get(byte[] key) -> byte[]
  ▪ put(byte[] key, byte[] value)
  ▪ scan(byte[] start, byte[] stop) -> scanner
▪ Extensions
  ▪ Coprocessors
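A minimal client-side sketch of the three operations above, assuming the HBase 2.x Java client API; the table name, column family and row keys are made up for illustration.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Scan}
import org.apache.hadoop.hbase.util.Bytes

object HBaseBasics {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("demo"))   // assumed table with family "d"

    // put(key, value): keys and values are plain byte[]
    table.put(new Put(Bytes.toBytes("row-01"))
      .addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("hello")))

    // get(key) -> byte[]
    val value = table.get(new Get(Bytes.toBytes("row-01")))
      .getValue(Bytes.toBytes("d"), Bytes.toBytes("v"))
    println(Bytes.toString(value))

    // scan(start, stop) -> scanner; rows come back in lexicographic key order
    val scanner = table.getScanner(
      new Scan().withStartRow(Bytes.toBytes("row-00")).withStopRow(Bytes.toBytes("row-99")))
    var r = scanner.next()
    while (r != null) { println(Bytes.toString(r.getRow)); r = scanner.next() }

    scanner.close(); table.close(); conn.close()
  }
}
```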

HBase Architecture

(diagram: HBase sits on top of HDFS)


HBase challenges

▪ Sits on top of HDFS
  ▪ A distributed, append-only filesystem
  ▪ Files are immutable once closed
▪ Log-Structured Merge Tree
  ▪ Multi-level storage
  ▪ Inserts are buffered in memory
  ▪ Flush this buffer to disk, writing indexed, sorted files

(diagram: an in-memory buffer [a; m] alongside File 1 [a; p] and File 2 [f; z])


(diagram: after a flush, the buffer is empty and a new File 3 [a; m] sits alongside File 1 [a; p] and File 2 [f; z])


HBase Architecture

▪ HRegion
  ▪ MemStore
  ▪ One or more HFiles
  ▪ HLog
▪ Writes
  ▪ Add the value to the MemStore
  ▪ Write it to the HLog
  ▪ When the MemStore gets big enough, flush: dump the MemStore into a new HFile
▪ Reads
  ▪ Read in parallel from the MemStore and all HFiles
(a toy sketch of this write/flush/read path follows below)
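A toy, single-process sketch of that path (not HBase code): writes go to a sorted in-memory buffer plus an append-only log, a flush turns the buffer into an immutable sorted "file", and reads consult the buffer and all flushed files. All names are invented.

```scala
import scala.collection.immutable.TreeMap

// Toy LSM store: a MemStore-like buffer, a WAL-like log, and flushed immutable "HFiles"
class ToyLsm(flushThreshold: Int = 4) {
  private var memstore = TreeMap.empty[String, String]          // sorted in-memory buffer
  private var log      = Vector.empty[(String, String)]         // stand-in for the HLog
  private var files    = Vector.empty[TreeMap[String, String]]  // immutable, sorted "HFiles"

  def put(key: String, value: String): Unit = {
    log = log :+ (key -> value)        // 1. append to the log for durability
    memstore += (key -> value)         // 2. insert into the sorted buffer
    if (memstore.size >= flushThreshold) flush()
  }

  // dump the buffer into a new immutable sorted file, then start with an empty buffer
  def flush(): Unit =
    if (memstore.nonEmpty) { files = files :+ memstore; memstore = TreeMap.empty }

  // read: consult the memstore first, then every flushed file from newest to oldest
  def get(key: String): Option[String] =
    memstore.get(key).orElse(
      files.reverseIterator.map(_.get(key)).collectFirst { case Some(v) => v })
}

object ToyLsmDemo extends App {
  val lsm = new ToyLsm()
  ('a' to 'f').zipWithIndex.foreach { case (c, i) => lsm.put(c.toString, i.toString) }
  println(lsm.get("c"))   // Some(2)
}
```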

Derby - HBase integration

▪ We reused several Derby components
  ▪ JDBC driver
  ▪ SQL Parser/Planner/Optimizer
  ▪ In-memory data formats
  ▪ Bytecode generation
▪ Developed some custom solutions
  ▪ TEMP table for transient data (joins, aggregates, etc.)
  ▪ Task framework (using HBase’s coprocessors)
  ▪ Connection pooling
▪ Switched Derby’s datastore for HBase
  ▪ Primary keys and indexes make use of HBase’s sorting order (illustrated below)
  ▪ Removing Derby’s assumptions about running on a single machine...
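A hedged illustration of why that works; this is not Splice Machine's actual row-key encoder. If a primary-key column is encoded so that unsigned byte-wise comparison matches its logical order, HBase's sorted keyspace gives range scans on the primary key (and, with the same trick on indexed columns, on secondary indexes) essentially for free.

```scala
import java.nio.charset.StandardCharsets

// Illustrative order-preserving key encoding (NOT Splice Machine's real format):
// an Int is written big-endian with its sign bit flipped, so that unsigned byte-wise
// comparison of the encoded bytes matches signed integer order.
object KeyEncoding {
  def encodeInt(i: Int): Array[Byte] = {
    val u = i ^ Int.MinValue   // flip the sign bit: -5 < 3 stays true under byte comparison
    Array((u >>> 24).toByte, (u >>> 16).toByte, (u >>> 8).toByte, u.toByte)
  }

  // Composite primary key (customerId, name): fixed-width int followed by the UTF-8 name
  def encodeKey(customerId: Int, name: String): Array[Byte] =
    encodeInt(customerId) ++ name.getBytes(StandardCharsets.UTF_8)

  // Unsigned lexicographic byte comparison, i.e. the order HBase keeps its rows in
  def byteLess(a: Array[Byte], b: Array[Byte]): Boolean = {
    val cmp = a.zip(b).collectFirst { case (x, y) if x != y => (x & 0xff) - (y & 0xff) }
    cmp.getOrElse(a.length - b.length) < 0
  }

  def main(args: Array[String]): Unit = {
    val keys   = Seq(encodeKey(3, "bob"), encodeKey(-5, "zoe"), encodeKey(3, "amy"))
    val sorted = keys.sortWith(byteLess)
    // prints the keys in logical primary-key order: (-5,"zoe"), (3,"amy"), (3,"bob")
    sorted.foreach(k => println(k.map(b => f"${b & 0xff}%02x").mkString(" ")))
  }
}
```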


Splice Machine 1.x

▪ Great for
  ▪ Operational workloads
  ▪ Replacing non-scalable RDBMS solutions
▪ SQL support
  ▪ SQL-99, indexes, triggers, foreign keys, cost-based optimizer...
▪ But...
  ▪ Struggled on analytical queries
  ▪ HBase’s compactions created instabilities
  ▪ Minimum latency was too high (due to the Task Framework)


2.0: Spark introduction, challenges and wins

Spark Integration

▪ Challenging but natural
▪ Matched the tree of database operators with RDD transformations (sketch below):
  ▪ Scan -> newAPIHadoopRDD()
  ▪ Restriction -> filter()
  ▪ Join -> join()
  ▪ Aggregate -> reduceByKey()
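A sketch of that mapping in plain Spark (Scala), assuming HBase's TableInputFormat and made-up table and column names; the real Splice plan translation is considerably more involved.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object OperatorTreeSketch {
  private val cf = Bytes.toBytes("cf")   // assumed column family

  // Scan: one HBase table -> RDD[(ImmutableBytesWritable, Result)]
  def scan(sc: SparkContext, table: String) = {
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, table)
    sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("operator-tree-sketch"))

    // Scan + projection: ORDERS -> (customerId, amount)
    val orders = scan(sc, "ORDERS").map { case (_, r) =>
      (Bytes.toString(r.getValue(cf, Bytes.toBytes("CUST_ID"))),
       Bytes.toDouble(r.getValue(cf, Bytes.toBytes("AMOUNT"))))
    }

    // Scan + Restriction (filter): CUSTOMER rows matching a WHERE clause
    val customers = scan(sc, "CUSTOMER").map { case (k, r) =>
      (Bytes.toString(k.get()), Bytes.toString(r.getValue(cf, Bytes.toBytes("COUNTRY"))))
    }.filter { case (_, country) => country == "ES" }

    // Join on customer id, then Aggregate: SUM(amount) GROUP BY customer id
    val totals = orders.join(customers)
      .mapValues { case (amount, _) => amount }
      .reduceByKey(_ + _)

    totals.take(10).foreach(println)
    sc.stop()
  }
}
```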

Unified API

▪ Abstracted away the Spark API (sketched below)
▪ Two implementations
  ▪ In-memory, using Guava’s FluentIterable APIs
  ▪ Distributed, using Spark
▪ SQL operations have a single implementation
▪ In-memory use case:
  ▪ OLTP workloads
  ▪ Very low latency
  ▪ Bring data in, perform the computation locally
  ▪ An anti-pattern in distributed systems, but it works
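A hedged sketch of the idea, with invented names rather than Splice's actual interfaces (and plain Scala collections standing in for Guava's FluentIterable): operators are written once against a small abstraction that is backed either by a local collection or by a Spark RDD.

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// One abstraction over "a collection of rows"; operator code is written once against it.
trait DataSet[T] {
  def map[U: ClassTag](f: T => U): DataSet[U]
  def filter(p: T => Boolean): DataSet[T]
  def collectAll(): Seq[T]
}

// Local implementation: plain collections, no Spark job, lowest possible latency (OLTP path).
class LocalDataSet[T](rows: Iterable[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U): DataSet[U] = new LocalDataSet(rows.map(f))
  def filter(p: T => Boolean): DataSet[T]     = new LocalDataSet(rows.filter(p))
  def collectAll(): Seq[T]                    = rows.toSeq
}

// Distributed implementation: the same operations delegated to a Spark RDD (OLAP path).
class SparkDataSet[T](rdd: RDD[T]) extends DataSet[T] {
  def map[U: ClassTag](f: T => U): DataSet[U] = new SparkDataSet(rdd.map(f))
  def filter(p: T => Boolean): DataSet[T]     = new SparkDataSet(rdd.filter(p))
  def collectAll(): Seq[T]                    = rdd.collect().toSeq
}

// An "operator" written once; the engine chooses the implementation per query.
object Restriction {
  def positiveAmounts(rows: DataSet[(String, Double)]): DataSet[(String, Double)] =
    rows.filter { case (_, amount) => amount > 0 }
}
```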

Spark Integration Benefits

▪ Got rid of the TEMP table
  ▪ Spark maintains temporary data in memory
▪ Got rid of the Task Framework
  ▪ Spark performs the same job with less complexity
▪ Resource isolation
  ▪ HBase and Spark run in separate processes
  ▪ Analytical queries have far less impact on HBase stability

Integration Problems

▪ Many serialization boundaries, hence poor performance
  ▪ HDFS -> HBase -> Spark -> Client
▪ Task granularity
  ▪ Too coarse
▪ Multiple Spark contexts
  ▪ One per HRegionServer
▪ Derby legacy issues
  ▪ Custom datatypes
  ▪ Thread context assumptions

Solving: serialization boundaries

▪ Remove serialization boundaries
▪ Hybrid scanners:
  ▪ A custom InputFormat reads HFiles directly from HDFS into Spark
  ▪ Those values are merged with a fast scanner on the MemStore (the merge idea is sketched below)
  ▪ Most data: HDFS -> Spark
  ▪ Small part: HBase (in-memory) -> Spark
▪ Requires some hooks in HBase
  ▪ Compactions remove HFiles that Spark might still be reading
  ▪ Flushes add HFiles
▪ Much better read performance
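The heart of the hybrid scanner is a merge of two key-sorted streams: bulk rows read straight from HFiles on HDFS and the most recent rows from the MemStore, with the MemStore version winning on duplicate keys. A self-contained sketch of that merge (not Splice's actual scanner code):

```scala
object HybridScanSketch extends App {
  // Merge two iterators that are each sorted by key, preferring `fresh` (MemStore)
  // over `bulk` (HFiles read from HDFS) when the same key appears in both.
  def mergeSorted[K: Ordering, V](bulk: Iterator[(K, V)],
                                  fresh: Iterator[(K, V)]): Iterator[(K, V)] = {
    val ord = implicitly[Ordering[K]]
    val a = bulk.buffered
    val b = fresh.buffered
    new Iterator[(K, V)] {
      def hasNext: Boolean = a.hasNext || b.hasNext
      def next(): (K, V) =
        if (!a.hasNext) b.next()
        else if (!b.hasNext) a.next()
        else ord.compare(a.head._1, b.head._1) match {
          case c if c < 0 => a.next()
          case c if c > 0 => b.next()
          case _          => { a.next(); b.next() }   // same key: drop the HFile version
        }
    }
  }

  // Most data comes from HFiles on HDFS; a small, newer part comes from the MemStore
  val fromHFiles   = Iterator("a" -> 1, "c" -> 3, "d" -> 4)
  val fromMemStore = Iterator("b" -> 2, "d" -> 40)
  mergeSorted(fromHFiles, fromMemStore).foreach(println)   // (a,1) (b,2) (c,3) (d,40)
}
```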

Solving: task granularity

▪ Increase task granularity
▪ HTableInputFormat’s default is:
  ▪ 1 region = 1 partition
  ▪ Each region can be 1 GB or more
▪ SpliceInputFormat subdivides regions into blocks (32 MB by default; toy sketch below)
  ▪ Better parallelism
  ▪ Better performance
▪ This also needs hooks in HBase (coprocessors)
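A toy sketch of the subdivision idea with invented names, not SpliceInputFormat itself: given a region's key range, its size, and a sorted sample of its row keys (the real implementation obtains these through HBase hooks), emit roughly block-sized sub-splits instead of one split per region.

```scala
object SubSplitSketch extends App {
  // One scan range that becomes one Spark partition
  final case class SubSplit(startRow: String, stopRow: String)

  def subdivide(regionStart: String,
                regionStop: String,
                regionSizeBytes: Long,
                sampledKeys: IndexedSeq[String],               // sorted keys within the region
                blockSizeBytes: Long = 32L * 1024 * 1024): Seq[SubSplit] = {
    val nBlocks = math.max(1, (regionSizeBytes / blockSizeBytes).toInt)
    // choose nBlocks - 1 interior boundaries, evenly spaced through the key sample
    val boundaries =
      if (nBlocks == 1 || sampledKeys.isEmpty) Seq.empty[String]
      else (1 until nBlocks).map(i => sampledKeys(i * sampledKeys.size / nBlocks)).distinct
    val edges = (regionStart +: boundaries) :+ regionStop
    edges.sliding(2).collect { case Seq(a, b) => SubSplit(a, b) }.toSeq
  }

  // A 1 GB region becomes ~32 sub-splits that Spark can scan in parallel
  val sample = (0 until 1000).map(i => f"row-$i%06d")
  val splits = subdivide("row-000000", "row-001000", 1L << 30, sample)
  println(s"${splits.size} sub-splits, first: ${splits.head}")
}
```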

Solving: multiple Spark contexts

▪ Single shared Spark context
▪ JobServer wasn’t good enough
  ▪ It would become a bottleneck: all results would flow through it
▪ Custom JobServer (called OLAPServer)
  ▪ A single Spark context lives on this server
  ▪ Currently colocated with the HMaster (fault tolerance for free)
  ▪ Makes Spark jobs stream results directly to the client
  ▪ Runs several partitions in parallel
  ▪ Starts streaming as soon as there’s data

JobServer vs OLAPServer

▪ JobServer timeline: run partition 1 to completion, then send its results; only then run partition 2 to completion and send its results, and so on. While each partition is running, the client is blocked waiting for more data.
▪ OLAPServer timeline: start partitions 1, 2 and 3 in parallel and send each row to the client as soon as it is produced; as partitions finish, start the next ones (4, then 5 and 6). The client receives a continuous stream of results instead of waiting for whole partitions.
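A similar contrast exists in miniature in Spark's own driver API, which is one way to picture the difference: collect() blocks until the whole job has finished and returns everything at once, while toLocalIterator() hands back results one partition at a time, so the consumer can start working as soon as the first partition completes. (The OLAPServer goes further by streaming individual rows and running several partitions concurrently.)

```scala
import org.apache.spark.sql.SparkSession

object StreamingResultsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("streaming-results").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000000, numSlices = 8).map(x => x * 2)

    // JobServer-style: wait for the whole job, then receive one big batch of results
    val everything = rdd.collect()
    println(s"collect(): got all ${everything.length} rows at once")

    // Closer to the OLAPServer idea: results arrive incrementally, partition by partition,
    // so the consumer can start processing before the whole job has finished
    var seen = 0L
    rdd.toLocalIterator.foreach { _ => seen += 1 }
    println(s"toLocalIterator(): consumed $seen rows incrementally")

    spark.stop()
  }
}
```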


Solving: Derby legacy issues

▪ Custom datatypes:
  ▪ Custom Kryo serializers for Derby objects
▪ Thread contexts
  ▪ Not completely solved
  ▪ Use TaskContext.addTaskCompletionListener() to clean up after ourselves
  ▪ Still finding resource leaks from time to time...
(both techniques are sketched below)
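Hedged sketches of both techniques; the Derby-style type and its serializer are stand-ins rather than Splice's actual classes, and the completion-listener call assumes the Spark 2.4-era TaskContext API.

```scala
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.TaskContext
import org.apache.spark.serializer.KryoRegistrator

// Stand-in for a Derby datatype that Kryo cannot serialize efficiently on its own
class SQLDecimalLike(var unscaled: Long, var scale: Int)

// Custom Kryo serializer for the Derby-style object
class SQLDecimalLikeSerializer extends Serializer[SQLDecimalLike] {
  override def write(kryo: Kryo, out: Output, v: SQLDecimalLike): Unit = {
    out.writeLong(v.unscaled); out.writeInt(v.scale)
  }
  override def read(kryo: Kryo, in: Input, cls: Class[SQLDecimalLike]): SQLDecimalLike =
    new SQLDecimalLike(in.readLong(), in.readInt())
}

// Hooked up through spark.kryo.registrator=CustomTypesRegistrator
class CustomTypesRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[SQLDecimalLike], new SQLDecimalLikeSerializer)
}

// Cleaning up per-task resources (thread contexts, connections, ...) when a task ends;
// meant to run inside mapPartitions on the executors, where TaskContext.get() is non-null.
object TaskCleanupExample {
  def process(rows: Iterator[String]): Iterator[String] = {
    val resource = new java.io.StringWriter()               // stand-in for a leak-prone resource
    TaskContext.get().addTaskCompletionListener[Unit] { _ =>
      resource.close()                                       // runs when the task finishes or fails
    }
    rows.map(_.toUpperCase)
  }
}
```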

Other Spark goodies

▪ HBase compactions in Spark:
  ▪ HBase compactions can be expensive: they read and write lots of data
  ▪ If they run in the HBase JVM they can kill OLTP performance
  ▪ We made it possible to run them in Spark
    ▪ Maintaining data locality
    ▪ Scheduled among other jobs
  ▪ Fall back to HBase if the Spark scheduler doesn’t have resources
▪ Integration with Spark Streaming:
  ▪ We can ingest data directly from Spark Streaming
  ▪ Easy to write to Splice Machine from Kafka through it (a sketch follows below)
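A sketch of that Kafka path using the spark-streaming-kafka-0-10 connector and plain JDBC; the Splice JDBC URL, credentials, topic and table names are assumptions for illustration, and this is just one simple way to do the write, not necessarily the mechanism referred to above.

```scala
import java.sql.DriverManager

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaToSplice {
  // Assumed Splice Machine JDBC URL (Derby-style driver); adjust host and credentials as needed
  val jdbcUrl = "jdbc:splice://splice-host:1527/splicedb;user=splice;password=admin"

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-to-splice"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "splice-ingest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Write each micro-batch into a (hypothetical) EVENTS table, one connection per partition
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = DriverManager.getConnection(jdbcUrl)
        val ps = conn.prepareStatement("INSERT INTO EVENTS (ID, PAYLOAD) VALUES (?, ?)")
        records.foreach { r => ps.setString(1, r.key()); ps.setString(2, r.value()); ps.addBatch() }
        ps.executeBatch()
        ps.close(); conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```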

Future work and roadmap


Future Spark work

▪ Move to the DataFrame APIs
  ▪ Catalyst optimizer
  ▪ Whole-stage code generation (better than Derby’s codegen)
  ▪ Already transitioned some operations
  ▪ Requires good UnsafeRow support
▪ UnsafeRow
  ▪ Compact in-memory representation
  ▪ Rows are serialized in a contiguous block of memory
  ▪ Better memory management, less GC time

(diagram: in the current format each row is an array of descriptor objects, with fields such as SQLInteger and SQLDate boxed as separate JVM objects; with UnsafeRow the same rows, e.g. (0, 100) and (100, 50), are packed into a single contiguous byte[] memory block. A small Spark illustration follows below.)
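A small illustration of the difference using the public Spark 2.x API (table and column names invented): the RDD keeps each row as a JVM object with boxed fields, while the Dataset encodes rows into Tungsten's UnsafeRow binary format, a contiguous block of memory per row, which is also what whole-stage code generation operates on.

```scala
import org.apache.spark.sql.SparkSession

case class Order(id: Int, amount: Int)   // hypothetical row type

object UnsafeRowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unsaferow-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD world: every row is a separate JVM object with boxed fields (per-object GC cost)
    val rdd = spark.sparkContext.parallelize(Seq(Order(1, 100), Order(2, 50)))

    // Dataset world: the encoder packs each row into an UnsafeRow, a contiguous
    // Tungsten-managed byte[] that whole-stage codegen reads and writes directly
    val ds = rdd.toDS()

    ds.groupBy($"id").sum("amount").explain()   // shows the WholeStageCodegen plan
    ds.groupBy($"id").sum("amount").show()

    spark.stop()
  }
}
```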

Future Spark work

▪ Columnar storage format
▪ We already have ‘pinned’ tables (sketch below):
  ▪ Create a Parquet snapshot of a table
  ▪ Get columnar access
  ▪ Good for read-only data
▪ Planning to maintain a dual representation
  ▪ Row-oriented for recently written values
  ▪ Column-oriented for historical data
  ▪ Merge the two on the fly
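A hedged sketch of what a 'pinned' snapshot amounts to from the Spark side; the JDBC URL and table name are assumptions, and Splice's own pinning mechanism may work differently. The idea: read the live row-oriented table once, write it out as Parquet, and serve read-only analytical scans from the columnar copy.

```scala
import org.apache.spark.sql.SparkSession

object PinnedTableSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pinned-table-sketch").getOrCreate()

    // Read the live, row-oriented table over JDBC (URL, credentials and table are assumptions)
    val lineitem = spark.read.format("jdbc")
      .option("url", "jdbc:splice://splice-host:1527/splicedb;user=splice;password=admin")
      .option("dbtable", "TPCH.LINEITEM")
      .load()

    // "Pin" it: materialize a columnar Parquet snapshot, good for read-only analytics
    lineitem.write.mode("overwrite").parquet("/snapshots/tpch/lineitem")

    // Analytical reads now get columnar access (column pruning, predicate pushdown)
    val pinned = spark.read.parquet("/snapshots/tpch/lineitem")
    pinned.groupBy("L_RETURNFLAG").sum("L_QUANTITY").show()

    spark.stop()
  }
}
```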

Future Spark work

▪ Better Spark shell integration
  ▪ Our SparkContext resides in the OLAPServer
  ▪ Getting data to a Spark shell incurs a serialization boundary, from Splice’s SparkContext to the shell’s context
  ▪ We want to achieve a transparent conversion: ResultSet -> DataFrame

2.5 Roadmap

▪ Performance increases across the board
  ▪ TPCC, TPCH, Backup/Restore, ODBC driver…
▪ Incremental backup
▪ Native PL/SQL support (in beta)
  ▪ No excuses left for not migrating those Oracle databases
▪ Client load balancing/failover via HAProxy
▪ Statistics improvements
  ▪ Histograms, sketching libraries
▪ RDD caching (pinning)

Thank You!

We are hiring :-)


TPCH 100 Load Throughput (chart not reproduced in the transcript)

TPCH 100 Query Times (seconds; non-numeric entries reference blocking JIRA issues or are still TBD)

Query   Phoenix         Trafodion        Splice Machine
1       395             TRAFODION-2237   99
2       PHOENIX-3322    516              44
3       PHOENIX-3322    TRAFODION-2237   126
4       PHOENIX-3322    TBD              133
5       PHOENIX-3322    TBD              192
6       74              3178             38
7       PHOENIX-3322    4442             220
8       PHOENIX-3322    TRAFODION-2239   620
9       PHOENIX-3322    941              273
10      PHOENIX-3322    TRAFODION-2241   101
11      PHOENIX-3317    463              56
12      379             TBD              85
13      PHOENIX-3318    TBD              71
14      PHOENIX-3322    TBD              50
15      PHOENIX-3319    TBD              102
16      PHOENIX-3322    TBD              33
17      PHOENIX-3322    TBD              929
18      PHOENIX-3322    TBD              SPLICE-34
19      PHOENIX-3322    TBD              57
20      PHOENIX-3320    TBD              SPLICE-410
21      PHOENIX-3321    TBD              479
22      PHOENIX-3322    TBD              219
