Splice Machine: Architecture of an Open Source RDBMS Powered by HBase and Spark

Architecture of an Open Source RDBMS powered by HBase and Spark January 12, 2017 Spark Meetup Barcelona Daniel Gómez Ferro


Post on 13-Apr-2017


TRANSCRIPT

Page 1: Splice Machine: Architecture of an Open Source RDBMS powered by HBase and Spark

Architecture of an Open Source RDBMS powered by HBase and Spark

January 12, 2017

Spark Meetup Barcelona

Daniel Gómez Ferro

Page 2: Agenda

▪ Introduction to Splice Machine
▪ 1.x: the need for Spark
▪ 2.0: Spark introduction, challenges and wins
▪ Future

Page 3: Who are we?

▪ Splice Machine
▪ Distributed database company
▪ Open source
▪ VC-backed
▪ Offices in San Francisco and St. Louis (MO)

Page 4: What do we do?

The Open Source RDBMS Powered By Hadoop & Spark

SQL, Scale Out, Speed

▪ ANSI SQL: No retraining or rewrites for SQL-based analysts, reports, and applications
▪ ¼ the Cost: Scales out on commodity hardware
▪ Transactions: Ensure reliable updates across multiple rows
▪ Mixed Workloads: Simultaneously support OLTP and OLAP workloads
▪ Elastic: Increase scale in just a few minutes
▪ 10x Faster: Leverages Spark in-memory technology

Page 5: Where are we?

Timeline

▪ Founded: 2012
▪ v1.0: Nov 2014
▪ v1.5: Oct 2015
▪ v2.0 (Open Source, Spark integration): Jul 2016
▪ v2.5: Feb? 2017

Page 6: 1.x: the need for Spark

Page 7: RDBMS building blocks

▪ SQL Parser
▪ SQL Planner
▪ SQL Optimizer
▪ Data storage
▪ Execution Engine

Page 8: RDBMS building blocks

▪ SQL Parser
▪ SQL Planner
▪ SQL Optimizer
▪ Data storage
▪ Execution Engine

Diagram labels: Apache Derby (SQL layer), Apache HBase (data storage)

Page 9: HBase Introduction

▪ Distributed Sorted Map
  ▪ Keys and Values are byte[]
  ▪ Lexicographically sorted (like a dictionary)
    ▪ [] < [0x00, 0x00] < [0x01] < [0x01, 0x00]
  ▪ Key range is partitioned/sharded in ‘regions’
▪ Operations
  ▪ get(byte[] key) -> byte[]
  ▪ put(byte[] key, byte[] value)
  ▪ scan(byte[] start, byte[] stop) -> scanner
▪ Extensions
  ▪ Coprocessors
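The sorted-map interface on this slide can be mimicked in a few lines of Python. This is a minimal sketch, assuming a single in-process map with no regions; `SortedByteMap` and its method names are hypothetical and are not the HBase client API.

```python
import bisect


class SortedByteMap:
    """Toy sorted map over byte-string keys (illustrative, not the HBase API)."""

    def __init__(self):
        self._keys = []    # kept sorted; Python compares bytes lexicographically
        self._values = {}

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < stop:
            yield self._keys[i], self._values[self._keys[i]]
            i += 1


table = SortedByteMap()
# Insert the keys from the slide in shuffled order.
for k in (b"\x01\x00", b"", b"\x01", b"\x00\x00"):
    table.put(k, b"v")

ordered_keys = [k for k, _ in table.scan(b"", b"\xff")]
```

Scanning returns the keys in the same order the slide lists them, regardless of insertion order.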

Page 10: HBase Architecture

[Figure: HBase architecture diagram, layered on top of HDFS]

Page 11: HBase challenges

▪ Sits on top of HDFS
  ▪ Distributed append-only filesystem
  ▪ Files are immutable once closed

Page 12: HBase challenges

▪ Sits on top of HDFS
  ▪ Distributed append-only filesystem
  ▪ Files are immutable once closed
▪ Log-Structured Merge Tree
  ▪ Multi-level storage
    ▪ Inserts are buffered in memory
    ▪ Flush this buffer to disk, writing indexed, sorted files

[Diagram: in-memory buffer [a; m]; File 1 [a; p]; File 2 [f; z]]

Page 13: HBase challenges

▪ Sits on top of HDFS
  ▪ Distributed append-only filesystem
  ▪ Files are immutable once closed
▪ Log-Structured Merge Tree
  ▪ Multi-level storage
    ▪ Inserts are buffered in memory
    ▪ Flush this buffer to disk, writing indexed, sorted files

[Diagram, after the flush: File 3 [a; m]; File 1 [a; p]; File 2 [f; z]; in-memory buffer []]

Page 14: HBase Architecture

[Figure: HBase architecture diagram, layered on top of HDFS]

Page 15: HBase Architecture

▪ HRegion
  ▪ MemStore
  ▪ One or more HFiles
  ▪ HLog
▪ Writes
  ▪ Add it to the MemStore
  ▪ Write it to the HLog
  ▪ When the MemStore gets big enough
    ▪ Flush: dump the MemStore into a new HFile
▪ Reads
  ▪ In parallel from the MemStore and all HFiles
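The write and read paths listed on this slide can be sketched as a toy log-structured store. This is an illustration under simplifying assumptions: `ToyRegion` is a hypothetical class, the "HFiles" are in-memory tuples rather than files on HDFS, the flush threshold counts entries rather than bytes, and reads check the sources one after another (newest first) instead of in parallel.

```python
class ToyRegion:
    """Illustrative LSM-style region: MemStore + immutable 'HFiles' + log."""

    def __init__(self, flush_threshold=3):
        self.memstore = {}    # in-memory buffer of recent writes
        self.hfiles = []      # sorted, immutable (key, value) tuples; newest last
        self.hlog = []        # append-only record of every write
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Record the write in the log and in the MemStore.
        self.hlog.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Dump the MemStore into a new sorted, immutable file and start afresh.
        self.hfiles.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        # The newest data wins: MemStore first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


region = ToyRegion(flush_threshold=2)
region.put("a", "1")
region.put("m", "2")   # MemStore reaches the threshold and is flushed
region.put("a", "3")   # newer value for "a" stays in the MemStore
```

After these calls the region holds one flushed file covering [a; m], and a read of "a" returns the newer value from the MemStore while "m" is served from the file.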

Page 16: Derby - HBase integration

▪ We reused several Derby components
  ▪ JDBC driver
  ▪ SQL Parser/Planner/Optimizer
  ▪ In-memory data formats
  ▪ Bytecode generation
▪ Developed some custom solutions
  ▪ TEMP table for transient data (joins, aggregates, etc.)
  ▪ Task framework (using HBase’s coprocessors)
  ▪ Connection pooling
▪ Switched Derby’s datastore for HBase
  ▪ Primary Keys and Indexes make use of HBase’s sorting order
  ▪ Removing Derby’s assumptions about running on a single machine...
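For primary keys and indexes to benefit from HBase's byte ordering, key values have to be encoded so that byte order matches value order. The sketch below shows one common way to do that for signed 32-bit integers (flip the sign bit, store big-endian). It is an assumption for illustration only; the slides do not describe Splice Machine's actual key encoding, and `encode_int_key` / `decode_int_key` are hypothetical names.

```python
import struct


def encode_int_key(n: int) -> bytes:
    """Encode a signed 32-bit int so that byte order matches numeric order."""
    # Adding 2**31 flips the sign bit; big-endian keeps the most
    # significant byte first, which is what lexicographic order compares.
    return struct.pack(">I", n + 2**31)


def decode_int_key(b: bytes) -> int:
    return struct.unpack(">I", b)[0] - 2**31


values = [5, -1, 300, 0, -70000]
encoded_sorted = sorted(encode_int_key(v) for v in values)   # byte order
decoded = [decode_int_key(b) for b in encoded_sorted]

# A plain two's-complement encoding would not work: -1 sorts after 5.
naive_breaks_order = struct.pack(">i", -1) > struct.pack(">i", 5)
```

With an encoding like this, a range predicate on the key column can be answered with a single scan(start, stop) over the encoded bounds.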

Page 17: Splice Machine 1.x

▪ Great for
  ▪ Operational workloads
  ▪ Replacing non-scalable RDBMS solutions
▪ SQL support
  ▪ SQL 99, Indexes, Triggers, Foreign Keys, cost-based optimizer...

Page 18: Splice Machine 1.x

▪ Great for
  ▪ Operational workloads
  ▪ Replacing non-scalable RDBMS solutions
▪ SQL support
  ▪ SQL 99, Indexes, Triggers, Foreign Keys, cost-based optimizer...
▪ But...
  ▪ Struggled on analytical queries
  ▪ HBase’s compactions created instabilities
  ▪ Minimum latency was too high (due to Task Framework)

Page 20: 2.0: Spark introduction, challenges and wins

Page 21: Spark Integration

▪ Challenging but natural
  ▪ Matched tree of database operators with RDD transformations

Operator tree and the matching RDD calls (from the diagram):

▪ Aggregate: reduceByKey()
▪ Join: join()
▪ Scan + Restriction: newAPIHadoopRDD() filter()
▪ Scan: newAPIHadoopRDD()
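To make the operator-to-transformation matching concrete, here is a small sketch that evaluates the same tree shape (Aggregate over a Join of a restricted Scan and a plain Scan). It uses plain Python functions as stand-ins for the RDD calls so that it runs without a Spark cluster; the table contents and function names are hypothetical.

```python
from collections import defaultdict


def scan(table):                     # stands in for newAPIHadoopRDD()
    return iter(table)


def restrict(rows, predicate):       # stands in for filter()
    return (r for r in rows if predicate(r))


def join(left, right):               # stands in for join() on (key, value) pairs
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return ((k, (lv, rv)) for k, lv in left for rv in index.get(k, ()))


def reduce_by_key(pairs, fn):        # stands in for reduceByKey()
    acc = {}
    for k, v in pairs:
        acc[k] = fn(acc[k], v) if k in acc else v
    return acc


# Hypothetical tables: orders are (customer_id, amount), customers are (id, country).
orders = [(1, 20), (2, 5), (1, 30), (3, 50), (2, 15)]
customers = [(1, "ES"), (2, "US"), (3, "ES")]

# SELECT country, SUM(amount) FROM orders JOIN customers ... WHERE amount > 10 GROUP BY country
big_orders = restrict(scan(orders), lambda r: r[1] > 10)        # Scan + Restriction
joined = join(big_orders, scan(customers))                       # Join
by_country = ((country, amount) for _, (amount, country) in joined)
totals = reduce_by_key(by_country, lambda a, b: a + b)           # Aggregate
```

Each node of the operator tree becomes one call in the chain, which is why the slide describes the mapping as natural.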

Page 22: Unified API

▪ Abstracted away the Spark API
▪ Two implementations
  ▪ In-memory using Guava’s FluentIterable APIs
  ▪ Distributed using Spark
▪ SQL operations have a single implementation
▪ In-memory use case:
  ▪ OLTP workloads
  ▪ Very low latency
  ▪ Bring data in, perform computation locally
  ▪ Anti-pattern in distributed systems, but it works
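The idea of one operator implementation running on two backends can be sketched as follows. This is a minimal illustration with hypothetical names (`DataSet`, `LocalDataSet`, `PartitionedDataSet`); the partitioned class only emulates a distributed backend with lists of rows and is not Spark, and none of this is Splice Machine's actual interface.

```python
from abc import ABC, abstractmethod
from itertools import chain


class DataSet(ABC):
    """Small abstraction the SQL operators are written against."""

    @abstractmethod
    def map(self, fn): ...

    @abstractmethod
    def filter(self, pred): ...

    @abstractmethod
    def collect(self): ...


class LocalDataSet(DataSet):
    """In-memory backend: lazy iterators, no scheduling overhead."""

    def __init__(self, rows):
        self._rows = rows

    def map(self, fn):
        return LocalDataSet(map(fn, self._rows))

    def filter(self, pred):
        return LocalDataSet(filter(pred, self._rows))

    def collect(self):
        return list(self._rows)


class PartitionedDataSet(DataSet):
    """Stand-in for a distributed backend: work is applied per partition."""

    def __init__(self, partitions):
        self._partitions = [list(p) for p in partitions]

    def map(self, fn):
        return PartitionedDataSet([fn(r) for r in p] for p in self._partitions)

    def filter(self, pred):
        return PartitionedDataSet([r for r in p if pred(r)] for p in self._partitions)

    def collect(self):
        return list(chain.from_iterable(self._partitions))


def project_restrict(ds: DataSet) -> DataSet:
    """One 'SQL operation', written once against the abstraction."""
    return ds.filter(lambda row: row["qty"] > 1).map(lambda row: row["item"])


rows = [{"item": "a", "qty": 1}, {"item": "b", "qty": 2}, {"item": "c", "qty": 3}]
local_result = project_restrict(LocalDataSet(rows)).collect()
partitioned_result = project_restrict(PartitionedDataSet([rows[:2], rows[2:]])).collect()
```

Both backends return the same rows; the caller picks the backend, so a short OLTP query can take the low-latency local path while a large analytical query goes to the distributed one.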

Page 23: Spark Integration Benefits

▪ Got rid of TEMP table
  ▪ Spark maintains temporary data in memory
▪ Got rid of Task Framework
  ▪ Spark performs the same job, less complexity
▪ Resource isolation
  ▪ HBase and Spark in separate processes
  ▪ Analytical queries have less impact on HBase stability

Page 24: Integration Problems

▪ Many serialization boundaries: poor performance
  ▪ HDFS -> HBase -> Spark -> Client
▪ Task granularity
  ▪ Too coarse
▪ Multiple Spark contexts
  ▪ One per HRegionServer
▪ Derby legacy issues
  ▪ Custom datatypes
  ▪ Thread context assumptions

Page 25: Solving: serialization boundaries

▪ Remove serialization boundaries
▪ Hybrid scanners:
  ▪ Custom InputFormat that reads HFiles directly from HDFS into Spark
  ▪ Merges those values with a fast scanner on the MemStore
  ▪ Most data: HDFS -> Spark
  ▪ Small part: HBase (in-memory) -> Spark
▪ Requires some hooks in HBase
  ▪ Compactions remove HFiles, Spark might be reading them
  ▪ Flushes add HFiles
▪ Much better read performance
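The merge step of a hybrid scanner can be sketched as a k-way merge of sorted streams in which the MemStore takes precedence over file data. This is a simplified sketch, assuming each source is already sorted by key and that files are passed oldest first; `hybrid_scan` is a hypothetical function, and the real scanner reads HFiles from HDFS rather than Python lists.

```python
import heapq


def hybrid_scan(hfiles, memstore):
    """Merge sorted (key, value) streams into one sorted stream.

    On duplicate keys the MemStore wins, then the newest file
    (hfiles is ordered oldest to newest).
    """
    sources = list(hfiles) + [sorted(memstore.items())]
    # Tag every entry with the negated source index so that, for equal keys,
    # entries from newer sources sort first.
    tagged = [
        [(key, -idx, value) for key, value in src]
        for idx, src in enumerate(sources)
    ]
    last_key = object()   # sentinel that never equals a real key
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key


file_1 = [("a", "old"), ("f", "old")]
file_2 = [("f", "newer"), ("z", "newer")]
memstore = {"a": "newest", "m": "newest"}

merged = list(hybrid_scan([file_1, file_2], memstore))
```

Because `heapq.merge` is lazy, the bulk of the data streams straight from the files while only the small in-memory part comes from the MemStore, which mirrors the split described on the slide.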

Page 26: Solving: task granularity

▪ Increase task granularity
▪ HTableInputFormat default is:
  ▪ 1 region = 1 partition
  ▪ Each region could be 1 GB or more
▪ SpliceInputFormat subdivides regions into blocks (default 32 MB)
  ▪ Better parallelism
  ▪ Better performance
▪ This also needs hooks in HBase (coprocessors)
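The effect on parallelism is simple arithmetic. The sketch below assumes a region is cut purely by size into equal blocks; `split_count` is a hypothetical helper, and how the actual input format chooses its split points is not described in the slides.

```python
import math

MB = 1024 * 1024


def split_count(region_bytes: int, block_bytes: int = 32 * MB) -> int:
    """Number of partitions if a region is cut into blocks of block_bytes."""
    return max(1, math.ceil(region_bytes / block_bytes))


one_gb_region = 1024 * MB
partitions_per_region = split_count(one_gb_region)
```

With the default block size, a 1 GB region yields 32 partitions instead of 1, so 32 tasks can work on it at once.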

Page 27: Solving: multiple Spark contexts

▪ Single shared Spark context
▪ JobServer wasn’t good enough
  ▪ It would become a bottleneck, results would go through it
▪ Custom JobServer (called OLAPServer)
  ▪ Single Spark context on this server
  ▪ Currently colocated with the HMaster (fault tolerant for free)
  ▪ Makes Spark jobs stream results directly to the client
  ▪ Runs several partitions in parallel
  ▪ Starts streaming as soon as there’s data

Page 28: JobServer vs OLAPServer

[Timeline diagram; time runs downward]

JobServer:
▪ Start partition 1
▪ Next row
▪ Next row
▪ …
▪ End partition 1, send
▪ Start partition 2
▪ Next row
▪ Next row
▪ …
▪ End partition 2, send

Client:
▪ Run partition 1
▪ Get results
▪ Run partition 2
▪ Get results

While the JobServer is computing a partition, the client is blocked waiting for more data.

Page 29: JobServer vs OLAPServer

[Timeline diagram; time runs downward]

JobServer:
▪ Start partition 1
▪ Next row
▪ Next row
▪ …
▪ End partition 1, send
▪ Start partition 2
▪ Next row
▪ Next row
▪ …
▪ End partition 2, send

Client (JobServer):
▪ Run partition 1
▪ Get results
▪ Run partition 2
▪ Get results

OLAPServer:
▪ Start partition 1, 2, 3
▪ Get and send row
▪ Get and send row
▪ …
▪ End partition 1
▪ Start partition 4
▪ Get and send row
▪ Get and send row
▪ …
▪ End partition 2, 3
▪ Start partition 5, 6

Client (OLAPServer):
▪ Run partition 1, 2, 3
▪ Get result
▪ Get result
▪ Run partition 4
▪ Get result
▪ Get result
▪ Run partition 5, 6
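The difference between the two timelines can be sketched with generators. This is an illustration only: `jobserver_style` buffers a whole partition before sending anything, while `olap_style` keeps up to `parallelism` partitions open and forwards each row as soon as it is produced. All names are hypothetical, and the round-robin loop merely emulates parallel partitions within a single thread.

```python
from collections import deque


def compute_partition(pid, n_rows, log):
    """Produce the rows of one partition, logging each unit of work."""
    for i in range(n_rows):
        log.append(("compute", pid, i))
        yield (pid, i)


def jobserver_style(partitions, log):
    # Finish a whole partition, then send it; the client waits meanwhile.
    for pid, n_rows in partitions:
        rows = list(compute_partition(pid, n_rows, log))
        for row in rows:
            yield row


def olap_style(partitions, parallelism, log):
    # Keep several partitions running and stream rows as they appear.
    pending = deque(partitions)
    active = deque()
    while pending or active:
        while pending and len(active) < parallelism:
            pid, n_rows = pending.popleft()
            active.append(compute_partition(pid, n_rows, log))
        gen = active.popleft()
        try:
            row = next(gen)
        except StopIteration:
            continue              # partition finished; a pending one may start
        active.append(gen)
        yield row


parts = [(1, 2), (2, 2), (3, 2), (4, 2)]   # (partition id, number of rows)
batch_log, stream_log = [], []
first_batch_row = next(jobserver_style(parts, batch_log))
first_stream_row = next(olap_style(parts, 3, stream_log))
```

Both variants deliver the same first row, but the batch variant has already computed the whole first partition by then, whereas the streaming variant has computed a single row.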

Page 30: Single shared Spark context

[Figure: architecture diagram with a single shared Spark context]

Page 31: Solving: Derby legacy issues

▪ Custom datatypes:
  ▪ Custom Kryo serializers for Derby objects
▪ Thread contexts
  ▪ Not completely solved
  ▪ Use TaskContext.addTaskCompletionListener() to clean up after ourselves
  ▪ Still finding resource leaks from time to time...
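The cleanup pattern behind a task completion listener can be shown in isolation. The sketch below is a plain-Python stand-in, not Spark's API: `ToyTaskContext`, `add_task_completion_listener` and `run_task` are hypothetical names that only mirror the idea of registering callbacks that run when a task ends, whether it succeeded or failed.

```python
class ToyTaskContext:
    """Illustrative per-task context that collects completion callbacks."""

    def __init__(self):
        self._listeners = []

    def add_task_completion_listener(self, fn):
        self._listeners.append(fn)

    def complete(self):
        for fn in self._listeners:
            fn()


def run_task(task):
    ctx = ToyTaskContext()
    try:
        return task(ctx)
    finally:
        ctx.complete()   # runs whether the task returned or raised


open_resources = set()


def good_task(ctx):
    open_resources.add("thread-local state")
    ctx.add_task_completion_listener(lambda: open_resources.discard("thread-local state"))
    return 42


def failing_task(ctx):
    open_resources.add("scanner")
    ctx.add_task_completion_listener(lambda: open_resources.discard("scanner"))
    raise RuntimeError("task failed")


result = run_task(good_task)
try:
    run_task(failing_task)
    failed = False
except RuntimeError:
    failed = True
```

Any resource that is acquired without registering a matching listener stays behind after the task, which is one way the leaks mentioned on the slide can occur.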

Page 32: Other Spark goodies

▪ HBase compactions in Spark:
  ▪ HBase compactions can be expensive
    ▪ Reading and writing lots of data
    ▪ If they happen in the HBase JVM they can kill OLTP performance
  ▪ We made it possible to run them in Spark
    ▪ Maintaining data locality
    ▪ Scheduled among other jobs
  ▪ Fallback to HBase if the Spark scheduler doesn’t have resources
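A compaction rewrites several sorted files into one, keeping the newest value per key. The sketch below pairs that merge with the fallback decision from the slide. It is a toy under stated assumptions: files are lists of (key, value) pairs ordered oldest to newest, and `spark_has_resources` is a hypothetical flag standing in for whatever the scheduler reports.

```python
def compact(hfiles):
    """Merge sorted files (oldest first) into one sorted file, newest value wins."""
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)      # later (newer) files overwrite older values
    return sorted(merged.items())


def run_compaction(hfiles, spark_has_resources):
    # Prefer Spark so the work stays out of the HBase JVM; otherwise fall back.
    where = "spark" if spark_has_resources else "hbase"
    return where, compact(hfiles)


file_1 = [("a", 1), ("p", 1)]
file_2 = [("f", 2), ("p", 2), ("z", 2)]

where, compacted = run_compaction([file_1, file_2], spark_has_resources=True)
fallback_where, _ = run_compaction([file_1, file_2], spark_has_resources=False)
```

The output file is the same wherever the merge runs; only the process that pays the I/O and CPU cost changes.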

Page 33: Other Spark goodies

▪ Integration with Spark Streaming:
  ▪ We can ingest data directly from Spark Streaming
  ▪ Easy to write to Splice Machine from Kafka through it

Page 34: Future work and roadmap

Page 35: Future Spark work

▪ Move to DataFrame APIs
  ▪ Catalyst optimizer
  ▪ Whole-stage code generation (better than Derby’s codegen)
  ▪ Already transitioned some operations
  ▪ Requires good UnsafeRow support
▪ UnsafeRow
  ▪ Compact in-memory representation
    ▪ Rows are serialized in a contiguous block of memory
  ▪ Better memory management
  ▪ Less GC time

Page 36: Future Spark work

[Figure: current row layout vs. UnsafeRow]

▪ Current: Row1 holds a DataDes[] array whose entries point to separate objects (SQLDate, SQLInteger, SQLInteger, SomeOtherField)
▪ UnsafeRow: rows live in a single MemoryBlock (byte[]); Row1 is labeled 0, 100 and Row2 is labeled 100, 50
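The contrast in the figure, one object per field versus rows packed into a single byte block, can be sketched with Python's `struct` module. This is a deliberately simplified layout for illustration: fixed-width 32-bit fields only and no null tracking, unlike Spark's real UnsafeRow format. The schema and the helper names `pack_rows` and `read_field` are hypothetical.

```python
import struct

# Hypothetical schema: three 32-bit integers per row (date as yyyymmdd, two ints).
ROW = struct.Struct("<iii")


def pack_rows(rows):
    """Serialize all rows back to back into one contiguous block."""
    block = bytearray(ROW.size * len(rows))
    for i, row in enumerate(rows):
        ROW.pack_into(block, i * ROW.size, *row)
    return block


def read_field(block, row_idx, field_idx):
    """Read one field by offset arithmetic, without materializing the row."""
    offset = row_idx * ROW.size + 4 * field_idx
    return struct.unpack_from("<i", block, offset)[0]


rows = [(20170112, 7, 42), (20170113, 8, 43)]
block = pack_rows(rows)
value = read_field(block, 1, 2)
```

Two rows occupy 24 bytes in a single allocation, so the garbage collector tracks one buffer rather than one object per field.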

Page 37: Future Spark work

▪ Columnar storage format
▪ We already have ‘Pinned’ tables:
  ▪ Create Parquet snapshot of table
  ▪ Get columnar access
  ▪ Good for read-only data
▪ Planning on maintaining dual representation
  ▪ Row-oriented for recently written values
  ▪ Column-oriented for historical data
  ▪ Merge those on the fly

Page 38: Future Spark work

▪ Better Spark shell integration
▪ Our SparkContext resides in the OLAPServer
▪ Getting data to a Spark shell incurs a serialization boundary
  ▪ From Splice’s SparkContext to the shell context
▪ We want to achieve transparent conversion
  ▪ ResultSet -> DataFrame

Page 39: 2.5 Roadmap

▪ Performance increases across the board
  ▪ TPC-C, TPC-H, Backup/Restore, ODBC driver…
▪ Incremental backup
▪ Native PL/SQL support (in Beta)
  ▪ No excuses left for not migrating those Oracle databases
▪ Client load balancing/failover
  ▪ Via HAProxy
▪ Statistics improvements
  ▪ Histograms, sketching libraries
▪ RDD caching (pinning)

Page 40: Thank You!

We are hiring :-)

Page 43: TPCH 100 Load Throughput

[Figure: TPCH 100 load throughput chart]

Page 44: TPCH 100 Query Times (seconds)

Column names are inferred from the issue-ID prefixes; the slide's own header row was lost in extraction.

Query  Phoenix        Trafodion        Splice Machine
1      395            TRAFODION-2237   99
2      PHOENIX-3322   516              44
3      PHOENIX-3322   TRAFODION-2237   126
4      PHOENIX-3322   TBD              133
5      PHOENIX-3322   TBD              192
6      74             3178             38
7      PHOENIX-3322   4442             220
8      PHOENIX-3322   TRAFODION-2239   620
9      PHOENIX-3322   941              273
10     PHOENIX-3322   TRAFODION-2241   101
11     PHOENIX-3317   463              56

Page 45: TPCH 100 Query Times (seconds)

Column names are inferred from the issue-ID prefixes; the slide's own header row was lost in extraction.

Query  Phoenix        Trafodion        Splice Machine
12     379            TBD              85
13     PHOENIX-3318   TBD              71
14     PHOENIX-3322   TBD              50
15     PHOENIX-3319   TBD              102
16     PHOENIX-3322   TBD              33
17     PHOENIX-3322   TBD              929
18     PHOENIX-3322   TBD              SPLICE-34
19     PHOENIX-3322   TBD              57
20     PHOENIX-3320   TBD              SPLICE-410
21     PHOENIX-3321   TBD              479
22     PHOENIX-3322   TBD              219