torodb: next generation java, mongodb-compatible, nosql & sql database
ToroDB:
Next Generation Java, MongoDB-compatible
NoSQL & SQL Database
Álvaro Hernández <[email protected]>
ToroDB @NoSQLonSQL
About *8Kdata*
● Research & Development in databases
● Consulting, Training and Support in PostgreSQL
● Java Developers, JavaSpecialists.eu, JCrete.org
● About myself: CTO at 8Kdata: @ahachete, http://linkd.in/1jhvzQ3
www.8kdata.com
Say you want…
A database with:
● Great functionality
● Consistency (ACID)
● Reliability
● SQL
… and then also want
A database with:
● NoSQL (MongoDB)
● “Schema-less”
● Scalability
Fear no more! You can now
have both SQL and NoSQL!
DEMO!
ToroDB in one slide
● Document-oriented, JSON, NoSQL db
● Open source (AGPL). Written in Java
● MongoDB compatibility (wire protocol level)
● Uses PostgreSQL as a storage backend
Mapping unstructured data to relational
ToroDB storage internals
{
  "name": "ToroDB",
  "data": { "a": 42, "b": "hello world!" },
  "nested": {
    "j": 42,
    "deeper": { "a": 21, "b": "hello" }
  }
}
ToroDB storage internals
The document is split into the following subdocuments:
{ "name": "ToroDB", "data": {}, "nested": {} }
{ "a": 42, "b": "hello world!"}
{ "j": 42, "deeper": {}}
{ "a": 21, "b": "hello"}
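The split shown above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not ToroDB's actual D2R code: the class and method names (SubdocumentSplitter, split) are hypothetical, documents are modelled as plain Map<String, Object>, and arrays are ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of ToroDB. */
final class SubdocumentSplitter {

    /**
     * Flattens a nested document into a list of subdocuments, parent first.
     * Inside each subdocument, a nested object is replaced by an empty map,
     * mirroring the {} placeholders on the slide.
     */
    static List<Map<String, Object>> split(Map<String, Object> doc) {
        List<Map<String, Object>> out = new ArrayList<>();
        collect(doc, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void collect(Map<String, Object> doc, List<Map<String, Object>> out) {
        Map<String, Object> flat = new LinkedHashMap<>();
        out.add(flat); // the parent is emitted before its children
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (e.getValue() instanceof Map) {
                // Keep the key, drop the content, and recurse into the child.
                flat.put(e.getKey(), new LinkedHashMap<String, Object>());
                collect((Map<String, Object>) e.getValue(), out);
            } else {
                flat.put(e.getKey(), e.getValue());
            }
        }
    }
}
```

Applied to the example document, the sketch yields four subdocuments in the same order as listed above: the root, "data", "nested" and "deeper".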
ToroDB storage internals
select * from demo.t_3
┌─────┬───────┬────────────────────────────┬────────┐
│ did │ index │ _id                        │ name   │
├─────┼───────┼────────────────────────────┼────────┤
│ 0   │ ¤     │ \x5451a07de7032d23a908576d │ ToroDB │
└─────┴───────┴────────────────────────────┴────────┘
select * from demo.t_1
┌─────┬───────┬────┬──────────────┐
│ did │ index │ a  │ b            │
├─────┼───────┼────┼──────────────┤
│ 0   │ ¤     │ 42 │ hello world! │
│ 0   │ 1     │ 21 │ hello        │
└─────┴───────┴────┴──────────────┘
select * from demo.t_2
┌─────┬───────┬────┐
│ did │ index │ j  │
├─────┼───────┼────┤
│ 0   │ ¤     │ 42 │
└─────┴───────┴────┘
ToroDB storage internals
select * from demo.structures
┌─────┬────────────────────────────────────────────────────────────────────────────┐
│ sid │ _structure                                                                 │
├─────┼────────────────────────────────────────────────────────────────────────────┤
│ 0   │ {"t": 3, "data": {"t": 1}, "nested": {"t": 2, "deeper": {"i": 1, "t": 1}}} │
└─────┴────────────────────────────────────────────────────────────────────────────┘
select * from demo.root;
┌─────┬─────┐
│ did │ sid │
├─────┼─────┤
│ 0   │ 0   │
└─────┴─────┘
How data is stored in schema-less
Data normalization
This is how we store in ToroDB
Advantages over MongoDB
ToroDB: native SQL
Mix-and-match relational & NoSQL
● Use the same database for both your relational data and ToroDB
● Just use separate schemas (if you will)
● Don't write to ToroDB data or metadata tables
● Query with SQL, do joins, whatever!
Atomic operations
● MongoDB has no support for atomic bulk insert/update/delete operations
● Not even with $isolated:
“Prevents a write operation that affects multiple documents from yielding to other reads or writes […] You can ensure that no client sees the changes until the operation completes or errors out. The $isolated isolation operator does not provide “all-or-nothing” atomicity for write operations.”
http://docs.mongodb.org/manual/reference/operator/update/isolated/
“Clean” reads
Oh really?
“Clean” reads
http://docs.mongodb.org/manual/reference/write-concern/#read-isolation-behavior
“MongoDB will allow clients to read the results of a write operation before the write operation returns.”
“If the mongod terminates before the journal commits, even if a write returns successfully, queries may have read data that will not exist after the mongod restarts.”
Thus, MongoDB suffers from dirty reads. But let's call them just “tainted reads”.
“Clean” reads
What about $snapshot? Nope:
“The snapshot() does not guarantee that the data returned by the query will reflect a single moment in time nor does it provide isolation from insert or delete operations.”
http://docs.mongodb.org/manual/faq/developers/#faq-developers-isolate-cursors
Cursors in ToroDB run in repeatable read, read-only mode:
globalCursorDataSource.setTransactionIsolation("TRANSACTION_REPEATABLE_READ");
globalCursorDataSource.setReadOnly(true);
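The two setters above look like connection-pool configuration. For comparison, the same settings can be applied to a single connection with plain JDBC. The sketch below is an assumption-laden illustration, not ToroDB code: the class and method names are hypothetical, and disabling auto-commit is an added assumption so that all reads of one cursor share a single transaction.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical helper for illustration; not part of ToroDB. */
final class CursorConnections {

    /** Prepares a JDBC connection for repeatable-read, read-only cursor use. */
    static void configureForCursor(Connection connection) throws SQLException {
        // Assumption: every fetch of the cursor runs inside one transaction.
        connection.setAutoCommit(false);
        // Standard JDBC constant for the REPEATABLE READ isolation level.
        connection.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
        // Hint to the driver and database that no writes will be issued.
        connection.setReadOnly(true);
    }
}
```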
Replication &
Horizontal scalability (aka sharding)
ToroDB v0.4
● ToroDB works as a secondary (slave) of a MongoDB master (or of another slave, with chained replication)
● Implements the full replication protocol (not as an oplog tailable query)
● Replicates from MongoDB to PostgreSQL
Write scalability (sharding)
● MongoDB's sharding API not implemented yet (roadmap: ToroDB 0.8)
● Will use MongoDB's mongos without modification, as well as config servers
● Currently we implement sharding at the db level, using backends such as Greenplum
ToroDB
The software
The software
Written in Java. v0.40 requires Java 7; 1.0 will require Java 8.
Tested with Oracle and IBM JVMs. Anyone from Azul here today? ;)
Distributed as a JAR file (actually, wrapped with shell executables). Future: also an EAR to deploy.
Standing on the shoulders of giants
And also PostgreSQL, Greenplum, JDBC
Very modular source code
● The app: 20+ modules (Maven)
● Some of them are individually reusable
● Several abstraction layers:
➔ D2R (Document 2 Relational)
➔ KVDocument (KV docs abstraction)
➔ Database/backend (relational)
MongoWP
● MongoWP is our implementation of MongoDB's wire protocol
● Based on Netty, an excellent, async and high performance NIO framework
● Callback interface for any MongoDB-based “middleware” implementation (ToroDB, proxy...)
Architecture
Executor engines
Performance
OLTP Benchmark: YCSB
● Amazon c3.8xlarge
➔ 32 virtual CPUs
➔ 60 GB RAM
➔ 2 x 320 GB SSD
● YCSB 0.5.0, only inserts, 10 minutes
● WriteConcern {w:1, fsync: true}
● Batch size 1000, 1 and 4 threads
● MongoDB 3.2 WiredTiger
● ToroDB 0.40, Oracle Java 8, PostgreSQL 9.5 (shared_buffers: 15GB, effective_cache: 45GB)
OLTP Benchmark: YCSB
Data Analytics Benchmark
● Amazon reviews dataset: “Image-based recommendations on styles and substitutes”, J. McAuley, C. Targett, J. Shi, A. van den Hengel, SIGIR, 2015
● AWS c4.xlarge (4 vCPU, 8GB RAM), 4K IOPS SSD EBS
● 4x shards, 3x config; 4x segments GP
● 83M records, 65GB plain JSON
Disk usage
Mongo 3.0, WT, Snappy
GP columnar, zlib level 9
[Chart: Storage requirements, MongoDB vs ToroDB on Greenplum. Bars for table size, index size and total size, in bytes (axis from 0 to 80,000,000,000). Series: Mongo, ToroDB on GP.]
Queries: which one is easier?
SELECT count(distinct("reviewerID")) FROM reviews;
db.reviews.aggregate([
  { $group: { _id: "$reviewerID" } },
  { $group: { _id: 1, count: { $sum: 1 } } }
])
Queries: which one is easier?
SELECT "reviewerName", count(*) AS reviews
FROM reviews
GROUP BY "reviewerName"
ORDER BY reviews DESC
LIMIT 10;
db.reviews.aggregate([
  { $group: { _id: '$reviewerName', r: { $sum: 1 } } },
  { $sort: { r: -1 } },
  { $limit: 10 }
], { allowDiskUse: true })
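Query times
3 different queries. Q3 on MongoDB: aggregate fails
[Chart: Query duration (s), MongoDB vs ToroDB on Greenplum. Series: MongoDB, ToroDB on GP, speedup; axis in seconds, from 0 to 1200. Data labels, as far as they can be recovered from the slide: MongoDB 969, 1007, 0; ToroDB on GP 35, 13, 31; speedup 27.95, 74.87, 0.]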
Tips & Tricks
Lessons Learned
Visitor pattern for document manipulation
● Lots of types, and values of each of those types, to manage: strings, integers, arrays
● Lots of cases we have to handle in the code:
➔ transformation from document to table data structure
➔ transformation to internal query language
➔ ...
Visitor pattern for document manipulation
● Smaller methods
● Somewhere in the deepest class you can see a huge if {} else if {} ... else {}
● Safely add new types
Compiler will tell us if we forget to implement some visitor
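A minimal sketch of the visitor idea, assuming a tiny value model with only strings, integers and arrays. All type names here (DocValue, DocValueVisitor, JsonRenderer, ...) are hypothetical and chosen for illustration; they are not ToroDB's classes. Adding a new value type means adding a method to the visitor interface, and every visitor that does not implement it then fails to compile.

```java
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical value model for illustration; not ToroDB's classes. */
interface DocValue {
    <R> R accept(DocValueVisitor<R> visitor);
}

/** One method per value type: a new type forces every visitor to handle it. */
interface DocValueVisitor<R> {
    R visitString(StringValue value);
    R visitInteger(IntegerValue value);
    R visitArray(ArrayValue value);
}

final class StringValue implements DocValue {
    final String value;
    StringValue(String value) { this.value = value; }
    @Override public <R> R accept(DocValueVisitor<R> v) { return v.visitString(this); }
}

final class IntegerValue implements DocValue {
    final int value;
    IntegerValue(int value) { this.value = value; }
    @Override public <R> R accept(DocValueVisitor<R> v) { return v.visitInteger(this); }
}

final class ArrayValue implements DocValue {
    final List<DocValue> elements;
    ArrayValue(List<DocValue> elements) { this.elements = elements; }
    @Override public <R> R accept(DocValueVisitor<R> v) { return v.visitArray(this); }
}

/** Example visitor: renders a value as JSON-like text (no string escaping). */
final class JsonRenderer implements DocValueVisitor<String> {
    @Override public String visitString(StringValue v) { return "\"" + v.value + "\""; }
    @Override public String visitInteger(IntegerValue v) { return Integer.toString(v.value); }
    @Override public String visitArray(ArrayValue v) {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (DocValue element : v.elements) {
            joiner.add(element.accept(this)); // dispatch on the element's type
        }
        return joiner.toString();
    }
}
```

Each transformation (to table rows, to the internal query language, and so on) would be one more small visitor class instead of another branch in an if/else chain.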
Tools used to monitor performance
Oracle Java Mission Control
● Great tool in general, low impact on performance. Gives A LOT of information on memory allocation, exceptions thrown, etc.
● But quite bad at measuring the time spent in methods, as it ignores time spent in native code and IO
● Very coarse-grained
Tools used to monitor performance
VisualVM
● Very fine-grained
● By default measures time spent in native code and IO
● Impact on performance:
➔ Configurable, but high in general
➔ The performance impact seems to be heterogeneous (some methods are penalized more than others)
Document keys & maps
● ToroDB uses HashMaps. Keys are the JSON keys
● When there is a lookup on a HashMap, equals must be executed.
● Each key is a String, and String#equals is O(1) when both Strings are the same object, but O(n) when both Strings are equal but not the same object.
● As a result, we were spending much more time than expected on HashMap lookups
● We use a pool of keys that guarantees that if two keys are equal, they are the same object.
● Cons: some time is spent on the pool of keys, as it is basically a map.
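A pool of keys like the one described can be sketched with a concurrent map that stores each key as its own canonical instance. This is an illustrative assumption about how such a pool might look, not ToroDB's implementation; the class name KeyPool and its method are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical key pool for illustration; not ToroDB's implementation. */
final class KeyPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /**
     * Returns the canonical instance for the given key: equal keys always
     * map to the same object, so a later equals() can succeed on identity.
     */
    String canonical(String key) {
        String existing = pool.putIfAbsent(key, key);
        return existing != null ? existing : key;
    }
}
```

The lookup in the pool is itself a map access, which is the cost mentioned in the last bullet. The JDK's String.intern() offers a comparable guarantee through a JVM-wide pool.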
Dealing with Memory Pressure
● ToroDB has to deal with memory pressure when the MongoDB clients produce requests faster than the SQL backend can handle them.
● This is especially important when the client is using the async drivers
● Ideal solution: make the backend faster
● Especially by adding async behaviour
● But it requires a new non-JDBC driver => Phoebe
● Practical solution: use a back pressure mechanism to make the client only as fast as the backend can be.
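One common way to build such a back pressure mechanism is to cap the number of in-flight requests with a semaphore, so the thread accepting client requests blocks once the backend falls behind. The sketch below shows that general technique only; whether ToroDB does it this way is not stated in the slides, and the class name BackPressureGate is hypothetical.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.Semaphore;

/** Hypothetical back pressure gate for illustration; not ToroDB code. */
final class BackPressureGate {
    private final Semaphore permits;

    BackPressureGate(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /**
     * Hands a request to the backend executor. Blocks the caller while
     * maxInFlight requests are already queued or running, which bounds the
     * memory held by pending requests.
     */
    void submit(Executor backend, Runnable request) {
        permits.acquireUninterruptibly();
        try {
            backend.execute(() -> {
                try {
                    request.run();
                } finally {
                    permits.release(); // free the slot once the request is done
                }
            });
        } catch (RuntimeException rejected) {
            permits.release(); // the task never ran, so release its slot here
            throw rejected;
        }
    }

    /** Number of requests that can still be submitted without blocking. */
    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Blocking the reader thread means the client's socket stops being drained, so a fast client ends up throttled to the backend's pace.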
Chasing performance problems
● It is important to monitor the hotspots
● We found some parts of our code that were correct, but very inefficient.
➔ Some of them were errors (some analyses that were executed twice in different parts of the code)
➔ Some operations that we considered fast enough were executed so many times that it was critical to reimplement them in a more performant way
Download, clone, PR, star it!
https://github.com/torodb/torodb
Check our FAQ:
https://github.com/torodb/torodb/wiki/FAQ