building a database for the end of the world
TRANSCRIPT
![Page 1: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/1.jpg)
Building a Database for the End of the World
New England Java Users Group — John Hugg — November 10th, 2016
![Page 2: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/2.jpg)
Who Am I?
• Engineer #1 at
• Responsible for many poor decisions and even a few good ones.
• @johnhugg
• http://chat.voltdb.com
![Page 3: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/3.jpg)
Mission Maybepossible
• Got a paper
• Re-invent OLTP database
• Can we make it happen?
• BTW: 10X or go home
![Page 4: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/4.jpg)
Operational Databases
![Page 5: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/5.jpg)
Operational Databases
![Page 6: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/6.jpg)
2008 Time Machine
![Page 7: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/7.jpg)
2008 Assumptions
• Multicore is the future.
• CPUs are getting faster faster than cache is getting faster faster than RAM is getting faster faster than disk is getting faster.
• Specialized systems can be 10x better.
• Having lots of RAM isn’t weird.
• This “cloud” thing is a thing, but with lousy hardware.
![Page 8: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/8.jpg)
B U F F E R P O O L M A N A G E M E N T
C O N C U R R E N C Y
U S E M A I N M E M O R Y
S I N G L E T H R E A D E D
Waiting on users leaves CPU idle
Single threaded doesn’t jive with multicore world
![Page 9: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/9.jpg)
W A I T I N G O N U S E R S
• Don’t.
• External transactions control and performance are not friends with each other.
• Use server side transactional logic.
• Move the logic to the data, not the other way around.
![Page 10: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/10.jpg)
U S I N G * A L L * T H E C O R E S
• Partitioning data is a requirement for scale-out.
• Single-threaded is desired for efficiency.
• Why not partition to the core instead of the node?
• Concurrency via scheduling, not shared memory.
![Page 11: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/11.jpg)
Partition Data Serial Execution Never Block
Many CPUs across a cluster.
Each running a pipeline of work.
Do one thing after another with no
overhead.
Run code next to data so never block
for logic.
Data in memory so never block on disk.
Keep CPUs full of real work and you win.
![Page 12: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/12.jpg)
Partitioning• Two kinds of tables, partitioned and replicated.
• Partitioned tables have a column partitioning key.
• Two kinds of transactions, partitioned and global.
• Partitioned transactions are routed to the data partition they need.
• Global transactions can read and update all partitions transactionally.
Table & Index Data
Execution Engine
WorkQueue
![Page 13: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/13.jpg)
Clustering
• Partition to CPU cores. More machines = more cores.
• "Buddy up" cores across machines for HA.
• Fully synchronous replication within a cluster.
• Asynchronous replication across clusters (WAN).
• Partitioned workloads parallelize linearly.
![Page 14: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/14.jpg)
Clustering
VoltDB Cluster
Server 1
Partition 1 Partition 2 Partition 3
Server 2
Partition 4 Partition 5 Partition 6
Server 3
Partition 7 Partition 8 Partition 9
![Page 15: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/15.jpg)
Commodity Cluster Scale Out / Active-Active HA
Millions of ACID serializable operations per second
Synchronous Disk Persistence
Avg latency under 1ms, 99.999 under 50ms.
Multi-TB customers in production.
Lesson #1
![Page 16: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/16.jpg)
Lesson #2
Explaining the tech to people is not a good way to sell something.
![Page 17: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/17.jpg)
Important Point• VoltDB is not just a traditional RDBMS
with some tweaks sitting on RAM rather than spindles.
• VoltDB is weird and new and exciting and not compatible with Hibernate.
• VoltDB sounds like MemSQL, NuoDB, Clustrix, HANA or whomever on first blush, but has a really really different architecture.
![Page 18: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/18.jpg)
So we tried to sell this thing…
![Page 19: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/19.jpg)
Our Market• MySQL can’t do it.
• Analytic RDBMSs can’t do it.
• Hadoop can’t do it.
(BUT VoltDB isn’t a drop in for MySQL)
• No sprawling apps built with Hibernate.
• No websites where reads are 95%.
What’s left?
![Page 20: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/20.jpg)
B I G D ATA
Fast Data Big Data
![Page 21: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/21.jpg)
More on OLTP vs OLAP
• Nobody wants black-box state. Real-time understanding has value.
• OLTP apps smell like stream processing apps.
• Processing and state management go well together.
• By following customers, we ended up with a fantastic streaming analytics / stream processing tool.
• Strong consistency & transactions make streaming better.
At High Velocity
![Page 22: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/22.jpg)
What is Fast Data
• Digital Ad Tech
• Smart Devices / IoT / Sensors
• Financial Exchange Streams
• Telecommunications
• Online Gaming
• High Write Throughput
• Partitionable Actions
• Global Live Understanding
• Long Term Storage in HDFS or Analytic RDBMS
![Page 23: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/23.jpg)
• Global Live Understanding
Materialized Views for Live Aggregation
Special index support for ranking functions Function based indexes
• Long Term Storage in HDFS or Analytic RDBMS
“Export” to HDFS, CSV, JDBC, Specific systems
HTTP/JSON Queries for easy dashboards
Snapshot to CSV / HDFS
Full SQL support for operating on JSON docs as
column values
![Page 24: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/24.jpg)
“Export”
• Looks like write only tables in your schema.
• Each is really a persistent message queue.
• Can be connected to consumers:
• HDFS, CSV, JDBC, HTTP, Vertica Bulk
![Page 25: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/25.jpg)
What About Java?Time For Implementation Choices
![Page 26: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/26.jpg)
Time For Implementation Choices
Decision Implication
No external transaction control. Multi-statement transactions use stored procedures.
![Page 27: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/27.jpg)
Stored Procedure Needs• Easy for us (VoltDB devs) to implement
Rules out bespoke options, like our own PL-SQL or DSL
• Not slowRules out Ruby, Python, etc…
• Can’t crash the system easilyRules out C, C++, Fortran
• Familiar or easy to learn if notRules out Erlang and some weird stuff
• Has to exist in 2008Rules out Rust, Swift, Go
• Runs on Linux (in 2008)Rules out .Net languages
So:
Java, Lua, or JavaScript
![Page 28: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/28.jpg)
We Picked Java
try { Object rawResult = m_procMethod.invoke(m_procedure, paramList); results = getResultsFromRawResults(rawResult);} catch (InvocationTargetException e) { ...}
![Page 29: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/29.jpg)
We picked Java• Once we picked Java as the user stored procedure language, we
decided to implement much of the system in Java.
• 2008 Java was much more appealing than 2008 C++ to write SQL optimizers, procedure runtimes, transaction lifecycle management, networking interfaces, and all the other trappings.
• C++ & Lua or C++ & Javascript were less appealing
• Note:C++17 is much improved. Rust is interesting. Swift might be good in 2020.
![Page 30: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/30.jpg)
EXAMPLE TIME
![Page 31: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/31.jpg)
A Real User
Ed Ed’s Talk
![Page 32: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/32.jpg)
How many unique devices opened up
my app today?
![Page 33: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/33.jpg)
![Page 34: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/34.jpg)
T H E ZKSC S TA C K ?
![Page 35: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/35.jpg)
2007 Conference on Analysis of Algorithms, AofA 07 DMTCS proc. AH, 2007, 127–146
HyperLogLog: the analysis of a near-optimalcardinality estimation algorithm
Philippe Flajolet1 and Éric Fusy1 and Olivier Gandouet2 and FrédéricMeunier1
1Algorithms Project, INRIA–Rocquencourt, F78153 Le Chesnay (France)2LIRMM, 161 rue Ada, 34392 Montpellier (France)
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated toestimating the number of distinct elements (the cardinality) of very large data ensembles. Using an auxiliary memoryof m units (typically, “short bytes”), HYPERLOGLOG performs a single pass over the data and produces an estimateof the cardinality such that the relative accuracy (the standard error) is typically about 1.04/
pm. This improves on
the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64%of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond 10
9
with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally andadapts to the sliding window model.
Introduction
The purpose of this note is to present and analyse an efficient algorithm for estimating the number ofdistinct elements, known as the cardinality, of large data ensembles, which are referred to here as multisetsand are usually massive streams (read-once sequences). This problem has received a great deal of attentionover the past two decades, finding an ever growing number of applications in networking and trafficmonitoring, such as the detection of worm propagation, of network attacks (e.g., by Denial of Service),and of link-based spam on the web [3]. For instance, a data stream over a network consists of a sequenceof packets, each packet having a header, which contains a pair (source–destination) of addresses, followedby a body of specific data; the number of distinct header pairs (the cardinality of the multiset) in varioustime slices is an important indication for detecting attacks and monitoring traffic, as it records the numberof distinct active flows. Indeed, worms and viruses typically propagate by opening a large number ofdifferent connections, and though they may well pass unnoticed amongst a huge traffic, their activitybecomes exposed once cardinalities are measured (see the lucid exposition by Estan and Varghese in [11]).Other applications of cardinality estimators include data mining of massive data sets of sorts—naturallanguage texts [4, 5], biological data [17, 18], very large structured databases, or the internet graph, wherethe authors of [22] report computational gains by a factor of 500+ attained by probabilistic cardinalityestimators.1365–8050 c� 2007 Discrete Mathematics and Theoretical Computer Science (DMTCS), Nancy, France
A method of estimating cardinality with O(1) space/time.
blob = update(hashedVal, blob)
integer = estimate(blob)
A few kilobytes to get 99% accuracy.
Enter HyperLogLog
![Page 36: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/36.jpg)
App IdentifierUnique Device ID
appid = 87deviceid = 12
VoltDB Server
Stored Procedure“CountDeviceEstimate”
State“estimates”
CREATE&TABLE&estimates(&&appid&&&&&&&bigint&&&&&&&&&&NOT&NULL,&&devicecount&bigint&&&&&&&&&&NOT&NULL,&&hll&&&&&&&&&varbinary(8192)&DEFAULT&NULL,&&PRIMARY&KEY&(appid));PARTITION&TABLE&estimates&ON&COLUMN&appid;CREATE&INDEX&rank&ON&ESTIMATES&(devicecount);
![Page 37: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/37.jpg)
CountDeviceEstimate
![Page 38: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/38.jpg)
Example in the kit
![Page 39: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/39.jpg)
chat.voltdb.com@johnhugg
Example: Telco
Mobile phone is dialed.
Request sent to VoltDB to decide if it
should be let through.
Single transaction looks at state and decides if this call:
is fraudulent? is permitted under plan?
has prepaid balance to cover?
State Blacklists
Fraud Rules Billing Info
Recent Activity for both Numbers
Export to OLAP
99.999% of txns respond in 50ms
![Page 40: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/40.jpg)
chat.voltdb.com@johnhugg
Example: Micro Personalization
User clicks link on a website. This
generates a request to VoltDB.
VoltDB transaction scans a table of rules and
checks which apply to this event.
Eventually the transaction decides what
to show the user next.
That decision is exported to HDFS
Spark ML is used to look at historical data in HDFS and
generate new rules.
These rules are loaded into VoltDB every few hours.
User sees personalized
content
![Page 41: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/41.jpg)
More JavaImplementation Details
![Page 42: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/42.jpg)
Cool Java Benefit: Hot Swap Code
• Java classloaders are pretty cool.
• Where code needs to be dynamically changed, we setup one custom classloader per thread.
• Transitioning to a new Jarfile can be done asynchronously.
• Happy to talk more about this.
![Page 43: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/43.jpg)
Cool Java Benefit: Debuggers
• Can debug in Eclipse or other IDEs, stepping through user code.
![Page 44: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/44.jpg)
First Java Problem: Heap
• We’re building an in-memory database.
• Users storing 128GB of data in memory isn’t crazy.
• 128GB Java heap is no fun. Very hard to avoid long GC pauses.
• Multi-lifecycle data is the worst possible case.
![Page 45: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/45.jpg)
Java Garbage Collection
128GB Relational Data
![Page 46: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/46.jpg)
The Data Mullet Was Born
Networking, Txn-Mgmt in Java
User Procedures
Data Storage in C++
![Page 47: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/47.jpg)
The Data Mullet Was Born
Networking, Txn-Mgmt in Java
User Procedures
Data Storage in C++
2-8GB
120GB
Per transaction stuff Config + other stuff
that lasts a while
![Page 48: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/48.jpg)
Details Rathole
![Page 49: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/49.jpg)
Off-Heap Storage• Direct ByteBuffers
• Pooled Direct ByteBuffers
• A full persistence layer with good caching (possibly even with ORM)
• Use a FOSS/COTS in-memory, in-process thingy database/key-value/cache/etc...
• Build your own storage layer in native code. (a last resort)
![Page 50: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/50.jpg)
VoltDB C++ Storage Engine
• Class called VoltDBEngine manages tables, indexes, hot-snapshots, etc...
• Accepts pseudo-compiled SQL statements and modifies or queries data.
• Clear defined interface over JNI
• Java heap is 1-4GB, C++ stores up to 1TB.
![Page 51: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/51.jpg)
How to Debug• Abstract JNI interface and implement it over sockets
One mixed-lang process becomes two.
• Can use GDB/Valgrind/XCode/EclipseCDT/etc...
• If the problem only reproduces in JNI or in a distributed system, we resort too often to printf / log4j.
• Goal is to keep C++ code as simple and task-focused as possible so horrible native bugs are the exception, not the rule.
![Page 52: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/52.jpg)
How to Profile• This is the big downside to non-trivial JNI.
• Much performance tuning is generic. (auto-measure)
• oprofile/perf have gotten recently better at C++ in JNI.
• Sampling in Java gives best results, less clear with many threads.
• Profiling one thread doesn’t always inform multi-thread performance.
• Profiling release build confusing. Debug build is off.
• Isolate and micro-benchmark/micro-profile if possible.
![Page 53: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/53.jpg)
Related Problem: Serialization
• Subproblem: How do you represent a row-based, relational table in Java?
• Subproblem: Best way to serialize POJOs?
![Page 54: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/54.jpg)
How do you represent a row-based, relational table in Java?
• Array of arrays of objects is often the wrong answer.
• We serialize rows by native type to a ByteBuffer with a binary header format. Lazy de-serialization.
• Since we support variable-sized rows, we’ve made this buffer append-only.
• No great way to use a library like protobufs for this. Avro close?
![Page 55: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/55.jpg)
What about POJOs?• java.io.Serializable is slow. Needs classloading.
• java.io.Externalizable is the right idea.
• VoltDB breaks fast serializing into two steps:
• How big are you?
• Flatten to this buffer (Externalizable-style)
• Prefix with type/length indicators when needed.
• Protobufs, Avro, Thrift, MessagePack, Parquet
• JSON
![Page 56: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/56.jpg)
![Page 57: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/57.jpg)
OLTP Data Fits In Memory
• Memory is getting cheaper faster than OLTP data is growing.
• Need to split up your app though. Driven by scale pain.
• There is value is ridiculously consistent performance.
![Page 58: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/58.jpg)
IMDB vs.
Cache + K/V
• Some apps have hot-cold patterns with lots of cold data.
• NVRAM is coming.
![Page 59: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/59.jpg)
Recovery From Peers, Not Disk
• No disk persistence is a non-starter.
![Page 60: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/60.jpg)
Full Disk Persistence in 2.0
![Page 61: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/61.jpg)
Disk. Check. Now?
• Recovery from peers is actually pretty cloud-friendly.
• All nodes the same with identical failure and replacement semantics has been a big win.
![Page 62: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/62.jpg)
Cluster Native and Commodity Boxes VM and Cloud Friendly
• Clustering is hard. So so hard. Especially if you care about consistency or availability.
• Monitoring clusters is still something many users aren’t good at.
• Debugging clusters is hard, especially beyond key-value stores.
• Partitioning is getting easier to explain/sell thanks to NoSQL.
• I’m super skeptical about automated partitioners for operational work.
• Alternative is 1TB mega-machines? PCIe networks/fabrics?
• “What, you don’t have 1000 node clusters?”
![Page 63: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/63.jpg)
External Transaction Control is an Anti-Pattern
• Downside: We self-disqualify from all of the ORM apps out there.
• Upside: We self-disqualify from all of the ORM apps out there.
• Server-side logic is a really good fit for event processing and the log-structured world.
![Page 64: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/64.jpg)
Active-Active Intra-Cluster Through Deterministic Logical Replication
• V1 used clock-sync to globally order transactions.
• Basing replication on clocks was a no-go unless you’re Google.
• Sync latency was too slow.
• Now more like Raft.
• Traded a global pre-order for global post-order.
• Happy with where we ended up.
![Page 65: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/65.jpg)
OLTP Doesn’t Need Long Running Transactions
• Big engineering wins to a single-threaded SQL execution engine.
• Lots of people want long transactions, though many apps do without.
• Drives us to integrate.
• Fat fingers are problematic.
• Added ability to set timeouts globally or per-call.
• Biggest differentiator. Real transactions. Real throughput. Low Latency.
![Page 66: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/66.jpg)
An OLTP-Focused Database Needs Much Less SQL Support
• Always supported more powerful state manipulation and queries than NoSQL.
• Always got compared to mature RDBMS.
• In 2014, our SQL got rich enough to for us to switch to offense. Only took 6 years or so.
![Page 67: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/67.jpg)
Other Lessons
![Page 68: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/68.jpg)
SQL Means it’s easy to switch DBs, right? Right?
![Page 69: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/69.jpg)
Request/Response vs Log-Structured
![Page 70: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/70.jpg)
Hybrids a thing?
Kappa Architecture is kind of a thing.
CQRS is kind of a thing.
Lambda Architecture is a thing (die die die)
Latency?
Simplicity?vs.
![Page 71: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/71.jpg)
How Hard is Integration?
![Page 72: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/72.jpg)
![Page 73: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/73.jpg)
How Hard is Integration?
• I hate you Google.
• Integrating two things for one use case is easier than integrating two things as a vendor.
• Using a vendor-supplied integration is often much smarter than building your own. Others are using it. The Vendor tests it. Etc…
![Page 74: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/74.jpg)
Let’s Ingest from Kafka
Kafka Kafkaloader VoltDB
• Manage acks to ensure at least once delivery, even when any of the three pieces fails in any way.
• 1 Kafka “topic” routed to one table or one ingest procedure.
![Page 75: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/75.jpg)
But Clusters…
Kafka Kafkaloader
VoltDBKafka
KafkaVoltDB
VoltDB
![Page 76: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/76.jpg)
That middle guy was lame…
Kafka
VoltDBKafka
Kafka
VoltDB
VoltDB
Kafkaloader
Kafkaloader
Kafkaloader
• VoltDB nodes are elected leaders for Kafka topics.
• If any failure on either side happens, VoltDB coordinates to resume work.
• Guarantee at least once when used correct.
• Leverage ACID to get idempotency to get effective exactly once delivery.
![Page 77: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/77.jpg)
Users want more!
Kafka
VoltDBKafka
Kafka
VoltDB
VoltDB
Kafkaloader
Kafkaloader
Kafkaloader
• But what if data for many tables shares a topic?
• What if message content dictates how it should be processed?
User Code?
User Code?
User Code?
![Page 78: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/78.jpg)
Integrations So Far
• OLAP, like Vertica, Netezza, Teradata
• Generic, like JDBC, MySQL
• ElasticSearch
• Kafka, RabbitMQ, Kinesis
• HDFS/Spark and Hadoop Ecosystem
• CSV and raw sockets
• HTTP APIs
• Various AWS things
![Page 79: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/79.jpg)
Still More Java
![Page 80: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/80.jpg)
Shipping VoltDB• Wrap core VoltDB jar with python scripts.
Looks like Hadoop tools or Cassandra
• Wrap native libraries for Linux + macOS in the jarStole this idea from libsnappy
• You can use one Jar in eclipse to test VoltDB apps. Same jar as client lib Same jar as JDBC driver
JNI Binaries
![Page 81: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/81.jpg)
Networking
• Some apps have one client connection. Some apps have 5000.
• Some clients are long lived. Some are transient.
• VoltDB is often bottlenecked on # packets, not just throughput.
![Page 82: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/82.jpg)
Our Implementation• Use NIO to handle worst case client load.
Small penalty when handling best case.
• One network for some highest priority internal traffic, one for everything else.
• Used pooled direct Byte Buffers for network IO.
• Dedicated network threads (proportional to cores).
• Use ListenableFutureTasks for serialization in dedicated threads.
• Split NIO Selectors on many-core.
![Page 83: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/83.jpg)
Latency• Example SLA:
99.999% of txns return in 50ms
• Chief problems:
• Garbage collection
• Operational events
• Non-java compaction and cleanup
Ariel Weisberg at Strangeloop 2014 https://www.youtube.com/watch?v=EmiIUW4splQ
![Page 84: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/84.jpg)
Latency: Mullet Revisited
Networking, Txn-Mgmt in Java
User Procedures
Data Storage in C++
2-8GB
120GB
Per transaction stuff Config + other stuff
that lasts a while
![Page 85: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/85.jpg)
Operational Latency
• Initiating a snapshot used to take 200ms. Better now.
• Failing a node gracefully used to take about 1s. Better now.
• Failing a node by cutting it’s cord can take even longer.
• Some operational events require restart.
![Page 86: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/86.jpg)
Further Discussions• VoltDB scales well to 16 cores, then starts to scale less well to 32, and it’s not ideal at 64.
We have lots of thoughts about this and I could talk more about it.Some of it is Java. Some not. Customers don’t care much yet?
• Fragmentation in native memory allocation has been a big issue for us. It’s not much of a Java issue, but is interesting.
• When to use an off the shelf tool vs when to roll own.
• We’ve run into people who are resistant to using JVM software or writing stored procs in Java.
• Kafka has 4 different popular versions. Had to use OSGI module loading. Ugh.
![Page 87: Building a Database for the End of the World](https://reader038.vdocuments.us/reader038/viewer/2022110220/58f10c8a1a28abe47f8b4583/html5/thumbnails/87.jpg)
chat.voltdb.com
forum.voltdb.com
askanengineer @voltdb.com
@johnhugg @voltdb
all images from wikimedia w/ cc license unless otherwise noted
Thank You!• Please ask me questions now or later.
• Feedback on what was interesting, helpful, confusing, boring is ALWAYS welcome.
• Happy to talk about: Data management Systems software dev Distributed systems Japanese preschools
BS
Stuff I Don't Know
Stuff I Know
T H I S TA L K