![Page 1: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/1.jpg)
DBOS: It’s Time for a Principled Approach to Distributed Systems State
Kostis Kaffes1, Peter Kraft1, Qian Li1, Athinagoras Skiadopoulos1, Daniel Hong2, Shana Mathew2, Michael Cafarella2, Vijay Gadepally2,
Goetz Graefe3, Jeremy Kepner2, Christos Kozyrakis1, Tim Kraska2, Michael Stonebraker2, Lalith Suresh4, Matei Zaharia1
(1Stanford, 2MIT, 3Google, 4VMWare)Feb. 11, 2021
1
![Page 2: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/2.jpg)
Problem: Distributed Systems are Hard!
•Hard to build, maintain, extend, and optimize.• Basic features take years to develop for the best
engineers and teams in industry and academia.• Examples: distributing a scheduler, partitioning a file
system namespace, peer discovery, anything involving security, and much more.
2
![Page 3: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/3.jpg)
Why are Distributed Systems Hard?
•Distributed systems lack a principled approach to managing state.• State includes cluster configurations, task and
worker metadata, user data, etc.• All system operations act on state.
3
![Page 4: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/4.jpg)
Overview1. Describe bad state management.2. Propose a radical approach to good state
management: centralize all state in a DBMS.3. Present case studies showing the benefits of our
proposal to core system operations (e.g. scheduling).
4. Discuss proof-of-concept experiments showing this can work in practice.
4
![Page 5: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/5.jpg)
Three Sins of Bad State Management1. Dividing state across many disjoint data stores.2. Providing weak abstractions for manipulating state.3. Using data stores that cannot scale.
5
![Page 6: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/6.jpg)
Sin 1: Dividing state across many disjoint data stores.• Example: OpenWhisk/Kubernetes divide state between
stores (CouchDB, Consul, ZK, Kafka, etc.) and in-memory data structures.•Makes cross-cutting operations hard.• OpenWhisk monitoring requires ad-hoc interfaces on
every system component.• Kubernetes scheduler isn’t NUMA aware because it no
one can communicate NUMA state to it.
6
![Page 7: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/7.jpg)
Sin 2: Providing weak abstractions for manipulating state.
• Systems missing basic primitives for state manipulation (e.g. atomic updates).• Example: Distributed schedulers must reinvent
concurrency control (over in-memory state).• Example: Analytics over file systems requires
exporting metadata to an external database.
7
![Page 8: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/8.jpg)
Sin 3: Using data stores that cannot scale.
• Systems centralize all state on a “master” node.•Master bottlenecks performance for large clusters.• Example: Spark scheduler capped at 1K tasks/second.
8
![Page 9: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/9.jpg)
Two Principles of Good State Management
1. Centralize system state and user data in a uniform data model as database tables in a distributed DBMS.
2. Execute all operations on state as DBMS transactions invoked from otherwise stateless processes.
9
![Page 10: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/10.jpg)
Distributed DBMS
Linux/Hardware
Well-Managed State
Filesystem Scheduler Communication Layer
Auditor
10
Application Layer
![Page 11: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/11.jpg)
A Radical Redesign
Prior Systems• Design data structures to store
state.
DBMS-based Systems• Design schemas and indexes
to store state.
11
![Page 12: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/12.jpg)
A Radical Redesign
Prior Systems• Design data structures to store
state.• Use RPCs, manual concurrency
control to modify state.
DBMS-based Systems• Design schemas and indexes
to store state.• Modify state using DBMS
transactions.
12
![Page 13: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/13.jpg)
A Radical Redesign
Prior Systems• Design data structures to store
state.• Use RPCs, manual concurrency
control to modify state.• Divide user data across file
systems, remote stores.
DBMS-based Systems• Design schemas and indexes
to store state.• Modify state using DBMS
transactions.• Store user data in the DBMS
blob store.
13
![Page 14: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/14.jpg)
A Radical Redesign
Prior Systems• Design data structures to store
state.• Use RPCs, manual concurrency
control to modify state.• Divide user data across file
systems, remote stores.• Analyze data with ad-hoc
monitoring interfaces, log parsers.
DBMS-based Systems• Design schemas and indexes
to store state.• Modify state using DBMS
transactions.• Store user data in the DBMS
blob store.• Analyze state in the DBMS
directly!
14
![Page 15: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/15.jpg)
Our Proposal: DBOS•We are designing DBOS, a cluster operating system
based on these two principles.•We hope future systems can use it as a framework to
more easily build a system with good state management.• Just starting work on DBOS, details unclear!
15
![Page 16: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/16.jpg)
Distributed DBMS
Linux/Hardware
DBOS on the Stack
Filesystem Scheduler Communication Layer
Auditor
16
Application Layer
![Page 17: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/17.jpg)
Benefits of Good State Management
No more ad-hoc approach!• Extensibility• Introspection
17
![Page 18: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/18.jpg)
https://twitter.com/redpenblackpen/status/875100791165648898/photo/1
Benefits of Good State Management• Extensibility• Introspection
No more ad-hoc approach!
18
![Page 19: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/19.jpg)
https://twitter.com/redpenblackpen/status/875100791165648898/photo/1
Benefits of Good State Management• Extensibility• Introspection
No more ad-hoc approach!
19
DBOS approach+ Easier to maintain, extend, and optimize+ Easier to monitor, analyze, and debug
![Page 20: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/20.jpg)
Case Study I: Cluster Scheduling
• A scheduler = a stored procedure interacts with DB+Easy to add new dimensions across layers: NUMA,
heterogenous hardware, data locality+Faster innovation using query interface: DCM (OSDI’20) --
scalable, flexible new algorithms+Debuggability: “Find hot spots: workers have CPU utilization
higher than X and temperature higher than Y?”
20
select * from Workerwhere CpuUtil > X and CpuTemp > Y;
![Page 21: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/21.jpg)
Case Study II: Serverless Task I/O
•No more hacky, high-overhead solutions - Rendezvous server, TCP hole punching, S3-based I/O
•Query task location, pass messages through DB+Easy to support more patterns: broadcast, aggregation,...+Easy to optimize: batching, parallelization,...
21
![Page 22: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/22.jpg)
Case Study III: File System
• FS stores both user data and metadata in a DBMS+Easy to implement various types of data stores (block vs.
object) by changing DB schemas+Adaptive to workload changes (block size, indexes)+Native support for efficient analytics: “find all files
belonging to user X and larger than Y bytes”
22
select FileName from Filewhere UserName = X and FileSize > Y;
![Page 23: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/23.jpg)
Proof of Concept• A simple FS using VoltDB• Store all data (UserName, FileName, UserData,...<metadata>) in
a single File table• Two synthetic benchmarks: create and read 1KB files
insert into File values (user, fileName, userData,...);
select UserData from Filewhere UserName=user and FileName=file;
23
![Page 24: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/24.jpg)
Proof of Concept• Partition File table across 40 parallel partitions on 2 servers
0 1 2Throughput (⇥106 TPS)
101
102
103
104
Late
ncy
(us) Median
P99
Create files
0 1 2Throughput (⇥106 TPS)
101
102
103
104
Late
ncy
(us) Median
P99
Read 1KB files
Takeaway: VoltDB delivers sub-millisecond latency and sustain 1M+ operations/second per server
24
![Page 25: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/25.jpg)
The Time for Revolution is Now• High performance datacenter hardware• Database systems are finally ready -- NewSQL• Low latency, high throughput, distributed, scalable, in-memory
transactional DBMSs• Any distributed DBMS that supports:
• ACID transactions, rigorous query semantics (e.g., SQL)• Scalability: tables partitioned across nodes• Low latency: data mostly be memory-resident
• Examples• VoltDB -- millions of transactions/sec at low latency• SingleStore(MemSQL), ClustrixDB, NuoDB, CockroachDB, ...
25
![Page 26: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/26.jpg)
The Time for Revolution is Now• Limitations• Limited support for heterogenous storage formats• Multi-tenant interference on shared resources
• Active research areas: we are optimistic J• Polystores: uniform interface over diverse formats• Performance isolation, admission control
26
![Page 27: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/27.jpg)
Research Directions
• Starting to build DBOS• Many open questions
•DBOS will enable:• Security and privacy• Data provenance (GDPR compliance)• Self-adaptivity using ML/RL
27
![Page 28: DBOS: It’s Time for a Principled Approach to Distributed](https://reader031.vdocuments.us/reader031/viewer/2022012300/61e104a7982012029a5b2d11/html5/thumbnails/28.jpg)
Conclusion
•Distributed systems are hard to build, maintain, extend, and scale• An extreme solution: manage all state centrally in a
distributed DBMS• Can be practical with today’s distributed DBMSs
Questions?Read more: https://arxiv.org/abs/2007.11112
28
DBOS: It’s Time for a Principled Approach to Distributed Systems State!