DBOS: It’s Time for a Principled Approach to Distributed Systems State

Kostis Kaffes¹, Peter Kraft¹, Qian Li¹, Athinagoras Skiadopoulos¹, Daniel Hong², Shana Mathew², Michael Cafarella², Vijay Gadepally², Goetz Graefe³, Jeremy Kepner², Christos Kozyrakis¹, Tim Kraska², Michael Stonebraker², Lalith Suresh⁴, Matei Zaharia¹

(¹Stanford, ²MIT, ³Google, ⁴VMware)

Feb. 11, 2021

Problem: Distributed Systems are Hard!

• Hard to build, maintain, extend, and optimize.
• Basic features take years to develop for the best engineers and teams in industry and academia.
• Examples: distributing a scheduler, partitioning a file system namespace, peer discovery, anything involving security, and much more.

Why are Distributed Systems Hard?

• Distributed systems lack a principled approach to managing state.
• State includes cluster configurations, task and worker metadata, user data, etc.
• All system operations act on state.

Overview

1. Describe bad state management.
2. Propose a radical approach to good state management: centralize all state in a DBMS.
3. Present case studies showing the benefits of our proposal to core system operations (e.g. scheduling).
4. Discuss proof-of-concept experiments showing this can work in practice.

Three Sins of Bad State Management

1. Dividing state across many disjoint data stores.
2. Providing weak abstractions for manipulating state.
3. Using data stores that cannot scale.

Sin 1: Dividing state across many disjoint data stores.

• Example: OpenWhisk/Kubernetes divide state between stores (CouchDB, Consul, ZK, Kafka, etc.) and in-memory data structures.
• Makes cross-cutting operations hard.
• OpenWhisk monitoring requires ad-hoc interfaces on every system component.
• Kubernetes scheduler isn’t NUMA aware because no one can communicate NUMA state to it.

Sin 2: Providing weak abstractions for manipulating state.

• Systems are missing basic primitives for state manipulation (e.g. atomic updates).
• Example: Distributed schedulers must reinvent concurrency control (over in-memory state).
• Example: Analytics over file systems requires exporting metadata to an external database.

Sin 3: Using data stores that cannot scale.

• Systems centralize all state on a “master” node.
• Master bottlenecks performance for large clusters.
• Example: Spark scheduler capped at 1K tasks/second.

Two Principles of Good State Management

1. Centralize system state and user data in a uniform data model as database tables in a distributed DBMS.

2. Execute all operations on state as DBMS transactions invoked from otherwise stateless processes.
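As a rough illustration of the second principle, the sketch below runs one scheduling operation as a single transaction invoked from a process that keeps no state of its own. It uses Python's built-in sqlite3 as a single-node stand-in for a distributed DBMS; the Worker and Task tables, their columns, and the assign_task function are hypothetical names chosen for this example, not part of DBOS.

```python
import sqlite3


def open_state_db() -> sqlite3.Connection:
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN/COMMIT/ROLLBACK statements below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE Worker (WorkerId INTEGER PRIMARY KEY,
                             FreeSlots INTEGER NOT NULL);
        CREATE TABLE Task (TaskId INTEGER PRIMARY KEY,
                           WorkerId INTEGER NOT NULL);
        """
    )
    return conn


def assign_task(conn: sqlite3.Connection, task_id: int):
    """Place a task on the worker with the most free slots, atomically.

    Hypothetical example: all state lives in the tables, so any process
    holding a connection could run this operation.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT WorkerId FROM Worker WHERE FreeSlots > 0 "
            "ORDER BY FreeSlots DESC, WorkerId LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return None  # no capacity anywhere in the cluster
        worker_id = row[0]
        # Both updates commit together or not at all.
        conn.execute(
            "UPDATE Worker SET FreeSlots = FreeSlots - 1 WHERE WorkerId = ?",
            (worker_id,),
        )
        conn.execute(
            "INSERT INTO Task (TaskId, WorkerId) VALUES (?, ?)",
            (task_id, worker_id),
        )
        conn.execute("COMMIT")
        return worker_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The slot count and the task placement change together, so a concurrent scheduler instance would never observe a task without its slot being accounted for; that atomicity comes from the DBMS rather than from hand-written concurrency control.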

[Figure: Well-Managed State. Layer labels: Application Layer; Filesystem, Scheduler, Communication Layer, Auditor; Distributed DBMS; Linux/Hardware.]

A Radical Redesign

Prior Systems
• Design data structures to store state.
• Use RPCs, manual concurrency control to modify state.
• Divide user data across file systems, remote stores.
• Analyze data with ad-hoc monitoring interfaces, log parsers.

DBMS-based Systems
• Design schemas and indexes to store state.
• Modify state using DBMS transactions.
• Store user data in the DBMS blob store.
• Analyze state in the DBMS directly!

Our Proposal: DBOS

• We are designing DBOS, a cluster operating system based on these two principles.
• We hope future systems can use it as a framework to more easily build a system with good state management.
• Just starting work on DBOS, details unclear!

[Figure: DBOS on the Stack. Layer labels: Application Layer; Filesystem, Scheduler, Communication Layer, Auditor; Distributed DBMS; Linux/Hardware.]

Benefits of Good State Management

• Extensibility
• Introspection

No more ad-hoc approach!
Image: https://twitter.com/redpenblackpen/status/875100791165648898/photo/1

DBOS approach
+ Easier to maintain, extend, and optimize
+ Easier to monitor, analyze, and debug

Case Study I: Cluster Scheduling

• A scheduler = a stored procedure that interacts with the DB
+ Easy to add new dimensions across layers: NUMA, heterogeneous hardware, data locality
+ Faster innovation using query interface: DCM (OSDI’20) -- scalable, flexible new algorithms
+ Debuggability: “Find hot spots: workers with CPU utilization higher than X and temperature higher than Y”

select * from Worker
where CpuUtil > X and CpuTemp > Y;
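To make the hot-spot query concrete, here is a minimal runnable sketch using Python's built-in sqlite3 as a stand-in for the cluster DBMS. The Worker table and the CpuUtil and CpuTemp columns come from the slide; the WorkerId column, the sample rows, and the find_hot_spots helper are assumptions made for illustration.

```python
import sqlite3


def find_hot_spots(conn: sqlite3.Connection, max_util: float, max_temp: float):
    """Return ids of workers above both thresholds (X and Y on the slide)."""
    rows = conn.execute(
        "SELECT WorkerId FROM Worker "
        "WHERE CpuUtil > ? AND CpuTemp > ? ORDER BY WorkerId",
        (max_util, max_temp),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Worker (WorkerId INTEGER PRIMARY KEY, "
    "CpuUtil REAL NOT NULL, CpuTemp REAL NOT NULL)"
)
# Made-up sample state: (worker id, CPU utilization, CPU temperature).
conn.executemany(
    "INSERT INTO Worker VALUES (?, ?, ?)",
    [(1, 0.95, 82.0), (2, 0.40, 85.0), (3, 0.97, 60.0), (4, 0.91, 90.0)],
)
hot = find_hot_spots(conn, max_util=0.9, max_temp=80.0)
print(hot)  # [1, 4]
```

Because worker state already lives in a table, the debugging question needs no extra monitoring interface: it is an ordinary query with two predicates.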

Case Study II: Serverless Task I/O

• No more hacky, high-overhead solutions
  - Rendezvous server, TCP hole punching, S3-based I/O
• Query task location, pass messages through DB
+ Easy to support more patterns: broadcast, aggregation, ...
+ Easy to optimize: batching, parallelization, ...
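One way to picture message passing through the database is a pair of tables, one for task locations and one acting as a mailbox. The sketch below assumes this layout; the TaskLocation and Message tables and the locate, send, broadcast, and receive helpers are hypothetical, and sqlite3 stands in for a distributed DBMS.

```python
import sqlite3


def open_message_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE TaskLocation (TaskId INTEGER PRIMARY KEY,
                                   WorkerId INTEGER NOT NULL);
        CREATE TABLE Message (MsgId INTEGER PRIMARY KEY AUTOINCREMENT,
                              Receiver INTEGER NOT NULL,
                              Payload BLOB NOT NULL);
        """
    )
    return conn


def locate(conn, task_id):
    """Query task location: which worker currently runs this task?"""
    row = conn.execute(
        "SELECT WorkerId FROM TaskLocation WHERE TaskId = ?", (task_id,)
    ).fetchone()
    return None if row is None else row[0]


def send(conn, receiver, payload):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO Message (Receiver, Payload) VALUES (?, ?)",
            (receiver, payload),
        )


def broadcast(conn, payload):
    # A single INSERT ... SELECT delivers one copy to every registered task.
    with conn:
        conn.execute(
            "INSERT INTO Message (Receiver, Payload) "
            "SELECT TaskId, ? FROM TaskLocation",
            (payload,),
        )


def receive(conn, receiver):
    """Fetch and delete all pending messages for a task, oldest first."""
    with conn:
        rows = conn.execute(
            "SELECT MsgId, Payload FROM Message "
            "WHERE Receiver = ? ORDER BY MsgId",
            (receiver,),
        ).fetchall()
        conn.executemany(
            "DELETE FROM Message WHERE MsgId = ?", [(r[0],) for r in rows]
        )
    return [r[1] for r in rows]
```

In this layout a broadcast is one extra SQL statement rather than a new communication protocol, which is the kind of pattern the slide suggests becomes easy to add.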

Case Study III: File System

• FS stores both user data and metadata in a DBMS
+ Easy to implement various types of data stores (block vs. object) by changing DB schemas
+ Adaptive to workload changes (block size, indexes)
+ Native support for efficient analytics: “find all files belonging to user X and larger than Y bytes”

select FileName from File
where UserName = X and FileSize > Y;
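The analytics query above can be tried against a toy metadata table. This is a sketch using Python's built-in sqlite3; the File table and the UserName, FileName, and FileSize columns follow the slide, while the index name, the sample rows, and the large_files helper are assumptions for illustration.

```python
import sqlite3


def large_files(conn: sqlite3.Connection, user: str, min_bytes: int):
    """Names of files owned by `user` that are larger than `min_bytes`."""
    rows = conn.execute(
        "SELECT FileName FROM File "
        "WHERE UserName = ? AND FileSize > ? ORDER BY FileName",
        (user, min_bytes),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE File (UserName TEXT NOT NULL, FileName TEXT NOT NULL, "
    "FileSize INTEGER NOT NULL, PRIMARY KEY (UserName, FileName))"
)
# Hypothetical secondary index so the query need not scan every file.
conn.execute("CREATE INDEX FileByUserAndSize ON File (UserName, FileSize)")
conn.executemany(
    "INSERT INTO File VALUES (?, ?, ?)",
    [
        ("alice", "a.txt", 500),
        ("alice", "b.bin", 4096),
        ("bob", "c.bin", 8192),
        ("alice", "d.log", 2048),
    ],
)
result = large_files(conn, "alice", 1000)
print(result)  # ['b.bin', 'd.log']
```

Adding or dropping an index like this one is a schema change rather than a file system rewrite, which is how the "adaptive to workload changes" point might play out in practice.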

Proof of Concept

• A simple FS using VoltDB
• Store all data (UserName, FileName, UserData, ...<metadata>) in a single File table
• Two synthetic benchmarks: create and read 1KB files

insert into File values (user, fileName, userData, ...);

select UserData from File
where UserName=user and FileName=file;
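The two statements above can be exercised with a small script. The sketch below uses Python's built-in sqlite3 on a single node, not VoltDB, so its timings say nothing about the results reported in this talk; it only shows the shape of the create and read benchmarks. The table keeps just the three columns named on the slide, and the function names and the loop size are assumptions.

```python
import os
import sqlite3
import time


def open_fs() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE File (UserName TEXT NOT NULL, FileName TEXT NOT NULL, "
        "UserData BLOB NOT NULL, PRIMARY KEY (UserName, FileName))"
    )
    return conn


def create_file(conn, user, file_name, data):
    with conn:  # one transaction per file creation
        conn.execute(
            "INSERT INTO File VALUES (?, ?, ?)", (user, file_name, data)
        )


def read_file(conn, user, file_name):
    row = conn.execute(
        "SELECT UserData FROM File WHERE UserName = ? AND FileName = ?",
        (user, file_name),
    ).fetchone()
    return None if row is None else row[0]


def benchmark(n: int = 1000):
    """Create then read n 1KB files; return (creates/sec, reads/sec)."""
    conn = open_fs()
    payload = os.urandom(1024)  # 1KB of user data
    start = time.perf_counter()
    for i in range(n):
        create_file(conn, "user", f"file{i}", payload)
    create_seconds = time.perf_counter() - start
    start = time.perf_counter()
    for i in range(n):
        read_file(conn, "user", f"file{i}")
    read_seconds = time.perf_counter() - start
    return n / create_seconds, n / read_seconds


if __name__ == "__main__":
    creates, reads = benchmark()
    print(f"create: {creates:.0f} ops/s, read: {reads:.0f} ops/s")
```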

Proof of Concept

• Partition File table across 40 parallel partitions on 2 servers

[Figure: two panels, “Create files” and “Read 1KB files”. x-axis: Throughput (×10⁶ TPS), ticks 0 to 2; y-axis: Median (us).]

Takeaway: VoltDB delivers sub-millisecond latency and sustains 1M+ operations/second per server

The Time for Revolution is Now

• High-performance datacenter hardware
• Database systems are finally ready -- NewSQL
  • Low latency, high throughput, distributed, scalable, in-memory transactional DBMSs
• Any distributed DBMS that supports:
  • ACID transactions, rigorous query semantics (e.g., SQL)
  • Scalability: tables partitioned across nodes
  • Low latency: data is mostly memory-resident
• Examples
  • VoltDB -- millions of transactions/sec at low latency
  • SingleStore (MemSQL), ClustrixDB, NuoDB, CockroachDB, ...

The Time for Revolution is Now

• Limitations
  • Limited support for heterogeneous storage formats
  • Multi-tenant interference on shared resources
• Active research areas: we are optimistic :)
  • Polystores: uniform interface over diverse formats
  • Performance isolation, admission control

Research Directions

• Starting to build DBOS
• Many open questions
• DBOS will enable:
  • Security and privacy
  • Data provenance (GDPR compliance)
  • Self-adaptivity using ML/RL

Conclusion

• Distributed systems are hard to build, maintain, extend, and scale
• An extreme solution: manage all state centrally in a distributed DBMS
• Can be practical with today’s distributed DBMSs

Questions?

Read more: https://arxiv.org/abs/2007.11112
