self healing data
DESCRIPTION
With the rise of NoSQL databases, consistency models that are less strict than ACID transactions became popular again. After the initial enthusiasm, the developer community became aware that those relaxed consistency models pose some new challenges that were unknown in the ACID world. Fortunately, there are concepts for dealing with those challenges. This presentation gives a rough introduction to the different consistency models that are available and their characteristics. Then it focuses on two techniques for dealing with relaxed consistency. The first is quorum-based reads and writes, which provide a client-side strong consistency model (even if the database only implements eventual consistency). Then CRDTs (Conflict-free Replicated Data Types) are presented. CRDTs are self-stabilizing data structures designed for environments with extremely high availability requirements, and thus extremely weak consistency guarantees. Even though, as always, the majority of the information is on the voice track, I designed the slides so that they also provide some useful information without the voice.
TRANSCRIPT
![Page 1: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/1.jpg)
Self-healing data An introduction to consistency models, Quorums and CRDTs
Uwe Friedrichsen, codecentric AG, 2014
![Page 2: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/2.jpg)
@ufried Uwe Friedrichsen | http://slideshare.net/ufried | http://ufried.tumblr.com
![Page 3: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/3.jpg)
Why NoSQL?
• Scalability
• Easier schema evolution
• Availability on unreliable OTS hardware
• It's more fun ...
![Page 5: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/5.jpg)
Challenges
• Giving up ACID transactions
• (Temporal) anomalies and inconsistencies: short-term due to replication, long-term due to partitioning
It might happen. Thus, it will happen!
![Page 6: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/6.jpg)
CAP: C = Consistency, A = Availability, P = Partition Tolerance
• Strict Consistency: ACID / 2PC
• Strong Consistency: Quorum R&W / Paxos
• Eventual Consistency: CRDT / Gossip / Hinted Handoff
![Page 7: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/7.jpg)
Strict Consistency (CA)
• Great programming model: no anomalies or inconsistencies need to be considered
• Does not scale well: best for single-node databases
"We know ACID – It works!"
"We know 2PC – It sucks!"
Use for moderate data amounts
![Page 8: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/8.jpg)
And what if I need more data?
• Distributed datastore
• Partition tolerance is a must
• Need to give up strict consistency (CP or AP)
![Page 9: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/9.jpg)
Strong Consistency (CP)
• Majority-based consistency model: can tolerate up to N nodes failing out of 2N+1 nodes
• Good programming model: single-copy consistency
• Trades availability for consistency in case of partitioning
Paxos (for sequential consistency)
Quorum-based reads & writes
![Page 10: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/10.jpg)
Quorum-based reads & writes
• N – number of nodes (replicas)
• R – number of reads delivering the same result required
• W – number of confirmed writes required
W > N / 2
R + W > N
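The two conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `is_valid_quorum` and `smallest_quorums` are hypothetical, and the sample values are the (N, R, W) choices used in the examples of this deck.

```python
def is_valid_quorum(n: int, r: int, w: int) -> bool:
    """Check the two quorum conditions: W > N / 2 and R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        return False
    # 2 * w > n is the integer form of W > N / 2 (no floating point needed).
    return 2 * w > n and r + w > n


def smallest_quorums(n: int) -> tuple:
    """Return the smallest (R, W) pair that satisfies both conditions."""
    w = n // 2 + 1   # smallest integer strictly greater than n / 2
    r = n - w + 1    # smallest integer with r + w > n
    return r, w


# (N, R, W) values picked in the examples of this deck
example_1 = is_valid_quorum(3, 2, 2)
example_2 = is_valid_quorum(3, 1, 3)
example_3 = is_valid_quorum(5, 3, 3)
# R = 1 with W = 2 on three nodes violates R + W > N
too_weak = is_valid_quorum(3, 1, 2)
```

Using `2 * w > n` instead of a division keeps the check in integer arithmetic, which avoids rounding questions for odd N.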
![Page 11: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/11.jpg)
Example 1
• 3 replica nodes, that is: N = 3
• W > N / 2 → W > 3 / 2 → W > 1.5
Let's pick: W = 2 (can tolerate failure of 1 node)
• R + W > N → R + 2 > 3 → R > 1
Let's pick: R = 2 (can tolerate failure of 1 node)
![Page 12: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/12.jpg)
Diagram sequence (slides 12 to 20): a client and three replica nodes (Node 1, Node 2, Node 3), with W = 2 and R = 2
• Write: the client sends the write (W) to all three nodes. Two nodes confirm (✔), one fails (✖): W = 2 ✔
• First read: the client sends the read (R) to all three nodes and all three respond (✔). Two nodes return the same result, one returns a different result (2x / 1x): R = 2 ✔
• Second read: the client sends the read (R) to all three nodes, but only two respond (✔ ✔ ✖). The two results differ (1x / 1x): R = 1 ✖
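The write and the two reads in the diagram sequence above can be replayed on the client side. This is a rough sketch under stated assumptions: `quorum_write` and `quorum_read` are hypothetical helper names, a missing reply is modelled as `None`, and the values "new" and "old" stand in for the result of the latest write and a stale result.

```python
from collections import Counter
from typing import Optional, Sequence


def quorum_write(acks: Sequence[bool], w: int) -> bool:
    """A write succeeds once at least w replicas have confirmed it."""
    return sum(acks) >= w


def quorum_read(responses: Sequence[Optional[str]], r: int) -> Optional[str]:
    """Return the value that at least r replicas agree on, else None.

    A response of None models a replica that did not answer.
    """
    answered = [value for value in responses if value is not None]
    if not answered:
        return None
    value, count = Counter(answered).most_common(1)[0]
    return value if count >= r else None


# Write: two confirmations, one failure -> W = 2 is reached
write_ok = quorum_write([True, False, True], w=2)
# First read: all nodes answer, two agree -> R = 2 is reached
first_read = quorum_read(["new", "old", "new"], r=2)
# Second read: one node is down, the remaining two disagree -> read fails
second_read = quorum_read(["new", "old", None], r=2)
```

The second read shows why the client, not the nodes, perceives the consistency: the client rejects a result that fewer than R replicas agree on.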
![Page 21: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/21.jpg)
Example 2
• 3 replica nodes, that is: N = 3
• W > N / 2 → W > 3 / 2 → W > 1.5
Let's pick: W = 3 (strict consistency – does not tolerate any failure)
• R + W > N → R + 3 > 3 → R > 0
Let's pick: R = 1 (single read from any node sufficient)
![Page 22: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/22.jpg)
Example 3
• 5 replica nodes, that is: N = 5
• W > N / 2 → W > 5 / 2 → W > 2.5
Let's pick: W = 3 (can tolerate failure of 2 nodes)
• R + W > N → R + 3 > 5 → R > 2
Let's pick: R = 3 (can tolerate failure of 2 nodes)
![Page 23: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/23.jpg)
Limitations of QB R&W
• Requires fixed set of replica nodes: dynamic adding and removal of nodes not possible
• Client-centric consistency: consistency only perceived on client, not on nodes
• Requires additional measures on nodes: nodes need to implement at least eventual consistency
Use if client-perceived strong consistency is sufficient and simplicity trumps formal precision
![Page 24: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/24.jpg)
And what if I need more availability?
• Need to give up strong consistency (CP)
• Relax required consistency properties even more
• Leads to eventual consistency (AP)
![Page 25: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/25.jpg)
Eventual Consistency (AP)
• Gives up some consistency guarantees: no sequential consistency, anomalies become visible
• Maximum availability possible: can tolerate up to N-1 nodes failing out of N nodes
• Challenging programming model: anomalies usually need to be resolved explicitly
Gossip / Hinted Handoffs
CRDT
![Page 26: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/26.jpg)
Conflict-free Replicated Data Types
• Eventually consistent, self-stabilizing data structures
• Designed for maximum availability
• Tolerate up to N-1 out of N nodes failing
State-based CRDT: Convergent Replicated Data Type (CvRDT)
Operation-based CRDT: Commutative Replicated Data Type (CmRDT)
![Page 27: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/27.jpg)
A bit of theory first ...
![Page 28: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/28.jpg)
Convergent Replicated Data Type
State-based CRDT – CvRDT
• All replicas (usually) connected
• Exchange state between replicas, calculate new state on target replica
• State transfer at least once over eventually-reliable channels
• Set of possible states forms a semilattice
  • Partially ordered set of elements in which every pair of elements (and thus every non-empty finite subset) has a Least Upper Bound (LUB)
  • All state changes advance upwards with respect to the partial order
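When the merge function computes the least upper bound, it is commutative, associative and idempotent, which is why states can be delivered more than once and in any order. The sketch below spot-checks these three properties on a few sample states; `is_semilattice_join` is a hypothetical helper, and passing the check on samples is evidence, not a proof.

```python
from itertools import product


def is_semilattice_join(join, samples) -> bool:
    """Check commutativity, associativity and idempotence on sample states."""
    for a, b, c in product(samples, repeat=3):
        if join(a, b) != join(b, a):
            return False
        if join(join(a, b), c) != join(a, join(b, c)):
            return False
    return all(join(a, a) == a for a in samples)


int_states = [0, 1, 2, 5]
set_states = [frozenset(), frozenset({"x"}), frozenset({"x", "y"}), frozenset({"z"})]

# max on integers and union on sets behave like a least upper bound
max_is_join = is_semilattice_join(max, int_states)
union_is_join = is_semilattice_join(lambda a, b: a | b, set_states)
# addition is commutative and associative, but not idempotent (1 + 1 != 1)
sum_is_join = is_semilattice_join(lambda a, b: a + b, int_states)
```

Addition failing the check hints at why a state-based counter cannot simply add the incoming state to its own.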
![Page 29: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/29.jpg)
Commutative Replicated Data Type
Operation-based CRDT – CmRDT
• All replicas (usually) connected
• Exchange update operations between replicas, apply on target replica
• Reliable broadcast with ordering guarantee for non-concurrent updates
• Concurrent updates must be commutative
![Page 30: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/30.jpg)
That's enough theory ...
![Page 31: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/31.jpg)
Counter
![Page 32: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/32.jpg)
Op-based Counter
Data: Integer i
Init: i ≔ 0
Query: return i
Operations:
  increment(): i ≔ i + 1
  decrement(): i ≔ i - 1
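Because increment and decrement commute, replicas of the op-based counter converge regardless of the order in which operations arrive. Here is a minimal sketch, assuming each operation is delivered to every replica exactly once; the class name `OpCounter` and the string-encoded operations are illustrative choices, not taken from the slides.

```python
from itertools import permutations


class OpCounter:
    """Op-based counter: replicas exchange operations, not state."""

    def __init__(self) -> None:
        self.i = 0

    def apply(self, op: str) -> None:
        if op == "increment":
            self.i += 1
        elif op == "decrement":
            self.i -= 1
        else:
            raise ValueError(f"unknown operation: {op}")

    def query(self) -> int:
        return self.i


# Apply the same three operations in every possible delivery order.
ops = ["increment", "increment", "decrement"]
results = set()
for order in permutations(ops):
    replica = OpCounter()
    for op in order:
        replica.apply(op)
    results.add(replica.query())
```

All delivery orders produce the same value, so `results` contains a single element. Delivering an operation twice would break this, since the operations are not idempotent.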
![Page 33: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/33.jpg)
State-based G-Counter (grow only) (Naïve approach)
Data: Integer i
Init: i ≔ 0
Query: return i
Update:
  increment(): i ≔ i + 1
Merge(j): i ≔ max(i, j)
![Page 34: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/34.jpg)
State-based G-Counter (grow only) (Naïve approach)
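Diagram: three replicas R1, R2, R3, each initialized (I) with i = 0. R1 and R3 each perform one update (U), giving i = 1 locally. After the merges (M), the replicas hold i = 1, although two increments were performed.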
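The anomaly of the naive approach is easy to reproduce: two concurrent increments collapse into one when states are merged with max. A small sketch, with the hypothetical class name `NaiveGCounter`:

```python
class NaiveGCounter:
    """Single-integer counter merged with max (the naive approach)."""

    def __init__(self) -> None:
        self.i = 0

    def increment(self) -> None:
        self.i += 1

    def merge(self, other: "NaiveGCounter") -> None:
        self.i = max(self.i, other.i)

    def query(self) -> int:
        return self.i


r1, r2, r3 = NaiveGCounter(), NaiveGCounter(), NaiveGCounter()
r1.increment()   # concurrent update on R1
r3.increment()   # concurrent update on R3

# Exchange state between the replicas
r1.merge(r3)
r3.merge(r1)
r2.merge(r1)

performed_increments = 2
observed_value = r1.query()
```

The replicas do converge, but to 1 instead of 2: max cannot tell two independent increments apart from one increment seen twice.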
![Page 35: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/35.jpg)
State-based G-Counter (grow only) (Vector-based approach)
Data: Integer V[] / one element per replica set
Init: V ≔ [0, 0, ... , 0]
Query: return ∑i V[i]
Update:
  increment(): V[i] ≔ V[i] + 1 / i is replica set number
Merge(V'): ∀i ∈ [0, n-1] : V[i] ≔ max(V[i], V'[i])
![Page 36: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/36.jpg)
State-based G-Counter (grow only) (Vector-based approach)
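Diagram: three replicas R1, R2, R3, each initialized (I) with V = [0, 0, 0]. R1 performs an update (U), giving V = [1, 0, 0]; R3 performs an update (U), giving V = [0, 0, 1]. The merges (M) yield V = [1, 0, 0] and then V = [1, 0, 1], so both increments are preserved.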
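The vector-based G-Counter can be sketched in a few lines. The class name `GCounter` and the constructor parameters `replica_id` and `n` are assumptions made for this illustration; the scenario replays one increment each on R1 and R3, with zero-based slots 0 and 2.

```python
class GCounter:
    """State-based grow-only counter with one slot per replica."""

    def __init__(self, replica_id: int, n: int) -> None:
        self.replica_id = replica_id   # index of this replica's own slot
        self.v = [0] * n

    def increment(self) -> None:
        # A replica only ever writes to its own slot.
        self.v[self.replica_id] += 1

    def query(self) -> int:
        return sum(self.v)

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is the least upper bound of the two vectors.
        self.v = [max(a, b) for a, b in zip(self.v, other.v)]


r1, r2, r3 = GCounter(0, 3), GCounter(1, 3), GCounter(2, 3)
r1.increment()          # V = [1, 0, 0] on R1
r3.increment()          # V = [0, 0, 1] on R3

r2.merge(r1)            # V = [1, 0, 0] on R2
r2.merge(r3)            # V = [1, 0, 1] on R2
r2.merge(r3)            # merging the same state again changes nothing
```

Since each replica increments only its own slot, the element-wise max never loses an update, and repeated or reordered merges are harmless.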
![Page 37: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/37.jpg)
State-based PN-Counter (pos./neg.)
• Simple vector approach as with G-Counter does not work
  • Violates monotonicity requirement of semilattice
• Need to use two vectors
  • Vector P to track increments
  • Vector N to track decrements
  • Query result is ∑i (P[i] – N[i])
![Page 38: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/38.jpg)
State-based PN-Counter (pos./neg.)
Data: Integer P[], N[] / one element per replica set
Init: P ≔ [0, 0, ... , 0], N ≔ [0, 0, ... , 0]
Query: return ∑i (P[i] – N[i])
Update:
  increment(): P[i] ≔ P[i] + 1 / i is replica set number
  decrement(): N[i] ≔ N[i] + 1 / i is replica set number
Merge(P', N'):
  ∀i ∈ [0, n-1] : P[i] ≔ max(P[i], P'[i])
  ∀i ∈ [0, n-1] : N[i] ≔ max(N[i], N'[i])
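A PN-Counter is effectively two G-Counters, one for increments and one for decrements. The following sketch assumes a hypothetical class `PNCounter` with the same `replica_id` and `n` parameters as a vector-based G-Counter would need.

```python
class PNCounter:
    """State-based counter supporting increments and decrements."""

    def __init__(self, replica_id: int, n: int) -> None:
        self.replica_id = replica_id
        self.p = [0] * n   # increments per replica
        self.n = [0] * n   # decrements per replica

    def increment(self) -> None:
        self.p[self.replica_id] += 1

    def decrement(self) -> None:
        # A decrement grows N, so both vectors only ever move upwards.
        self.n[self.replica_id] += 1

    def query(self) -> int:
        return sum(self.p) - sum(self.n)

    def merge(self, other: "PNCounter") -> None:
        self.p = [max(a, b) for a, b in zip(self.p, other.p)]
        self.n = [max(a, b) for a, b in zip(self.n, other.n)]


a, b = PNCounter(0, 2), PNCounter(1, 2)
a.increment()
a.increment()
b.decrement()        # b has not seen a's increments yet
value_on_b_before_merge = b.query()

a.merge(b)
b.merge(a)
```

Before the merge, replica b reports a negative value, which is the starting point for the non-negative counter problem.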
![Page 39: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/39.jpg)
Non-negative Counter
Problem: How to check a global invariant with local information only?
• Approach 1: Only dec when local state is > 0
  • Concurrent decs could still lead to negative value
• Approach 2: Externalize negative values as 0
  • inc(negative value) == noop(), violates counter semantics
• Approach 3: Local invariant – only allow dec if P[i] - N[i] > 0
  • Works, but may be too strong limitation
• Approach 4: Synchronize
  • Works, but violates assumptions and prerequisites of CRDTs
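Approach 3 can be illustrated with a small variation of a PN-Counter. This is a sketch under assumptions: the class name `LocalInvariantCounter` is hypothetical, and `decrement` returning `False` is just one possible way to signal a rejected operation.

```python
class LocalInvariantCounter:
    """PN-Counter in which a replica may only decrement what it incremented itself."""

    def __init__(self, replica_id: int, n: int) -> None:
        self.replica_id = replica_id
        self.p = [0] * n
        self.n = [0] * n

    def increment(self) -> None:
        self.p[self.replica_id] += 1

    def decrement(self) -> bool:
        i = self.replica_id
        if self.p[i] - self.n[i] <= 0:
            return False   # local invariant P[i] - N[i] > 0 not met
        self.n[i] += 1
        return True

    def query(self) -> int:
        return sum(self.p) - sum(self.n)

    def merge(self, other: "LocalInvariantCounter") -> None:
        self.p = [max(x, y) for x, y in zip(self.p, other.p)]
        self.n = [max(x, y) for x, y in zip(self.n, other.n)]


a, b = LocalInvariantCounter(0, 2), LocalInvariantCounter(1, 2)
a.increment()
b.merge(a)
value_seen_by_b = b.query()
b_decremented = b.decrement()   # rejected although the counter is positive
a_decremented = a.decrement()   # allowed, a owns the increment
b.merge(a)
final_value = b.query()
```

Replica b sees a positive value but still cannot decrement, which is the "too strong limitation" the slide mentions: every per-replica difference stays non-negative, so the total does as well, at the price of rejecting legitimate operations.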
![Page 40: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/40.jpg)
Sets
![Page 41: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/41.jpg)
Op-based Set (Naïve approach)
Data: Set S
Init: S ≔ {}
Query(e): return e ∈ S
Operations:
  add(e): S ≔ S ∪ {e}
  remove(e): S ≔ S \ {e}
![Page 42: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/42.jpg)
Op-based Set (Naïve approach)
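Diagram: three replicas R1, R2, R3, each initialized (I) with S = {}. The replicas issue add(e), rmv(e) and a second, concurrent add(e). The operations reach the replicas in different orders, so the replicas end in different states: S = {e} on some, S = {} on others.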
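The divergence of the naive op-based set comes from add and remove not commuting. A minimal sketch, using a hypothetical helper `apply_ops` that replays a sequence of operations on an empty set:

```python
def apply_ops(ops):
    """Replay (kind, element) operations on an initially empty set."""
    s = set()
    for kind, element in ops:
        if kind == "add":
            s.add(element)
        elif kind == "rmv":
            s.discard(element)
        else:
            raise ValueError(f"unknown operation: {kind}")
    return s


first_add = ("add", "e")
remove = ("rmv", "e")
concurrent_add = ("add", "e")   # issued on another replica

# Two replicas receive the same three operations in different orders.
replica_x = apply_ops([first_add, remove, concurrent_add])
replica_y = apply_ops([first_add, concurrent_add, remove])
```

Both replicas have seen every operation, yet one ends with the element and the other without it, so the data type does not converge.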
![Page 43: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/43.jpg)
State-based G-Set (grow only)
Data: Set S
Init: S ≔ {}
Query(e): return e ∈ S
Update:
  add(e): S ≔ S ∪ {e}
Merge(S'): S ≔ S ∪ S'
![Page 44: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/44.jpg)
State-based 2P-Set (two-phase)
Data: Set A, R / A: added, R: removed
Init: A ≔ {}, R ≔ {}
Query(e): return e ∈ A ∧ e ∉ R
Update:
  add(e): A ≔ A ∪ {e}
  remove(e): (pre query(e)) R ≔ R ∪ {e}
Merge(A', R'): A ≔ A ∪ A', R ≔ R ∪ R'
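The 2P-Set keeps removed elements as tombstones, so an element that was removed once can never be added again. The sketch below assumes a hypothetical class `TwoPSet`; raising `ValueError` when the precondition fails is an illustrative choice.

```python
class TwoPSet:
    """State-based two-phase set: one grow-only set for adds, one for removes."""

    def __init__(self) -> None:
        self.added = set()
        self.removed = set()

    def query(self, e) -> bool:
        return e in self.added and e not in self.removed

    def add(self, e) -> None:
        self.added.add(e)

    def remove(self, e) -> None:
        if not self.query(e):   # precondition: e is currently in the set
            raise ValueError(f"{e!r} is not in the set")
        self.removed.add(e)

    def merge(self, other: "TwoPSet") -> None:
        self.added |= other.added
        self.removed |= other.removed


r1, r2 = TwoPSet(), TwoPSet()
r1.add("e")
r2.merge(r1)
r2.remove("e")          # removal happens on another replica
r1.merge(r2)
present_after_remove = r1.query("e")

r1.add("e")             # adding again has no visible effect
present_after_re_add = r1.query("e")
```

Both internal sets only grow and are merged by union, which keeps the state a semilattice; the cost is that removal is final.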
![Page 45: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/45.jpg)
Op-based OR-Set (observed-remove)
Data: Set S / Set of pairs { (element e, unique tag u), ... }
Init: S ≔ {}
Query(e): return ∃u : (e, u) ∈ S
Operations:
  add(e): S ≔ S ∪ { (e, u) } / u is generated unique tag
  remove(e):
    pre query(e)
    R ≔ { (e, u) | ∃u : (e, u) ∈ S } / at source ("prepare")
    S ≔ S \ R / downstream ("execute")
![Page 46: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/46.jpg)
Op-based OR-Set (observed-remove)
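Diagram: three replicas R1, R2, R3, each initialized (I) with S = {}. Two concurrent add(e) operations create the uniquely tagged elements ea and eb; the rmv(e) is propagated as rmv(ea). The replicas apply add(ea), add(eb) and rmv(ea) in different orders (one passes through S = {ea, eb}) and end with S = {eb}.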
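The prepare/execute split of the OR-Set can be sketched as follows. The class `ORSet`, its method names and the optional `tag` parameter (used here to get the readable tags "a" and "b") are assumptions for this illustration. The sketch also assumes that a remove is delivered after the adds it observed, as the ordering guarantee for non-concurrent updates requires.

```python
import uuid


class ORSet:
    """Op-based observed-remove set storing (element, unique tag) pairs."""

    def __init__(self) -> None:
        self.s = set()

    def query(self, e) -> bool:
        return any(element == e for element, _ in self.s)

    def prepare_add(self, e, tag=None):
        # At source: attach a unique tag to the element.
        return ("add", (e, tag if tag is not None else uuid.uuid4().hex))

    def prepare_remove(self, e):
        # At source: collect only the pairs this replica has observed.
        if not self.query(e):
            raise ValueError(f"{e!r} is not in the set")
        return ("rmv", frozenset(pair for pair in self.s if pair[0] == e))

    def execute(self, op) -> None:
        # At source and downstream: apply the prepared operation.
        kind, payload = op
        if kind == "add":
            self.s.add(payload)
        else:
            self.s -= payload


r1, r2, r3 = ORSet(), ORSet(), ORSet()
add_a = r1.prepare_add("e", tag="a")
r1.execute(add_a)
add_b = r3.prepare_add("e", tag="b")   # concurrent add on R3
r3.execute(add_b)
rmv_a = r1.prepare_remove("e")         # R1 has only observed tag "a"
r1.execute(rmv_a)

# Deliver the remaining operations in different orders.
r1.execute(add_b)
for op in (add_b, add_a, rmv_a):
    r2.execute(op)
for op in (add_a, rmv_a):
    r3.execute(op)
```

Because the remove only targets the tags observed at its source, the concurrent add survives on every replica: add wins over a concurrent remove.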
![Page 47: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/47.jpg)
More datatypes
• Register
• Dictionary (Map)
• Tree
• Graph
• Array
• List
plus more representations for each datatype
![Page 48: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/48.jpg)
Garbage collection
• Sets could grow infinitely in worst case
• Garbage collection possible, but a bit tricky
• Only remove updates that can be deleted safely and were received by all replicas
• Usually implemented using vector clocks
• Can tolerate up to n-1 crashes
• Live only while no replicas are crashed
• Can induce surprising behavior sometimes
• Sometimes stronger consensus is needed (Paxos, ...)
![Page 49: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/49.jpg)
Limitations of CRDTs
• Very weak consistency guarantees: strives for "quiescent consistency"
• Eventually consistent: not suitable for high-volume ID generator or alike
• Not easy to understand and model
• Not all data structures representable
Use if availability is extremely important
![Page 50: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/50.jpg)
Further reading
1. Shapiro et al., Conflict-free Replicated Data Types, Inria Research report, 2011
2. Shapiro et al., A comprehensive study of Convergent and Commutative Replicated Data Types, Inria Research report, 2011
3. Basho Tag Archives: CRDT, https://basho.com/tag/crdt/
4. Leslie Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM (1978)
![Page 51: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/51.jpg)
Wrap-up
• CAP requires rethinking consistency
• Strict Consistency: ACID / 2PC
• Strong Consistency: Quorum-based R&W, Paxos
• Eventual Consistency: CRDT, Gossip, Hinted Handoffs
Pick your consistency model based on your consistency and availability requirements
![Page 52: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/52.jpg)
The real world is not ACID. Thus, it is perfectly fine to go for a relaxed consistency model.
![Page 53: Self healing data](https://reader034.vdocuments.us/reader034/viewer/2022052410/554ba5bbb4c905b3618b4eb9/html5/thumbnails/53.jpg)