Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014
TRANSCRIPT
![Page 1: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/1.jpg)
Redis Cluster design tradeoffs @antirez - Pivotal
![Page 2: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/2.jpg)
What is performance?
• Low latency.
• IOPS.
• Operations quality and data model.
![Page 3: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/3.jpg)
Go Cluster
• Redis Cluster must serve the same use cases as Redis.
• Tradeoffs are inherent in distributed systems (DS).
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
![Page 4: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/4.jpg)
CP systems
Client S1
S2
S3
S4
CAP: consistency price is added latency
![Page 5: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/5.jpg)
CP systems
Client S1
S2
S3
S4
Reply to client after majority ACKs
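The "reply after majority ACKs" rule can be sketched roughly as below. This is only an illustration of why consistency costs latency, not the protocol of any particular CP system; `majority`, `quorum_write` and the replica callables are hypothetical names.

```python
from typing import Callable, List

def majority(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1

def quorum_write(replicas: List[Callable[[str], bool]], value: str) -> bool:
    """Send value to the replicas; succeed only once a majority has ACKed.

    Each replica is modeled as a callable returning True (ACK) or False
    (no ACK, e.g. unreachable). A real system would send in parallel and
    still wait for the slowest node of the majority before replying.
    """
    needed = majority(len(replicas))
    acks = 0
    for send in replicas:
        if send(value):
            acks += 1
        if acks >= needed:
            return True   # safe to reply to the client
    return False          # not enough ACKs: the write cannot be confirmed

# Four servers S1..S4 (assumed scenario).
up = lambda v: True
down = lambda v: False
print(quorum_write([up, up, up, down], "x"))    # True: 3 of 4 ACKed
print(quorum_write([up, up, down, down], "x"))  # False: 2 of 4 is not a majority
```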
![Page 6: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/6.jpg)
And… there is the disk
S1 S2 S3
Disk Disk Disk
CP algorithms may require fsync-before-ack. Durability / Consistency not always orthogonal.
![Page 7: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/7.jpg)
AP systems
Client
S1
S2
Eventual consistency with merges? (note: merge is not strictly part of EC)
Client
A = {1,2,3,8,12,13,14}
A = {2,3,8,11,12,1}
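One possible merge for the two divergent copies of A is a plain set union. The slide does not say which merge function is meant, so this is an assumption; `merge` is a hypothetical helper.

```python
# The two divergent copies of A from the slide.
replica_1 = {1, 2, 3, 8, 12, 13, 14}
replica_2 = {2, 3, 8, 11, 12, 1}

def merge(a: set, b: set) -> set:
    """Union merge: commutative, associative and idempotent, so replicas
    converge regardless of the order in which they exchange state."""
    return a | b

merged = merge(replica_1, replica_2)
print(sorted(merged))  # [1, 2, 3, 8, 11, 12, 13, 14]
```

Note that a union can only grow the set: with this merge a removal performed on one side is silently undone by the other.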
![Page 8: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/8.jpg)
Many kinds of consistencies
• “C” of CAP is strong consistency.
• It is not the only available tradeoff of course.
• Consistency is the set of liveness and safety properties a given system provides.
• Eventual consistency: like saying nothing at all. What liveness/safety properties if not “C”?
![Page 9: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/9.jpg)
Redis Cluster
Client
A,B,C
A,B,C
Sharding and replication (asynchronous).
A,B,C
D,E,F
D,E,F
D,E,F
![Page 10: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/10.jpg)
Asynchronous replication
Client A,B,C
A,B,C
A,B,C
A,B,C
A,B,C
A,B,C
async ACK
![Page 11: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/11.jpg)
Full Mesh
A,B,C A,B,C
D,E,F D,E,F
• Heartbeats.
• Nodes gossip.
• Failover auth.
• Config update.
![Page 12: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/12.jpg)
No proxy, but redirections
A,B,C D,E,F G,H,I L,M,N O,P,Q R,S,T
Client Client
A? D?
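The "no proxy, but redirections" idea can be sketched like this. It is a toy model: the real cluster maps hash slots rather than individual keys to nodes, and `Node`, `client_get` and the node names are hypothetical.

```python
from typing import Dict, Tuple

class Node:
    """Toy cluster node: stores some keys and knows who owns the others."""

    def __init__(self, keys: Dict[str, str], owners: Dict[str, str]):
        self.keys = keys        # keys stored on this node
        self.owners = owners    # key -> name of the node that serves it

    def get(self, key: str) -> Tuple[str, str]:
        if key in self.keys:
            return ("OK", self.keys[key])
        # Not ours: do not proxy the request, tell the client where to go.
        return ("MOVED", self.owners[key])

def client_get(nodes: Dict[str, Node], start: str, key: str, max_hops: int = 5) -> str:
    """Ask `start` for `key`, following redirections until a node answers."""
    target = start
    for _ in range(max_hops):
        status, payload = nodes[target].get(key)
        if status == "OK":
            return payload
        target = payload  # retry against the node named in the redirection
    raise RuntimeError("too many redirections")

owners = {"A": "node1", "D": "node2"}
nodes = {
    "node1": Node({"A": "value-of-A"}, owners),
    "node2": Node({"D": "value-of-D"}, owners),
}
print(client_get(nodes, "node1", "A"))  # value-of-A (no redirection)
print(client_get(nodes, "node1", "D"))  # value-of-D (after one redirection)
```

A client that remembers the redirection can go straight to the right node next time, which keeps the extra hop out of the common path.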
![Page 13: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/13.jpg)
Failure detection
• Failure reports within window of time (via gossip).
• Trigger for actual failover.
• Two main states: PFAIL -> FAIL.
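A rough sketch of the PFAIL -> FAIL promotion, seen from a single node. It assumes that a majority of masters must agree within the window; the class name, the window length and the timestamps are hypothetical.

```python
from typing import Dict

class FailureReports:
    """One node's view of the failure reports received about a single peer."""

    def __init__(self, total_masters: int, window: float):
        self.total_masters = total_masters
        self.window = window                   # how long a report stays valid
        self.reports: Dict[str, float] = {}    # reporter -> time of last report

    def add_report(self, reporter: str, now: float) -> None:
        """Record (or refresh) a failure report learned via gossip."""
        self.reports[reporter] = now

    def state(self, now: float, locally_unreachable: bool) -> str:
        if not locally_unreachable:
            return "OK"
        fresh = [r for r, t in self.reports.items() if now - t <= self.window]
        votes = len(fresh) + 1                 # +1 for this node's own PFAIL
        if votes >= self.total_masters // 2 + 1:
            return "FAIL"                      # enough agreement: trigger failover
        return "PFAIL"                         # only a local suspicion so far

# S3's view of S1 in a four-master cluster (assumed scenario).
view = FailureReports(total_masters=4, window=5.0)
view.add_report("S2", now=1.0)   # too old by the time we evaluate
view.add_report("S4", now=9.5)
print(view.state(now=10.0, locally_unreachable=True))  # PFAIL
view.add_report("S2", now=9.0)   # S2 reports again, inside the window
print(view.state(now=10.0, locally_unreachable=True))  # FAIL
```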
![Page 14: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/14.jpg)
Failure detection
S1
S2
S3
S4
S1 is not responding?
S1 = PFAIL
S1 = PFAIL
S1 = PFAIL
![Page 15: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/15.jpg)
Failure detection
S1
S2
S3
S4
PFAIL state propagates
S1 = PFAIL
S1 = PFAIL Reported by:
S2, S4
S1 = PFAIL
![Page 16: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/16.jpg)
Failure detection
S1
S2
S3
S4
PFAIL state propagates
S1 = PFAIL
S1 = FAIL
S1 = PFAIL
![Page 17: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/17.jpg)
Failure detection
S1
S2
S3
S4
Force FAIL state
S1 = FAIL
S1 = FAIL
S1 = FAIL
![Page 18: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/18.jpg)
Global slots config
• A master FAIL state triggers a failover.
• Cluster needs a coherent view of configuration.
• Who is serving this slot currently?
• Slots config must eventually converge.
![Page 19: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/19.jpg)
Raft and failover
• Config propagation is solved using ideas from the Raft algorithm (just a subset).
• Raft is a consensus algorithm built on top of different “layers”.
• Raft paper is already a classic (highly recommended).
• Full Raft not needed for Redis Cluster slots config.
![Page 20: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/20.jpg)
Failover and config
Failed
Slave
Slave
Slave
Master
Master
Master
Epoch = Epoch+1 (logical clock)
Vote for me!
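The "Epoch = Epoch+1, vote for me!" step can be sketched as below, assuming each master grants at most one vote per epoch and a slave needs a majority of all masters to win. `Master`, `try_failover` and the three-master scenario are hypothetical.

```python
from typing import List, Tuple

class Master:
    """A voting master that remembers the last epoch it voted in."""

    def __init__(self) -> None:
        self.last_vote_epoch = 0

    def request_vote(self, epoch: int) -> bool:
        if epoch <= self.last_vote_epoch:
            return False              # already voted in this (or a newer) epoch
        self.last_vote_epoch = epoch
        return True

def try_failover(current_epoch: int, reachable: List[Master],
                 total_masters: int) -> Tuple[int, bool]:
    """Slave side: bump the epoch (a logical clock) and ask for votes."""
    epoch = current_epoch + 1
    votes = sum(1 for m in reachable if m.request_vote(epoch))
    return epoch, votes >= total_masters // 2 + 1

# Three masters in total, one of them failed, two still reachable.
masters = [Master(), Master()]
print(try_failover(5, masters, total_masters=3))  # (6, True): first slave wins
print(try_failover(5, masters, total_masters=3))  # (6, False): votes already spent
```

A slave that loses has to retry later with a higher epoch, which is what keeps two slaves from being promoted for the same epoch.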
![Page 21: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/21.jpg)
Too easy?
• Why don’t we need full Raft?
• Because our config is idempotent: when the partition heals we can throw away old slots config in favor of newer versions.
• Same algorithm is used in Sentinel v2 and works well.
![Page 22: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/22.jpg)
Config propagation
• After a successful failover, new slot config is broadcast.
• If there are partitions, when they heal, config will get updated (broadcast from time to time, plus stale config detection and UPDATE messages).
• Config with greater Epoch always wins.
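The "greater epoch always wins" rule might look like this when a node receives a claim for a slot. `SlotTable`, `apply_config` and the slot and epoch numbers are hypothetical.

```python
from typing import Dict, Tuple

SlotTable = Dict[int, Tuple[str, int]]   # slot -> (owner, config epoch)

def apply_config(table: SlotTable, slot: int, owner: str, epoch: int) -> bool:
    """Accept a claim for `slot` only if it carries a greater epoch.

    Returns True if the local table changed. Replaying the same message
    or receiving an older one is harmless, so claims can be re-broadcast
    freely after a partition heals.
    """
    current = table.get(slot)
    if current is None or epoch > current[1]:
        table[slot] = (owner, epoch)
        return True
    return False

table: SlotTable = {100: ("A", 3)}
print(apply_config(table, 100, "B", 4))  # True: newer epoch wins
print(apply_config(table, 100, "A", 3))  # False: stale config is ignored
print(table[100])                        # ('B', 4)
```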
![Page 23: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/23.jpg)
Redis Cluster consistency?
• Eventually consistent: last failover wins.
• In the “vanilla” design, write loss is unbounded.
• Mechanisms to avoid unbounded data loss.
![Page 24: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/24.jpg)
Failure mode… #1
Client A,B,C
A,B,C
A,B,C
Failed
A,B,C
A,B,C
lost write…
![Page 25: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/25.jpg)
Failure mode #2
Client
A,B,C
A,B,C
D,E,F
G,H,I
Minority side Majority side
![Page 26: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/26.jpg)
Bound divergences
Client A,B,C
D,E,F
G,H,I
Minority side Majority side
After node-timeout
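A minimal sketch of how the node timeout bounds the divergence: a master that has not heard from a majority of masters for that long stops accepting writes. `MasterNode`, its fields and the timings are hypothetical, and the real check is more involved.

```python
from typing import Dict

class MasterNode:
    """A master that tracks when it last heard from each other master."""

    def __init__(self, total_masters: int, node_timeout: float):
        self.total_masters = total_masters
        self.node_timeout = node_timeout
        self.last_seen: Dict[str, float] = {}   # other master -> last contact

    def heard_from(self, other: str, now: float) -> None:
        self.last_seen[other] = now

    def accepts_writes(self, now: float) -> bool:
        # Count this node plus every master heard from within node_timeout.
        reachable = 1 + sum(
            1 for t in self.last_seen.values() if now - t < self.node_timeout
        )
        return reachable >= self.total_masters // 2 + 1

# Master A is cut off from B and C right after t = 0 (assumed scenario).
a = MasterNode(total_masters=3, node_timeout=15.0)
a.heard_from("B", now=0.0)
a.heard_from("C", now=0.0)
print(a.accepts_writes(now=10.0))  # True: the partition is not yet detected
print(a.accepts_writes(now=20.0))  # False: minority side stops taking writes
```

Writes accepted on the minority side before the timeout fires can still be lost, so the loss window is bounded rather than eliminated.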
![Page 27: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/27.jpg)
More data safety?
• OP logging until async ACK received.
• Re-played to master when node turns into slave.
• “Safe” connections, on demand.
• Example SADD (idempotent + commutative).
• SET-LWW foo bar <wall-clock>.
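Why these two kinds of operation are safe to re-play can be sketched as follows. `SET-LWW` is the slide's own hypothetical command; `lww_merge`, `replay_sadd` and the timestamps are illustrative names and values.

```python
from typing import Iterable, Set, Tuple

Register = Tuple[str, float]   # (value, wall-clock timestamp)

def lww_merge(a: Register, b: Register) -> Register:
    """Last-write-wins: keep the write with the larger timestamp.
    Ties break on the value so the result does not depend on argument order."""
    return max(a, b, key=lambda r: (r[1], r[0]))

def replay_sadd(master_set: Set[str], logged_members: Iterable[str]) -> Set[str]:
    """Re-play logged SADD operations on top of the new master's set.
    Adding is idempotent and commutative, so duplicates and reordering
    cannot change the outcome."""
    result = set(master_set)
    for member in logged_members:
        result.add(member)
    return result

old_side = ("bar", 100.0)   # write logged on the demoted master
new_side = ("baz", 105.0)   # write accepted by the promoted slave
print(lww_merge(old_side, new_side))                    # ('baz', 105.0)
print(sorted(replay_sadd({"a", "b"}, ["b", "c", "c"])))  # ['a', 'b', 'c']
```

Last-write-wins depends on reasonably synchronized wall clocks: with skewed clocks the "latest" write may not be the one the user made last.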
![Page 28: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/28.jpg)
Multi key ops
• Hey hashtags!
• {user:1000}.following {user:1000}.followers.
• Unavailable for small windows, but no data exchange between nodes.
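How hash tags pin related keys to one node can be sketched as below. The slide does not spell out the hash function; this assumes the scheme from the Redis Cluster specification (CRC16 of the key modulo 16384 slots, hashing only the text between the first `{` and the following `}` when it is non-empty). `hash_slot` is a hypothetical helper.

```python
import binascii

def hash_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honoring a {hash tag} if present."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end > start + 1:    # the tag must not be empty
            data = data[start + 1:end]       # hash only the tag
    # binascii.crc_hqx with initial value 0 is the CRC-CCITT (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % 16384

following = hash_slot("{user:1000}.following")
followers = hash_slot("{user:1000}.followers")
print(following == followers)               # True: same tag, same slot
print(following == hash_slot("user:1000"))  # True: only the tag is hashed
```

Because both keys land in the same slot, a multi key operation on them never needs data from another node.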
![Page 29: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/29.jpg)
Multi key ops (availability)
• Single key ops: always available during resharding.
• Multi key ops, available if:
• No manual resharding of this hash slot in progress.
• Resharding in progress, but the source or destination node has all the keys.
• Otherwise we get a -TRYAGAIN error.
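The availability rule for multi key operations can be written as a small decision function, seen from the node that receives the command. `multi_key_reply` and its arguments are hypothetical, and the real server logic has more cases.

```python
from typing import Iterable, Set

def multi_key_reply(resharding: bool, keys_on_node: Set[str],
                    requested: Iterable[str]) -> str:
    """Decide whether a multi key command can be served right now."""
    if not resharding:
        return "OK"                  # slot is stable: serve normally
    if all(k in keys_on_node for k in requested):
        return "OK"                  # this node still (or already) has every key
    return "-TRYAGAIN"               # keys are split across nodes: retry later

keys = ["{User:1}.key_A", "{User:1}.Key_B"]
print(multi_key_reply(False, set(), keys))                # OK
print(multi_key_reply(True, {"{User:1}.key_A"}, keys))    # -TRYAGAIN
print(multi_key_reply(True, set(keys), keys))             # OK
```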
![Page 30: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/30.jpg)
{User:1}.key_A {User:2}.Key_B
{User:1}.key_A {User:1}.Key_B
{User:1}.key_A {User:1}.Key_B
SUNION key_A key_B
-TRYAGAIN
SUNION key_A key_B
… output …
SUNION key_A key_B
… output …
![Page 31: Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014](https://reader033.vdocuments.us/reader033/viewer/2022052912/55a201181a28ab47268b458e/html5/thumbnails/31.jpg)
Redis Cluster ETA
• Release Candidate available.
• We’ll go stable in Q1 2015.
• Ask me anything.