Replication Internals: Fitting Everything Together
DESCRIPTION
Replication in MongoDB requires deep integration with almost every part of the codebase and has important hooks into systems such as storage, indexing, command processing, and querying. Most of the replication components have recently been overhauled to make further improvements possible. In this talk we will cover what those pieces are, how they interact, and the interesting choices made during their design. We will get into the interaction of the replication protocols (commands, really), writes and write concern enforcement, consensus behaviors (elections, leader/follower, majority), and down into the depths of oplog generation and application on replicas. While a large part of the talk will be a technical overview of the big pieces, we will dive into many important areas to ensure better understanding. The audience will be able to greatly affect which areas we focus on during the session, so come with ideas and a focus.
TRANSCRIPT
![Page 1: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/1.jpg)
Replication Internals
Fitting Everything Together
![Page 2: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/2.jpg)
2.8, Refactored
● Architecture as of 2.8
● Unit testable; more, and faster, cpp tests
● Many changes (heartbeats, locking, future)
● Interop with 2.6
● Larger replica sets
![Page 3: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/3.jpg)
Large Blocks
● Topology Manager (state machine)
● Replication Coordinator (repl facade)
● Applier (replicate/apply oplog)
● Executor (network, heartbeats, serialization)
● Commands (re-config, init, status, etc)
● External (writes, storage, query, commands)
![Page 4: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/4.jpg)
Blocks
[Diagram: Applier, Topology Manager, Replication Coordinator, Oplog, CFG, CMDs, Writes, Query, Executor]
![Page 5: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/5.jpg)
Blocks
[Diagram: Applier, Topology Manager, Replication Coordinator, Oplog, CFG, CMDs, Writes, Query, Executor]
![Page 6: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/6.jpg)
Topology
● Maintains Authoritative State
o Heartbeat, ping, member state
o Roles and transitions
● Contains Decision Logic
● Unit Testable
● Serial Access
[Diagram: Topology Manager, CFG]
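The properties listed on this slide (authoritative state, decision logic, unit testable) can be illustrated with a minimal sketch. Everything here is hypothetical and heavily simplified: `TopologySketch`, `MemberState`, and the method signatures are illustration names, not the server's actual classes. The point is only that the state lives in memory and is changed through plain method calls, so a test can drive it without a network.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, reduced set of member states (the real set is larger).
enum class MemberState { Primary, Secondary, Down };

// Toy topology tracker: pure in-memory state, no network and no threads.
// It is not thread-safe; the caller is expected to serialize access.
class TopologySketch {
public:
    explicit TopologySketch(std::size_t memberCount)
        : _states(memberCount, MemberState::Down) {}

    // Record the outcome of one heartbeat to one member.
    void processHeartbeat(std::size_t member, bool reachable, MemberState reported) {
        _states[member] = reachable ? reported : MemberState::Down;
    }

    // Decision logic: index of the primary we currently know of, or -1 if none.
    int currentPrimary() const {
        for (std::size_t i = 0; i < _states.size(); ++i) {
            if (_states[i] == MemberState::Primary) return static_cast<int>(i);
        }
        return -1;
    }

    // Number of members not currently marked Down.
    std::size_t visibleMembers() const {
        std::size_t up = 0;
        for (MemberState s : _states) {
            if (s != MemberState::Down) ++up;
        }
        return up;
    }

private:
    std::vector<MemberState> _states;
};
```

A unit test can feed heartbeat results in any order and assert on `currentPrimary()` afterwards, which is the kind of testing the refactoring aims to make cheap.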
![Page 7: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/7.jpg)
Examples
● updateConfig
● prepare*Response for commands
● getSyncSource, *
● setFollowerMode (state)
● processHeartbeat
● prepareHeartbeatResponse
![Page 8: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/8.jpg)
PrepareHeartbeatResponse
Status TopologyCoordinatorImpl::prepareHeartbeatResponse(...) {
    // Check error conditions, then set response fields …
    response->setElectable(!_getMyUnelectableReason(...));
    response->setHbMsg(_getHbmsg(...));
    response->setTime(...);
    response->setOpTime(lastOpApplied);
    // Report the sync source only when one is set
    if (!_syncSource.empty()) {
        response->setSyncingTo(_syncSource);
    }
    …
topology_coordinator_impl.cpp:628
![Page 9: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/9.jpg)
Failover Scenario
[Diagram: Heartbeats; nodes P, S, S; labels: Health Check (rsHB), Active Primary]
![Page 10: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/10.jpg)
Failover Scenario
[Diagram: Heartbeats; nodes P, S, S, P; labels: Active Primary, Failed Primary]
![Page 11: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/11.jpg)
Failover Scenario
[Diagram: Heartbeats; nodes Failed, P, S; label: Health Check (rsHB)]
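The failover slides show heartbeats acting as a health check, with a new primary emerging once the old one fails. As a rough sketch of the two ideas involved, the code below treats a member as down when its last heartbeat is older than a timeout and checks whether a node can still see a strict majority of the set. The struct, the function names, and the timeout value used in any call are assumptions for illustration; the real election rules consider more than visibility.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-member health record kept by one node.
struct MemberHealth {
    Clock::time_point lastHeartbeat;  // when this member last answered a heartbeat
    bool isSelf;                      // the local node always counts as up
};

// A member counts as up if it is the local node or answered within `timeout`.
inline bool isUp(const MemberHealth& m, Clock::time_point now,
                 std::chrono::milliseconds timeout) {
    return m.isSelf || (now - m.lastHeartbeat) <= timeout;
}

// True when the local node sees a strict majority of the configured members.
inline bool seesMajority(const std::vector<MemberHealth>& members,
                         Clock::time_point now,
                         std::chrono::milliseconds timeout) {
    std::size_t up = 0;
    for (const auto& m : members) {
        if (isUp(m, now, timeout)) ++up;
    }
    return up * 2 > members.size();
}
```

In a three-member set, a secondary that still hears from the other secondary sees two of three members and keeps a majority, which matches the last slide of the scenario.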
![Page 12: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/12.jpg)
Blocks
[Diagram: Applier, Topology Manager, Replication Coordinator, Oplog, CFG, CMDs, Writes, Query, Executor]
![Page 13: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/13.jpg)
Replication Coordinator
● Interface to other subsystems
● Uses executor to schedule
o Commands
o Elections, Initiate, Reconfig
o Role/State Changes
● Unit Testable
o With help, requires mocking out bridge for subsystems
[Diagram: Replication Coordinator]
![Page 14: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/14.jpg)
Blocks
[Diagram: Applier, Replication Coordinator, Oplog, CMDs, Writes, Query, Executor, Topology Manager, CFG]
![Page 15: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/15.jpg)
Examples
● process*Response for commands
● awaitReplication* (for writes or migration)
● isReplEnabled
● canAcceptWrites*
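The `awaitReplication*` entry points are where write concern gets enforced: a write waits until enough members have replicated it. A minimal sketch of the counting involved follows, assuming a toy optime that is a single increasing counter and a hypothetical `writeConcernSatisfied` helper; real optimes and the real waiting logic are more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy optime: one increasing counter (an assumption for this sketch).
using OpTimeSketch = std::uint64_t;

// True when at least `w` members have applied operations up to `writeOpTime`.
// `memberPositions` holds the last applied optime known for each member,
// including the primary itself.
inline bool writeConcernSatisfied(const std::vector<OpTimeSketch>& memberPositions,
                                  OpTimeSketch writeOpTime,
                                  std::size_t w) {
    std::size_t acked = 0;
    for (OpTimeSketch pos : memberPositions) {
        if (pos >= writeOpTime) ++acked;
    }
    return acked >= w;
}

// Smallest number of members that forms a strict majority of `n`.
inline std::size_t majorityOf(std::size_t n) { return n / 2 + 1; }
```

With positions `{10, 10, 7}` and a write at optime 10, `w = 2` is satisfied while `w = 3` is not until the third member catches up.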
![Page 16: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/16.jpg)
Accepting writes
static bool checkIsMasterForDatabase(const std::string& db, ...) {
    if (!getReplicationCoordinator()->canAcceptWritesForDatabase(db)) {
        errorDetail->setErrCode(ErrorCodes::NotMaster);
        errorDetail->setErrMessage("Not primary while writing to " + ns);
        return false;
    }
    return true;
}
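The check above goes through the coordinator's interface, which is also what makes it testable with a mock. The sketch below shows that shape with hypothetical names (`ReplCoordinatorSketch`, `FixedRoleCoordinator`, `checkCanWrite`); none of them are the server's real types. Treating the `local` database as writable on a non-primary is an assumption made for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical, minimal stand-in for the coordinator interface.
class ReplCoordinatorSketch {
public:
    virtual ~ReplCoordinatorSketch() = default;
    virtual bool canAcceptWritesForDatabase(const std::string& db) const = 0;
};

// Test double with a fixed role, standing in for a mocked coordinator.
class FixedRoleCoordinator : public ReplCoordinatorSketch {
public:
    explicit FixedRoleCoordinator(bool isPrimary) : _isPrimary(isPrimary) {}

    bool canAcceptWritesForDatabase(const std::string& db) const override {
        // Assumption: the node-local "local" database stays writable everywhere.
        return _isPrimary || db == "local";
    }

private:
    bool _isPrimary;
};

// Result of the pre-write check: success flag plus an error message.
struct WriteCheckResult {
    bool ok;
    std::string message;
};

// Ask the coordinator whether a write to namespace `ns` ("db.collection") may proceed.
inline WriteCheckResult checkCanWrite(const ReplCoordinatorSketch& coord,
                                      const std::string& ns) {
    const std::string db = ns.substr(0, ns.find('.'));
    if (!coord.canAcceptWritesForDatabase(db)) {
        return {false, "Not primary while writing to " + ns};
    }
    return {true, ""};
}
```

Because the write path depends only on the interface, a test can swap in `FixedRoleCoordinator(false)` and assert that the write is rejected without starting a replica set.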
![Page 17: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/17.jpg)
Blocks
[Diagram: Applier, Topology Manager, Replication Coordinator, Oplog, CFG, CMDs, Writes, Query, Executor]
![Page 18: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/18.jpg)
Applier
● Reads from *upstream* oplog
● Applies operation transformations
● Mostly unchanged since 2.4
● Includes UpdatePosition commands
Applier
![Page 19: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/19.jpg)
Read + Apply Decoupled
● Background oplog reader thread (net)
● Pool of oplog applier threads (by collection)
[Diagram labels: Repl Source, Network, Buffer, Applier Pool, DB1, DB2, DB3, DB4, Local Oplog]
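One way to picture "pool of applier threads (by collection)" is as a partitioning step: operations fetched into the buffer are split into per-worker queues keyed by namespace, so operations on the same collection stay in order on one worker. The sketch below shows only that partitioning, without threads. `OplogOp`, `partitionByCollection`, and the use of `std::hash` are assumptions for illustration, not the server's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, reduced oplog operation: namespace plus a sequence number.
struct OplogOp {
    std::string ns;  // "db.collection"
    int seq;         // position in the fetched batch
};

// Split a fetched batch into one queue per worker, keyed by namespace.
// All operations for a given collection land in the same queue, in their
// original order. `workers` must be greater than zero.
inline std::vector<std::vector<OplogOp>> partitionByCollection(
        const std::vector<OplogOp>& batch, std::size_t workers) {
    std::vector<std::vector<OplogOp>> queues(workers);
    std::hash<std::string> hasher;
    for (const auto& op : batch) {
        queues[hasher(op.ns) % workers].push_back(op);
    }
    return queues;
}
```

Each queue can then be handed to its own applier thread, while the reader thread keeps filling the buffer from the network.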
![Page 20: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/20.jpg)
Replication Operations
oplog entry (fields):
o = update, o2 = query
{ "ns" : "test.tags",
"op" : "u", "v" : 2, "ts": ...,
"o2" : { "_id" : 1 },
"o" : { "$set" : { "tags.4" : "e" } } }
![Page 21: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/21.jpg)
Blocks
[Diagram: Applier, Topology Manager, Replication Coordinator, Oplog, CFG, CMDs, Writes, Query, Executor]
![Page 22: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/22.jpg)
Executor
● Serializes access to Topology state
● Serializes global state changes wrt db writes
● Processes network requests in IO pool
● Supports event/signal notification
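The "serializes access" bullets can be pictured as a run loop that executes queued callbacks one at a time, so code touching topology state never runs concurrently with itself. The sketch below is a single-threaded toy with a hypothetical `SerialExecutorSketch` class; it has no locking, no network pool, and no events, all of which a real executor would need.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy serial executor: callbacks run one at a time, in FIFO order.
// Not thread-safe; a real executor would guard the queue with a lock.
class SerialExecutorSketch {
public:
    using Task = std::function<void()>;

    // Queue a callback; it may itself schedule further callbacks when it runs.
    void schedule(Task task) { _tasks.push(std::move(task)); }

    // Run queued callbacks until the queue is empty; returns how many ran.
    std::size_t runAll() {
        std::size_t ran = 0;
        while (!_tasks.empty()) {
            Task task = std::move(_tasks.front());
            _tasks.pop();
            task();  // runs to completion before the next callback starts
            ++ran;
        }
        return ran;
    }

private:
    std::queue<Task> _tasks;
};
```

A callback that schedules follow-up work simply appends to the same queue, so the follow-up runs after everything that was already waiting.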
![Page 23: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/23.jpg)
Write Request
● Sent by user
● Interpreted by command subsystem
● Checked by replication coordinator
● Executed
● Idempotent entry recorded in oplog
● ~ Replicated
● ~ Possibly verified during user write request
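"Idempotent entry recorded in oplog" is why the oplog entry shown earlier sets `"tags.4"` to `"e"` rather than recording an append: applying a positional set twice leaves the same document, while applying an append twice does not. The sketch below demonstrates the difference on a plain vector. `applyPush` and `applySetAt` are hypothetical helpers, and that the earlier entry originated from an append-style update is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Not idempotent: every application grows the array by one element.
inline void applyPush(std::vector<std::string>& tags, const std::string& value) {
    tags.push_back(value);
}

// Idempotent: setting a fixed position gives the same result however many
// times it is applied. Missing positions are padded with empty strings
// (a simplification for this toy model).
inline void applySetAt(std::vector<std::string>& tags, std::size_t index,
                       const std::string& value) {
    if (tags.size() <= index) tags.resize(index + 1);
    tags[index] = value;
}
```

Replaying the positional set, for example after a restart in the middle of a batch, leaves the array unchanged, which is the property the applier relies on.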
![Page 24: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/24.jpg)
Write Request
[Diagram: Applier, Replication Coordinator, Oplog, CMDs, Writes, Query, Executor, Topology Manager, CFG]
![Page 25: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/25.jpg)
● Topology Manager (state machine)
● Replication Coordinator (repl facade)
● Applier (replicate/apply oplog)
● Executor (network, heartbeats, serialization)
● Commands (re-config, init, status, etc)
● External (writes, storage, query, commands)
![Page 26: Replication Internals: Fitting Everything Together](https://reader033.vdocuments.us/reader033/viewer/2022060201/559b33901a28ab44638b4622/html5/thumbnails/26.jpg)
Thanks
Questions?