elasticsearch cluster deep dive
TRANSCRIPT
ElasticsearchCluster deep dive
NoSQL: Text Search and Document
Elasticsearch cluster
Cluster documentation
Distributed
Client Nodes
Data Nodes
Master Nodes
Ingest Nodes
Today view of the cluster
Other Nodes
Master Nodes
What happen when a node starts?
Starting
What happen when a node starts?
E
D
A
B
C
Starting1. Get a list of nodes to ping from config
Master
What happen when a node starts?
E
D
A
B
C
Starting1. Get a list of nodes to ping from config2. Each response contains:
a. cluster nameb. node detailsc. master node detailsd. cluster state version
What happen when a node starts?
E
D
A
B
C
Starting1. Get a list of nodes to ping from config2. Each response contains:
a. cluster nameb. node detailsc. master node detailsd. cluster state version
3. Only keeps master eligible responses based on discovery.zen.master_election.ignore_non_master_pings
What happen when a node starts?
E
D
A
B
C
Starting● List of master nodes: [C, C]● List of eligible master nodes: [A, B, C]
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join2. Master validates join sending:
internal:discovery/zen/join/validate
Cluster state update
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join2. Master validates join sending:
internal:discovery/zen/join/validate
3. Master update the cluster state with the new node
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join2. Master validates join sending:
internal:discovery/zen/join/validate
3. Master update the cluster state with the new node
4. Master waits for discovery.zen.minimum_master_nodes master eligible to respond
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join2. Master validates join sending:
internal:discovery/zen/join/validate
3. Master update the cluster state with the new node
4. Master waits for discovery.zen.minimum_master_nodes master eligible to respond
5. Change commited and confirmation sent
What happen when a node starts?
E
D
A
B
C
Starting1. Join master node (C) sending:
internal:discovery/zen/join2. Master validates join sending:
internal:discovery/zen/join/validate
3. Master update the cluster state with the new node
4. Master waits for discovery.zen.minimum_master_nodes master eligible to respond
5. Change commited and confirmation sent
What happen when a node starts?
E
D
A
B
C
Starting1. New node check the received state for
a. new master nodeb. no master node in the state
Master fault detection
E
D
F A
B
C
Started● Every discovery.zen.fd.ping_interval
nodes ping master (default 1s)● Timeout is
discovery.zen.fd.ping_timeout (default 30s)
● Retry is discovery.zen.fd.ping_retries (default is 3)
Node fault detection
E
D
F A
B
C
Started● Every discovery.zen.fd.ping_interval
nodes ping master (default 1s)● Timeout is
discovery.zen.fd.ping_timeout (default 30s)
● Retry is discovery.zen.fd.ping_retries (default is 3)
Master election
Minimum of candidate required
Master election
E
D
F A
B
C
Network Partition
E
D
F A
B
C Master election cannot happen, master steps down
Network Partition
E
D
F A
B
C
Master fault detection triggers new master election
Master election
1. Based on the list of master eligible nodes it chooses in priority:a. The node with the higher cluster state version (part of the ping response)b. Master eligible nodec. Sort alphabetically the id of the remaining a take the first
2. Sends a join to this new master. In the meantime it accumulates join requests
If the current node elected itself as master it waits for the minimum join requests to declare itself as master (discovery.zen.minimum_master_nodes)
In case of master failure detection, each node removes the failed master from the candidates.
Latest cluster version
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v18
v18
v18
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v18
v19
v19
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v18
v20
v20
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v18
v20
v20
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v18
v20
v20
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v19
v20
v20
Lost update partially fixed in 5.0 found by jepsen test
E
D
F A
B
C v19
v20
v20
Cannot become the master
Shard allocation
Shard assigned to new node
1. Master will rebalance shard allocation to have:a. same average number of shard per nodeb. same average of shard per index per node avoiding 2 shard with the
same id on the same node2. Uses deciders to decide which shard goes where based on
a. Hot/Warm setup (time based indices)b. Disk usage allocation (low watermark and high watermark)c. Throttling (node is already recovering, master might again later)
Shard initialization (Primary)
1. Master communicate through cluster state a new shard assignment2. Node initialize an empty shard3. Node notify the master4. Master mark the shard as started5. If this is the first shard with a specific id, it is marked as primary is
receives requests
Shard initialization (Replica)
1. Master communicate through cluster state a new shard assignment2. Node initialize recovery from the primary3. Node notify the master4. Master mark the replica as started5. Node activate the replica
Shard recovery
Shard
S1S2S3
DISK
Memory
S1S2S3
Commit point
In memory buffer
Translog
Recovery from primary
Node with Primary Node with Replica
Start Recovery
1. Validate request2. Prevent translog from deletion3. Snapshot Lucene
Recovery from primary
Node with Primary Node with Replica
Start Recovery
1. Validate request2. Prevent translog from deletion3. Snapshot Lucene
Segments
Recovery from primary
Node with Primary Node with Replica
Start Recovery
1. Validate request2. Prevent translog from deletion3. Snapshot Lucene
Segments
Translog
Recovery from primary
Node with Primary Node with Replica
Start Recovery
1. Validate request2. Prevent translog from deletion3. Snapshot Lucene
Segments
Translog
Notifies master
Thank you !