elasticsearch cluster deep dive

ElasticsearchCluster deep dive

NoSQL: Text Search and Document

Elasticsearch cluster

Cluster documentation

Distributed

Client Nodes

Data Nodes

Master Nodes

Ingest Nodes

Today view of the cluster

Other Nodes

Master Nodes

What happen when a node starts?

Starting


E

D

A

B

C

Starting1. Get a list of nodes to ping from config

Master


E

D

A

B

C

Starting1. Get a list of nodes to ping from config2. Each response contains:

a. cluster nameb. node detailsc. master node detailsd. cluster state version


E

D

A

B

C

Starting1. Get a list of nodes to ping from config2. Each response contains:

a. cluster nameb. node detailsc. master node detailsd. cluster state version

3. Only keeps master eligible responses based on discovery.zen.master_election.ignore_non_master_pings


E

D

A

B

C

Starting● List of master nodes: [C, C]● List of eligible master nodes: [A, B, C]


E

D

A

B

C

Starting1. Join master node (C) sending:

internal:discovery/zen/join


E

D

A

B

C


internal:discovery/zen/join2. Master validates join sending:

internal:discovery/zen/join/validate

Cluster state update


E

D

A

B

C




3. Master update the cluster state with the new node


E

D

A

B

C





4. Master waits for discovery.zen.minimum_master_nodes master eligible to respond


E

D

A

B

C





4. Master waits for discovery.zen.minimum_master_nodes master eligible to respond

5. Change commited and confirmation sent


E

D

A

B

C

Starting1. New node check the received state for

a. new master nodeb. no master node in the state

Master fault detection

E

D

F A

B

C

Started● Every discovery.zen.fd.ping_interval

nodes ping master (default 1s)● Timeout is

discovery.zen.fd.ping_timeout (default 30s)

● Retry is discovery.zen.fd.ping_retries (default is 3)

Node fault detection

E

D

F A

B

C

Started● Every discovery.zen.fd.ping_interval

nodes ping master (default 1s)● Timeout is

discovery.zen.fd.ping_timeout (default 30s)

● Retry is discovery.zen.fd.ping_retries (default is 3)

Master election

Minimum of candidate required

Master election

E

D

F A

B

C

Network Partition

E

D

F A

B

C Master election cannot happen, master steps down

Network Partition

E

D

F A

B

C

Master fault detection triggers new master election

Master election

1. Based on the list of master eligible nodes it chooses in priority:a. The node with the higher cluster state version (part of the ping response)b. Master eligible nodec. Sort alphabetically the id of the remaining a take the first

2. Sends a join to this new master. In the meantime it accumulates join requests

If the current node elected itself as master it waits for the minimum join requests to declare itself as master (discovery.zen.minimum_master_nodes)

In case of master failure detection, each node removes the failed master from the candidates.

Latest cluster version

Lost update partially fixed in 5.0 found by jepsen test

E

D

F A

B

C v18

v18

v18


E

D

F A

B

C v18

v19

v19


E

D

F A

B

C v18

v20

v20


E

D

F A

B

C v19

v20

v20


E

D

F A

B

C v19

v20

v20

Cannot become the master

Shard allocation

Shard assigned to new node

1. Master will rebalance shard allocation to have:a. same average number of shard per nodeb. same average of shard per index per node avoiding 2 shard with the

same id on the same node2. Uses deciders to decide which shard goes where based on

a. Hot/Warm setup (time based indices)b. Disk usage allocation (low watermark and high watermark)c. Throttling (node is already recovering, master might again later)

Shard initialization (Primary)

1. Master communicate through cluster state a new shard assignment2. Node initialize an empty shard3. Node notify the master4. Master mark the shard as started5. If this is the first shard with a specific id, it is marked as primary is

receives requests

Shard initialization (Replica)

1. Master communicate through cluster state a new shard assignment2. Node initialize recovery from the primary3. Node notify the master4. Master mark the replica as started5. Node activate the replica

Shard recovery

Shard

S1S2S3

DISK

Memory

S1S2S3

Commit point

In memory buffer

Translog

Recovery from primary

Node with Primary Node with Replica

Start Recovery

1. Validate request2. Prevent translog from deletion3. Snapshot Lucene



Start Recovery


Segments



Start Recovery


Segments

Translog



Start Recovery


Segments

Translog

Notifies master

Thank you !

elasticsearch cluster deep dive

Technology