
Intermediate Presentation(05/04/15)

Autonomous Failure Detection for Supporting Fault Tolerant Parallel Computation

05/04/15, Taura Lab., Master 2nd year, 46432, Yuuki Horita

Background

Large-scale computation runs in parallel on a great number of nodes in distributed environments (Grid) over a long period of time

High failure rate

• Node / Process Failures

• Network Failures

Fault tolerance is becoming more important

Fault tolerant computing

[Figure: cycle of Computing, Failures, Failure Detection, Recovery, and Resuming, until the end of the computation]

Failure Detection

Heartbeat strategy

[Figure: X monitors Y; callout text truncated: "Y is probably"]

① A process Y sends a message, called a heartbeat, to another process X at a regular time interval Thb

② After Y dies, X receives no more heartbeats from Y

③ X suspects Y once Thb + Tto has elapsed since the last heartbeat was received
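
The three steps above can be sketched from X's side as follows. This is only a minimal illustration, not the implementation presented here: the class name, method names, and the injectable clock are assumptions made for the example, while Thb and Tto are the symbols from the slide.

```python
import time


class HeartbeatMonitor:
    """X's view of the processes it monitors (hypothetical helper)."""

    def __init__(self, t_hb, t_to, clock=time.monotonic):
        self.t_hb = t_hb      # heartbeat interval Thb, in seconds
        self.t_to = t_to      # extra timeout margin Tto, in seconds
        self.clock = clock    # injectable clock, so the logic can be tested
        self.last_seen = {}   # process id -> time of the last heartbeat

    def on_heartbeat(self, proc):
        """Record that a heartbeat from `proc` has just arrived."""
        self.last_seen[proc] = self.clock()

    def suspects(self):
        """Return the processes whose last heartbeat is older than Thb + Tto."""
        now = self.clock()
        limit = self.t_hb + self.t_to
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)
```

A larger Tto lowers the chance of a false suspicion caused by a delayed heartbeat, at the cost of a longer detection latency.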

Objective

To design and implement a failure detection service for supporting fault-tolerant parallel computation

Contributions

Propose a new failure detection approach for fault-tolerant parallel computation

High autonomy

• addresses join/leave of procs.

• supports Grid environments with less manual configuration

High consistency

• all the procs. obtain consistent failure information

High efficiency

• more efficient than other autonomous approaches (the overhead with 313 procs. was at most about 2% with a heartbeat interval of 0.1 [s])

Agenda

• Background

• Demands / Related Works

• Our Approach

• Experiments

• Summary

Demands for Failure Detection

System demand (→ Autonomy)

• Adaptability/Fault-tolerance: address join/leave of processes

• Accessibility: need less manual configuration

Information demand (→ Consistency)

• Consistency: must provide consistent information

Performance demand (→ Efficiency)

• Low overhead: don't deteriorate application performance

• Low detection latency: report failure events ASAP

• Accuracy: fewer false positives

Hierarchical style

MDS (Globus Project), NWS [R. Wolski ’97, N. T. Spring ’99]

• a single point of failure may lead to system failure

• manual configuration may be cumbersome

→ Autonomy Problem

Gossip style [R. van Renesse ’98]

• utilizes the mechanism of rumor spreading

• each process periodically sends a gossip message (like a heartbeat) to a randomly selected process

• a gossip message includes {node, heartbeat} for all processes

node: a process identifier

heartbeat: the latest time when some node received node’s heartbeat
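
The gossip exchange described above can be sketched as a table merge. This is an illustrative sketch only, not code from the cited work: the function names and the dictionary layout (node identifier mapped to latest heartbeat time) are assumptions.

```python
import random


def merge_gossip(local, received):
    """Keep, for every node, the most recent heartbeat time seen so far."""
    for node, heartbeat in received.items():
        if heartbeat > local.get(node, float("-inf")):
            local[node] = heartbeat
    return local


def gossip_round(tables, rng):
    """Each process sends its whole table to one randomly chosen other process."""
    procs = list(tables)
    for sender in procs:
        target = rng.choice([p for p in procs if p != sender])
        # The message is a copy of the sender's {node: heartbeat} table.
        merge_gossip(tables[target], dict(tables[sender]))
```

Because every message carries an entry for every process, the message size grows with the number of processes even when nothing has failed.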

Gossip style

Heartbeats are automatically propagated to all processes in a certain amount of time

• each process judges process failure independently

→ Consistency Problem

• it takes longer to detect failures

→ Efficiency Problem

Basic Design

Separation of failure detection and information propagation

Each process is monitored by some processes (Failure-detection phase)

If a process detects process failures, it broadcasts the information (Information-propagation phase)

• the overhead under normal conditions will be low (Efficiency)

• the failure information will be shared (Consistency)

Failure Detection

Each process autonomously acts so that it is always monitored by some processes

Each process

• requests k randomly selected neighbor processes to monitor it (neighbor: directly connectable)

• sends heartbeats to them at a regular time interval Thb

• requests again in the same way if a monitoring process has failed (self-repairing)

A → B: A sends heartbeats to B (B monitors A)
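
A minimal sketch of the behavior above, seen from the monitored process. The class and method names are hypothetical, and the request/acknowledge messages are reduced to local set updates; only k, the neighbor relation, and the self-repairing step come from the slide.

```python
import random


class MonitoredProcess:
    """Keeps k live monitors for one process (illustrative names only)."""

    def __init__(self, name, neighbors, k, rng=None):
        self.name = name
        self.neighbors = set(neighbors)   # directly connectable processes
        self.k = k
        self.rng = rng if rng is not None else random.Random()
        self.monitors = set()
        self.request_monitors()

    def request_monitors(self):
        """Ask randomly selected neighbors until k monitors are in place."""
        candidates = sorted(self.neighbors - self.monitors)
        missing = min(self.k - len(self.monitors), len(candidates))
        if missing > 0:
            self.monitors.update(self.rng.sample(candidates, missing))

    def heartbeat_targets(self):
        """Processes that should receive a heartbeat every Thb."""
        return sorted(self.monitors)

    def on_failure(self, failed):
        """Self-repair: drop a failed process and request a replacement."""
        self.neighbors.discard(failed)
        if failed in self.monitors:
            self.monitors.discard(failed)
            self.request_monitors()
```

Under normal conditions each process only sends k heartbeats per interval, independent of the total number of processes.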

Information Propagation

Flood along the monitoring network

• no need for extra connections

• redundant paths for broadcast (fault-tolerant)

• at most 2k messages per proc. (scalable)

Can we guarantee that the monitoring network is connected?
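
Flooding over the monitoring network can be sketched as below. This is an assumption-laden illustration: monitoring links are treated as usable in both directions, each process forwards a notice once on every one of its links, and the function names are hypothetical.

```python
from collections import deque


def undirected(monitored_by):
    """Turn {process: set of its monitors} into symmetric link sets."""
    links = {p: set() for p in monitored_by}
    for proc, monitors in monitored_by.items():
        for m in monitors:
            links[proc].add(m)
            links.setdefault(m, set()).add(proc)
    return links


def flood(links, origin):
    """Flood a failure notice from `origin`.

    Returns (set of processes reached, number of messages sent).
    """
    seen = {origin}
    queue = deque([origin])
    messages = 0
    while queue:
        proc = queue.popleft()
        for peer in sorted(links[proc]):
            messages += 1            # proc forwards the notice on every link
            if peer not in seen:     # duplicates are dropped by the receiver
                seen.add(peer)
                queue.append(peer)
    return seen, messages
```

The notice reaches every process only if the monitoring network is connected, which is why the question on this slide matters.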

Connectivity of Monitoring Network

We calculated the probability that the monitoring network is disconnected

[Figure: probability of disconnection (log scale, 1.00E-16 to 1.00E+00) vs. # of nodes (1 to 36)]

The probability of disconnection is negligible if k >= 3

Support Grid Environments

The connectivity between different networks is often limited (e.g. NAT, Firewall)

[Figure: Cluster A and Cluster B, each behind a gateway; the monitoring network is disconnected between them]

Support Grid Environments

k monitoring requests

For each process, every neighbor process should either monitor it directly or be adjacent to k of its monitoring processes
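
The condition above can be checked with a small helper. This is a sketch under one reading of the slide: "adjacent" is taken to mean "directly connectable", and the function name and dictionary layout are assumptions for illustration.

```python
def unsatisfied_neighbors(proc, neighbors, monitors, k):
    """Return the neighbors of `proc` that violate the condition.

    neighbors: dict process -> set of directly connectable processes
    monitors:  dict process -> set of processes monitoring it
    """
    bad = []
    for q in sorted(neighbors[proc]):
        if q in monitors[proc]:
            continue                          # q monitors proc directly
        adjacent = len(neighbors[q] & monitors[proc])
        if adjacent < k:                      # q is adjacent to too few monitors
            bad.append(q)
    return bad
```

For a gateway process whose monitors all sit inside its own cluster, the gateway of the other cluster shows up in the returned list. One possible reaction, not stated on the slide, would be to send additional monitoring requests to the returned neighbors.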

[Figure: step-by-step example. For each neighbor process, the figure tracks whether it is monitoring the process directly or how many of the monitoring processes it is adjacent to.]

Experiment Environment

ISTBS Cluster (112 nodes × 2 CPU), located at Hongo

• Xeon 2.4GHz × 70 + Xeon 2.8GHz × 42

• 105 nodes (7 nodes down)

SHEEP Cluster (65 nodes × 2 CPU), located at Kashiwa

• Xeon 2.4GHz × 65

• 65 nodes

[Figure: the SHEEP cluster in Kashiwa and the ISTBS cluster in Hongo, connected over the Internet]

Demonstration (Java Applet)

Legend: node = a process, edge = monitoring

• many processes will die concurrently, three times (turn black and disappear)

• the surviving processes will detect all of the failures (change in color)

• processes will repair the broken monitoring relations (add new edges)

Connectivity under failures

Simulate the connectivity of the monitoring network under some failures

• check whether the monitoring network is connected when F failures happen concurrently

• 1.8×10^9 trials in each case

Connectivity under failures

Maximum number of concurrent failures for which the probability of disconnection is less than p:

# of procs.       10   20   40   80   160
k=3, p=0.01        3    4    5    8    13
k=3, p=0.0001      2    2    3    3     4
k=4, p=0.01        4    6    9   14    24
k=4, p=0.0001      3    4    4    6    10
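
A simulation of this kind can be sketched as follows. This is not the simulator used for the numbers above: it assumes every process can reach every other process, treats monitoring links as undirected, and uses far fewer trials than the 1.8×10^9 reported; all function names are hypothetical.

```python
import random


def build_monitoring_network(n, k, rng):
    """Each process picks k random monitors; links are treated as undirected."""
    links = {p: set() for p in range(n)}
    for p in range(n):
        others = [q for q in range(n) if q != p]
        for m in rng.sample(others, min(k, len(others))):
            links[p].add(m)
            links[m].add(p)
    return links


def is_connected(links, alive):
    """Depth-first search restricted to the surviving processes."""
    if not alive:
        return True
    start = next(iter(alive))
    seen, stack = {start}, [start]
    while stack:
        for peer in links[stack.pop()]:
            if peer in alive and peer not in seen:
                seen.add(peer)
                stack.append(peer)
    return len(seen) == len(alive)


def disconnection_rate(n, k, failures, trials, seed=0):
    """Fraction of trials in which the survivors end up disconnected."""
    rng = random.Random(seed)
    disconnected = 0
    for _ in range(trials):
        links = build_monitoring_network(n, k, rng)
        alive = set(range(n)) - set(rng.sample(range(n), failures))
        if not is_connected(links, alive):
            disconnected += 1
    return disconnected / trials
```

To estimate a probability as small as p = 0.0001 with any confidence, the number of trials has to be much larger than 1/p, which is consistent with the very large trial count used in the experiment.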

Efficiency

Measured the execution time of a Fibonacci program under each of the following autonomous failure detection services

• all-to-all

• Gossip

• ours

Parameters

• # of processes: 2 ~ 313

• k = 3

• Thb = 0.1, 1.0 [s]
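
A rough count of the failure-free traffic per heartbeat interval helps explain why the three services are expected to differ. The sketch below rests on stated assumptions: all-to-all means every process sends a heartbeat to every other process, gossip sends one message per process listing all processes, and ours sends one heartbeat to each of k monitors. It ignores message headers and failure notifications, and the function name is hypothetical.

```python
def heartbeat_traffic(n, k):
    """Messages and {node, heartbeat} entries sent per heartbeat interval Thb."""
    return {
        # every process sends a heartbeat to every other process
        "all-to-all": {"messages": n * (n - 1), "entries": n * (n - 1)},
        # every process sends one gossip message listing all n processes
        "gossip": {"messages": n, "entries": n * n},
        # every process sends a heartbeat to each of its k monitors
        "ours": {"messages": n * k, "entries": n * k},
    }
```

With n = 313 and k = 3, this gives 97,656 messages for all-to-all, 313 messages carrying 97,969 entries for gossip, and 939 messages for ours.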

Results (Efficiency)

[Figure: results plotted against # of processes (N), 0 to 400, for four series: all-to-all (Thb=0.1[s]), gossip (Thb=0.1[s]), ours (Thb=0.1[s]), all-to-all (Thb=1.0[s])]

Annotations in the figure:

• 10% overhead (N = 127)

• over 5% overhead (N =

• The overhead is at most around 2%

Summary

Proposed a new failure detection technique for fault-tolerant parallel computation

Showed that

• our system can be constructed autonomously in Grid environments

• our system has high fault-tolerance

• it is more efficient than other autonomous approaches

Future Work

• handling network partitioning

• sharing load on dynamic process join

• showing its practicality by implementing a fault-tolerant parallel application using it

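Publications

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Communication Library Supporting Fault-Tolerant Parallel Computation in Distributed Environments. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2004). May 2004. (poster paper)

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. A Failure Detection Mechanism for the Phoenix Programming Model. Summer United Workshops on Parallel, Distributed and Cooperative Processing (SWoPP2004). July 2004.

Yuuki Horita, Kenjiro Taura, Takashi Chikayama. An Autonomous Failure Detection Mechanism Supporting Fault-Tolerant Parallel Computation. Symposium on Advanced Computing Systems and Infrastructures (SACSIS2005). May 2005. (to be presented)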
