Fault Tolerant Parallel Data-Intensive Algorithms

Mucahid Kutlu, Gagan Agrawal, Oguz Kurt (Ohio State University)

Introduction and Motivation

◆ The Mean-Time-To-Failure (MTTF) of systems is decreasing as the number of cores grows.

◆ For future exascale systems, it has been argued that checkpointing and recovery time (with current methods) will exceed even the MTTF.

◆ Algorithm-based fault tolerance can be an alternative method.

Our Goal

◆ We focus only on fail-stop failures.

◆ We do not use any backup nodes; after a failure, processing continues on the remaining nodes.

◆ We have two main goals for faster recovery:
  ◆ Minimize data loss, since lost data must be reread from the storage cluster.
  ◆ Minimize re-processing of the lost data.

Our Approach

- Intelligent Replication

◆ Minimum data intersection, achieved by dividing the data into blocks and distributing the blocks across different processors.

◆ Passive Replicas
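The block distribution idea can be illustrated with a small sketch. It assumes a rotation-based placement, where each processor's replica blocks are its primary blocks shifted by a fixed offset; this placement rule, the function names `build_layout` and `max_intersection`, and the reading of "data intersection" as the number of blocks two processors hold in common are all assumptions made for illustration, not the authors' stated scheme.

```python
from itertools import combinations


def build_layout(num_procs, blocks_per_proc=2, shift=3):
    """Return {proc: {"primary": [...], "replica": [...]}} with 1-based block ids.

    Assumption: replicas are the primary sequence rotated back by `shift` blocks.
    """
    total = num_procs * blocks_per_proc
    layout = {}
    for p in range(num_procs):
        start = p * blocks_per_proc
        primary = [start + j + 1 for j in range(blocks_per_proc)]
        replica = [(start - shift + j) % total + 1 for j in range(blocks_per_proc)]
        layout[f"P{p + 1}"] = {"primary": primary, "replica": replica}
    return layout


def max_intersection(layout):
    """Largest number of blocks that any two processors hold in common."""
    held = {p: set(v["primary"]) | set(v["replica"]) for p, v in layout.items()}
    return max(len(held[a] & held[b]) for a, b in combinations(held, 2))
```

With seven processors and an odd shift, no two processors share more than one block, so losing any two processors loses at most one block entirely. An even shift such as 2 places both replicas of a processor on blocks owned by a single other processor, which doubles the worst-case loss.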

- Summarization

◆ After processing a block, a processor generates a summary for that block and sends it to the master node.

◆ Blocks whose summaries were sent before the failure do not need to be re-processed.
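As a minimal sketch of summarization, the example below uses k-means, where a block's summary can be the per-cluster coordinate sums and point counts. The `Master` class, its method names, and the choice of summary contents are hypothetical and chosen for illustration; the poster does not specify the summary format, and the original implementation is in C with MPI rather than Python.

```python
def summarize_block(points, centroids):
    """Per-cluster coordinate sums and counts for one block (the block's summary)."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in points:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


class Master:
    """Hypothetical master node that stores one summary per processed block."""

    def __init__(self):
        self.summaries = {}

    def receive(self, block_id, summary):
        self.summaries[block_id] = summary

    def blocks_to_reprocess(self, lost_blocks):
        # Only lost blocks without a stored summary need to be processed again.
        return [b for b in lost_blocks if b not in self.summaries]

    def new_centroids(self, centroids):
        k, dim = len(centroids), len(centroids[0])
        total_sums = [[0.0] * dim for _ in range(k)]
        total_counts = [0] * k
        for sums, counts in self.summaries.values():
            for c in range(k):
                total_counts[c] += counts[c]
                for d in range(dim):
                    total_sums[c][d] += sums[c][d]
        # Keep the old centroid when a cluster received no points.
        return [
            tuple(s / total_counts[c] for s in total_sums[c])
            if total_counts[c]
            else tuple(centroids[c])
            for c in range(k)
        ]
```

Because the master already holds the summary, a block that was summarized before its processor failed contributes to the iteration result without being read or processed again.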

[Figure: system diagram with the master node, the file system, and processors P1, P2, P3, and P4]

Recovery Scenario

◆ P1 and P2 fail at the beginning of the iteration.

◆ The master node notifies:
  - P3 to process D2 and D3
  - P4 to process D4

◆ Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.

[Figure: data blocks held by each processor, in the order shown. P1: D1 D2, D6 D7; P2: D3 D4, D1 D8; P3: D5 D6, D2 D3; P4: D7 D8, D4 D5]
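The recovery scenario can be sketched as a planning step on the master. The layout below follows the four-processor example, read so that the first pair on each processor is primary and the second pair is replica; the function `plan_recovery`, the constant `SCENARIO_LAYOUT`, and the least-loaded tie-breaking rule are assumptions for illustration, not the authors' implementation.

```python
# Assumed reading of the four-processor example: first pair primary, second pair replica.
SCENARIO_LAYOUT = {
    "P1": {"primary": ["D1", "D2"], "replica": ["D6", "D7"]},
    "P2": {"primary": ["D3", "D4"], "replica": ["D1", "D8"]},
    "P3": {"primary": ["D5", "D6"], "replica": ["D2", "D3"]},
    "P4": {"primary": ["D7", "D8"], "replica": ["D4", "D5"]},
}


def plan_recovery(layout, failed, summarized=frozenset()):
    """Return (assignments, reread) for the primary blocks of failed processors.

    assignments maps each surviving processor to the extra blocks it must process;
    reread lists blocks with no surviving copy that must come from storage.
    """
    survivors = [p for p in layout if p not in failed]
    assignments = {p: [] for p in survivors}
    reread = []
    lost = [
        block
        for p in layout
        if p in failed
        for block in layout[p]["primary"]
        if block not in summarized  # summarized blocks need no re-processing
    ]
    for block in lost:
        holders = [p for p in survivors if block in layout[p]["replica"]]
        if not holders:
            reread.append(block)
            continue
        # Prefer the replica holder with the least extra work so far.
        target = min(holders, key=lambda p: len(assignments[p]))
        assignments[target].append(block)
    for block in reread:
        # No copy survived: the least-loaded survivor processes the reread block.
        target = min(survivors, key=lambda p: len(assignments[p]))
        assignments[target].append(block)
    return assignments, reread
```

Calling `plan_recovery(SCENARIO_LAYOUT, {"P1", "P2"})` assigns D2 and D3 to P3, D4 to P4, and marks D1 for rereading before handing it to P4, which matches the scenario described above.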

Experimental Setup

◆ Implemented the k-means and Apriori algorithms in C using the MPI library.

◆ Used 2.5 GHz Opteron processors and 24 GB of memory.

◆ The number of processors is 8.

◆ In the experiments with Hadoop:

◆ Replication factor (R): 3

◆ Summarization frequency (S): 4

Experimental Results

[Figure: Impact of Summary Exchange Frequency in Apriori: Varying Number of Failures]

[Figure: Total Execution Time that Changes with the Number of Failures]

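[Figure: replication layout across processors P1 to P7. Primary blocks: P1: 1 2; P2: 3 4; P3: 5 6; P4: 7 8; P5: 9 10; P6: 11 12; P7: 13 14. Replica blocks, in the order shown: 12 13 1 14 2 3 4 5 6 7 8 9 10 11]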
