Fault Tolerant Parallel Data-Intensive Algorithms


Post on 23-Feb-2016


Mucahid Kutlu, Gagan Agrawal, Oguz Kurt (Ohio State University)

Introduction and Motivation

◆ The mean time to failure (MTTF) of systems is decreasing as the number of cores grows.

◆ For future exascale systems, it is argued that checkpointing and recovery time (with current methods) will exceed even the MTTF.

◆ Algorithm-based fault tolerance can be an alternative method.

Our Goal

◆ We focus only on fail-stop failures.

◆ We do not use any backup node; after a failure, processing continues on the remaining nodes.

◆ We have two main goals for faster recovery:
  ◆ minimize data loss, since lost data must be reread from the storage cluster
  ◆ minimize re-processing of the lost data

Our Approach

- Intelligent Replication

◆ Minimum data intersection: the data is divided into blocks, and the blocks are distributed across different processors.

◆ Passive Replicas
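A rough illustration of why spreading replicas matters, offered as a sketch and not as the authors' implementation. The `place` function and its shift rule are assumptions made for this example: the j-th block owned by processor p gets its replica on processor p + 1 + j (modulo the processor count), compared with a mirrored layout that puts every replica on the next processor. The overlap between two processors is the number of blocks that would have no surviving copy if both failed.

```python
from itertools import combinations


def place(num_procs, blocks_per_proc, spread=True):
    """Return {processor: set of block ids}, one primary and one replica per block.

    Hypothetical rule (not taken from the poster): with spread=True the j-th
    primary block of processor p is replicated on processor p + 1 + j;
    with spread=False every replica goes to processor p + 1.
    """
    if num_procs < 2:
        raise ValueError("need at least two processors to hold a replica")
    placement = {p: set() for p in range(num_procs)}
    for p in range(num_procs):
        for j in range(blocks_per_proc):
            block = p * blocks_per_proc + j
            placement[p].add(block)  # primary copy
            # The modulo keeps the replica off the owning processor.
            shift = 1 + (j % (num_procs - 1)) if spread else 1
            placement[(p + shift) % num_procs].add(block)  # replica copy
    return placement


def max_pairwise_overlap(placement):
    """Largest number of blocks shared by any two processors."""
    return max(len(a & b) for a, b in combinations(placement.values(), 2))


print(max_pairwise_overlap(place(8, 4, spread=True)))   # 2
print(max_pairwise_overlap(place(8, 4, spread=False)))  # 4
```

With 8 processors and 4 blocks each, a double failure loses at most 2 blocks under the spread rule and 4 under mirroring. With 4 processors and 2 blocks each, the same rule gives P1 the blocks {0, 1, 5, 6}, which appears to match the D1 D2 D6 D7 layout in the recovery figure when blocks are numbered from D1.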

- Summarization

◆ After a block is processed, a summary is generated for that block and sent to the master node.

◆ Blocks whose summary was sent before the failure do not need to be re-processed.
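To make the idea of a per-block summary concrete, here is a minimal sketch for one k-means iteration. The poster does not specify the summary format, so this assumes a summary is the per-cluster coordinate sums and point counts of a block; `summarize_block` and `merge_summaries` are hypothetical names. Once the master holds a block's summary, the block's contribution to the new centroids is already known, even if the processor that produced it fails.

```python
def summarize_block(points, centroids):
    """Worker side: per-cluster (coordinate sums, point counts) for one block."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for pt in points:
        # Index of the nearest centroid by squared Euclidean distance.
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(pt, centroids[c])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += pt[d]
    return sums, counts


def merge_summaries(summaries, centroids):
    """Master side: combine block summaries into updated centroids."""
    dim = len(centroids[0])
    updated = []
    for c, old in enumerate(centroids):
        total = sum(counts[c] for _, counts in summaries)
        if total == 0:
            updated.append(list(old))  # empty cluster: keep the old centroid
            continue
        updated.append(
            [sum(sums[c][d] for sums, _ in summaries) / total for d in range(dim)]
        )
    return updated


centroids = [(0.0, 0.0), (10.0, 10.0)]
blocks = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(10.0, 10.0), (11.0, 10.0)],
    [(0.0, 1.0)],
]
summaries = [summarize_block(block, centroids) for block in blocks]
print(merge_summaries(summaries, centroids))
```

The summaries are small compared with the blocks, which is what makes sending one after every block affordable in this sketch.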

[Figure: the master node, the file system, and processors P1 to P4]

Recovery Scenario

◆ P1 and P2 fail at the beginning of the iteration.

◆ The master node notifies:
  - P3 to process D2 and D3
  - P4 to process D4

◆ Since all copies of D1 are lost, the master node reads D1 from the file system/storage cluster and notifies P4 to process it.

[Figure: blocks held by each processor]
P1: D1 D2 D6 D7
P2: D3 D4 D1 D8
P3: D5 D6 D2 D3
P4: D7 D8 D4 D5
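The recovery scenario can be traced with a small planning routine. This is a sketch under stated assumptions, not the authors' code: it assumes the first two blocks listed for each processor are its primary blocks and the last two are passive replicas, and it assumes that a block with no surviving copy is reread and handed to the least-loaded survivor. `plan_recovery` and its arguments are hypothetical names.

```python
def plan_recovery(primary, replicas, failed, summarized=frozenset()):
    """Return (assignments, reread) for the blocks orphaned by a failure.

    assignments maps each surviving processor to the extra blocks it must
    process; reread lists blocks that must be fetched from storage first.
    Blocks in `summarized` already have a summary at the master and are skipped.
    """
    survivors = [p for p in primary if p not in failed]
    load = {p: len(primary[p]) for p in survivors}
    assignments = {p: [] for p in survivors}
    orphaned = sorted(b for p in failed for b in primary[p] if b not in summarized)

    reread = []
    for block in orphaned:
        holders = [p for p in survivors if block in replicas[p]]
        if holders:
            # A passive replica survives: its holder takes over the block.
            target = min(holders, key=lambda p: load[p])
            assignments[target].append(block)
            load[target] += 1
        else:
            reread.append(block)  # every copy was on a failed processor

    for block in reread:
        # Assumed policy: give reread blocks to the least-loaded survivor.
        target = min(survivors, key=lambda p: load[p])
        assignments[target].append(block)
        load[target] += 1
    return assignments, reread


primary = {"P1": ["D1", "D2"], "P2": ["D3", "D4"],
           "P3": ["D5", "D6"], "P4": ["D7", "D8"]}
replicas = {"P1": ["D6", "D7"], "P2": ["D1", "D8"],
            "P3": ["D2", "D3"], "P4": ["D4", "D5"]}

assignments, reread = plan_recovery(primary, replicas, failed={"P1", "P2"})
print(assignments)  # {'P3': ['D2', 'D3'], 'P4': ['D4', 'D1']}
print(reread)       # ['D1']
```

Under these assumptions the plan reproduces the scenario: P3 takes D2 and D3, P4 takes D4, and D1 is reread from storage and given to P4. Passing `summarized={"D2", "D3"}` would drop those two blocks from the plan, since their summaries already reached the master.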

Experimental Setup

◆ The k-means and apriori algorithms were implemented in C using the MPI library.

◆ Experiments used 2.5 GHz Opteron processors and 24 GB of memory.

◆ The number of processors is 8.

◆ In the experiments with Hadoop:
  ◆ Replication factor (R): 3
  ◆ Summarization frequency (S): 4

Experimental Results

[Figure: Impact of Summary Exchange Frequency in Apriori: Varying Number of Failures]

[Figure: Total Execution Time that Changes with the Number of Failures]

[Figure: primary and replica placement of blocks 1 to 14 on processors P1 to P7]