
Page 1: Fault Tolerant Parallel  Data-Intensive Algorithms

Fault Tolerant Parallel Data-Intensive Algorithms

Mucahid Kutlu, Gagan Agrawal (Department of Computer Science and Engineering, The Ohio State University)

Oguz Kurt (Department of Mathematics, The Ohio State University)

HIPC'12, Pune, India

Page 2: Fault Tolerant Parallel  Data-Intensive Algorithms

Outline

• Motivation
• Related Work
• Our Goal
• Data-Intensive Algorithms
• Our Approach
– Data Distribution
– Fault-Tolerant Algorithms
– Recovery
• Experiments
• Conclusion


Page 3: Fault Tolerant Parallel  Data-Intensive Algorithms

Why Is Fault Tolerance So Important?

• Typical first year for a new cluster*:
– ~1000 individual machine failures
– 1 PDU failure (~500–1000 machines suddenly disappear)
– 20 rack failures (40–80 machines disappear; 1–6 hours to get them back)
– Other failures due to overheating, maintenance, …
• If your code runs on 1000 machines for more than a day, it is highly likely that a failure will occur before the code finishes.

*Taken from Jeff Dean's talk at Google I/O (http://perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx)
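To put a number on the last bullet: a back-of-the-envelope estimate, assuming the cluster has on the order of 1000 machines (so the ~1000 failures per year above work out to roughly one failure per machine-year) and failures arrive independently:

\[
\lambda \approx \frac{1000 \text{ machines} \times 1 \text{ failure/machine-year}}{365 \text{ days/year}} \approx 2.7 \text{ failures/day}
\]

\[
P(\text{at least one failure in a one-day run}) = 1 - e^{-\lambda} \approx 1 - e^{-2.7} \approx 0.93
\]

Even this rough model puts the chance of finishing a day-long, 1000-machine run without a failure below 10%.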


Page 4: Fault Tolerant Parallel  Data-Intensive Algorithms

Fault Tolerance So Far

• MPI fault tolerance, which focuses on checkpointing [1], [2], [3]
– High overhead
• MapReduce [4]
– Good fault tolerance for data-intensive applications
• Algorithm-based fault tolerance
– Most of it targets scientific computations such as linear algebra routines [5], [6] and iterative computations [7], including conjugate gradient [8]


Page 5: Fault Tolerant Parallel  Data-Intensive Algorithms

Our Goal

• Our main goal is to develop an algorithm-based fault-tolerance solution for data-intensive algorithms.
• Target failure model:
– We focus on hardware failures.
– Everything on the failed node is lost.
– We do not recover the failed node; processing continues without it.
• When a failure occurs in an iterative algorithm:
– The system should not restart the failed iteration from the beginning.
– The amount of lost data should be as small as possible.


Page 6: Fault Tolerant Parallel  Data-Intensive Algorithms

Data-Intensive Algorithms

• We focus on two algorithms: K-Means and Apriori.
• The approach generalizes to any algorithm with the reduction processing structure sketched below.
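The slide refers to the reduction processing structure without spelling it out, so here is a minimal sketch of that structure in C (the implementation language named on the experiments slide). All identifiers here (classify, local_reduction, ReductionObject, NUM_KEYS) are illustrative assumptions, not names from the paper.

#include <stddef.h>

enum { NUM_KEYS = 4 };              /* illustrative number of reduction keys */

typedef struct {
    double sum[NUM_KEYS];           /* accumulated value per key  */
    long   count[NUM_KEYS];         /* number of elements per key */
} ReductionObject;

/* Map an element to a reduction key (e.g., the nearest centroid in
   K-Means, or a candidate itemset in Apriori). Toy function here. */
static int classify(double x)
{
    int k = (int)x % NUM_KEYS;
    return k < 0 ? k + NUM_KEYS : k;
}

/* The generalized reduction loop: every element is processed
   independently and folded into a small reduction object with an
   associative, commutative update. This property is what makes
   per-block summarization and replication-based recovery possible. */
static void local_reduction(const double *data, size_t n, ReductionObject *r)
{
    for (size_t i = 0; i < n; i++) {
        int key = classify(data[i]);
        r->sum[key]   += data[i];
        r->count[key] += 1;
    }
}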


Page 7: Fault Tolerant Parallel  Data-Intensive Algorithms

Our Approach

• We use a master-slave approach.
• Replication: divide the data to be replicated into parts and distribute the parts among different processors.
– The amount of lost data will be smaller.
• Summarization: slaves send the results for parts of their data before they have processed all of it (see the sketch below).
– Data whose results the master has already received never needs to be re-processed.
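A minimal sketch of the summarization idea on the slave side, assuming MPI (the experiments slide states the system is written in C using MPI). The message layout, tag scheme, and helper names are our own illustrative choices, not the paper's protocol.

#include <mpi.h>
#include <string.h>

#define MASTER      0
#define NUM_KEYS    4
#define TAG_SUMMARY 100             /* per-block tag base; illustrative */

typedef struct {                    /* per-block summary, all doubles, so it
                                       ships as one MPI_DOUBLE array */
    double sum[NUM_KEYS];
    double count[NUM_KEYS];
} Summary;

/* Process one data block and immediately send its summary to the
   master instead of waiting for all local blocks: if this slave dies
   later, only the block currently in flight must be re-processed. */
static void process_and_summarize(const double *block, int len,
                                  int block_id, MPI_Comm comm)
{
    Summary s;
    memset(&s, 0, sizeof s);
    for (int i = 0; i < len; i++) {
        int key = ((int)block[i] % NUM_KEYS + NUM_KEYS) % NUM_KEYS;
        s.sum[key]   += block[i];
        s.count[key] += 1.0;
    }
    /* The master records (sender rank, block_id), so this block's
       work survives the sender's failure. */
    MPI_Send(&s, (int)(sizeof s / sizeof(double)), MPI_DOUBLE,
             MASTER, TAG_SUMMARY + block_id, comm);
}

Each summary is small compared with the block it condenses, so the extra messages cost little while bounding the lost work to at most one partially processed block per failed slave.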


Page 8: Fault Tolerant Parallel  Data-Intensive Algorithms

Data Distribution: Replication

[Diagram: the master distributes data blocks D1–D8 across slaves P1–P4. Each slave holds its primary blocks plus replicas of blocks whose primary copies reside on other slaves.]
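The placement in the diagram can be expressed as a simple mapping. Below is one plausible scheme, assuming round-robin primary placement with replicas shifted so that a block and its replicas never share a node; the paper's exact placement policy may differ, so treat this purely as an illustration.

/* Illustrative placement for B blocks over P slaves with R replicas.
   Valid while r + 1 < num_slaves, so a replica never collides with
   the block's primary owner. */
static int primary_owner(int block, int num_slaves)
{
    return block % num_slaves;
}

static int replica_owner(int block, int r, int num_slaves)
{
    return (block + r + 1) % num_slaves;   /* r-th replica, shifted */
}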

Page 9: Fault Tolerant Parallel  Data-Intensive Algorithms

K-Means with No Failure

[Diagram: the master and slaves P1–P7. Each slave holds its primary data blocks plus replicas of other slaves' blocks. As soon as a slave finishes data portion 1, it sends the master a summary of that portion; the master then broadcasts the new centroids.]
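To make "summary of a data portion" concrete for K-Means: a block's summary can consist of the per-centroid coordinate sums and point counts, which is everything the master needs to recompute centroids. A minimal sketch, where DIM matches the 20-coordinate dataset from the experiments slide but K (the number of centroids) is an illustrative choice:

#include <float.h>
#include <string.h>

#define DIM 20                      /* coordinates per point (experiments slide) */
#define K    8                      /* number of centroids; illustrative */

typedef struct {
    double sum[K][DIM];             /* per-centroid coordinate sums */
    long   count[K];                /* points assigned per centroid */
} BlockSummary;

/* Summarize one block of points against the current centroids.
   Block summaries add up: the master can fold them together in any
   order, so it can rebuild the new centroids from whichever
   summaries have arrived, regardless of which slave produced them. */
static void summarize_block(const double (*pts)[DIM], int n,
                            const double (*centroids)[DIM],
                            BlockSummary *out)
{
    memset(out, 0, sizeof *out);
    for (int i = 0; i < n; i++) {
        int best = 0;
        double best_d = DBL_MAX;
        for (int c = 0; c < K; c++) {            /* nearest centroid */
            double d = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = pts[i][j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        for (int j = 0; j < DIM; j++)
            out->sum[best][j] += pts[i][j];
        out->count[best]++;
    }
}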

Page 10: Fault Tolerant Parallel  Data-Intensive Algorithms

Single Node Failure

Case 1: P1 fails after sending the result of D1.

Recovery: the master notifies P3 to process D2 in the current iteration, and to process D1 starting from the next iteration (D1's result for this iteration was already received).

[Diagram: slaves P1–P7 with primary data blocks and replicas; P1's blocks D1 and D2 survive as replicas on other slaves.]

Page 11: Fault Tolerant Parallel  Data-Intensive Algorithms

Multiple Node Failure

Case 2: P1, P2, and P3 fail; D1 and D2 are lost entirely (no surviving replica).

Recovery:
– The master notifies P7 to read the first data block group (D1 and D2) from the storage cluster.
– The master notifies P6 to process D5 and D6, P4 to process D3, and P5 to process D4.
– P7 reads D1 and D2.

[Diagram: slaves P1–P7 with primary data blocks and replicas, plus a storage cluster holding a full copy of the data.]
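The recovery policy on the last two slides boils down to: for each block owned by a failed slave, hand it to a live replica holder if one exists; only if every copy is gone does some live slave re-read it from the storage cluster. A minimal sketch of that decision logic; the helper functions declared below (is_alive, pick_live_slave, assign_block, assign_storage_read) are hypothetical stand-ins for the runtime's bookkeeping, and replica_owner is the illustrative mapping from the data-distribution sketch above.

/* Hypothetical helpers (declarations only; supplied by the runtime). */
int  is_alive(int slave);
int  pick_live_slave(void);
void assign_block(int slave, int block);
void assign_storage_read(int slave, int block);
int  replica_owner(int block, int r, int num_slaves);

/* Master-side recovery for one block whose primary owner failed. */
void recover_block(int block, int num_slaves, int num_replicas)
{
    for (int r = 0; r < num_replicas; r++) {
        int holder = replica_owner(block, r, num_slaves);
        if (is_alive(holder)) {
            assign_block(holder, block);    /* Case 1: replica survives */
            return;
        }
    }
    /* Case 2: primary and all replicas are gone; fall back to the
       storage cluster, as P7 does for D1 and D2 on this slide. */
    assign_storage_read(pick_live_slave(), block);
}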

Page 12: Fault Tolerant Parallel  Data-Intensive Algorithms

Experiments

• We used the Glenn cluster at the Ohio Supercomputer Center (OSC).
• Each node has:
– Dual-socket, quad-core 2.5 GHz Opterons
– 24 GB RAM
• Allocated 16 slave nodes, with 1 core each.
• Implemented in C, using the MPI library.
• Generated datasets:
– K-Means
• Size: 4.87 GB
• Number of coordinates: 20
• Maximum iterations: 50
– Apriori
• Size: 4.79 GB
• Number of items: 10
• Support threshold: 1%
• Maximum rule size: 6 (847 rules)


Page 13: Fault Tolerant Parallel  Data-Intensive Algorithms


Effect of Summarization

[Charts: "Changing Message Numbers with Different Percentages" and "Changing Data Block Numbers with Different Numbers of Failures".]

Page 14: Fault Tolerant Parallel  Data-Intensive Algorithms


Effect of Replication & Scalability

[Charts: "Effect of Replication for Apriori" and "Scalability Test for Apriori".]

Page 15: Fault Tolerant Parallel  Data-Intensive Algorithms

Fault Tolerance in MapReduce

• MapReduce replicates data in the file system.
– We replicate the data on the processors, not in the file system.
• If a task fails, it is re-executed.
• Completed map tasks also need to be re-executed, since their results are stored on local disks.
• Because of dynamic scheduling, MapReduce can achieve better parallelism after a failure.


Page 16: Fault Tolerant Parallel  Data-Intensive Algorithms

Experiments with Hadoop

• Experimental setup:
– Allocated 17 nodes, with 1 core each.
– Used one node as master and the rest as slaves.
– Used the default chunk size.
– No backup nodes were used.
– Replication factor set to 3.
• Each test was executed 5 times; we took the average of 3 runs after eliminating the maximum and minimum results.
• For our system, we set R = 3, S = 4, and M = 2.
• We measured total time, including I/O operations.


Page 17: Fault Tolerant Parallel  Data-Intensive Algorithms


Experiments with Hadoop (Cont'd)

[Charts: "Single Failure Occurring at Different Percentages" and "Multiple Failure Test for Apriori".]

Page 18: Fault Tolerant Parallel  Data-Intensive Algorithms

Conclusion

• Summarization is effective for fault tolerance.
– Recovery is faster when the data is divided into more parts.
• Dividing the data into small parts and distributing them with minimal intersection decreases the amount of data to be recovered after multiple failures.
• Our system performs much better than Hadoop.


Page 19: Fault Tolerant Parallel  Data-Intensive Algorithms

References

1. J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine. The design and implementation of checkpoint/restart process fault tolerance for Open MPI. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1–8, March 2007.

2. G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Fedak, C. Germain, T. Herault, P. Lemarinier, O. Lodygensky, F. Magniette, V. Neri, and A. Selikhov. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In ACM/IEEE Supercomputing Conference (SC 2002), page 29, November 2002.

3. C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (SC '06). ACM, 2006.

4. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

5. J. S. Plank, Y. Kim, and J. J. Dongarra. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Twenty-Fifth International Symposium on Fault-Tolerant Computing (FTCS-25), pages 351–360, June 1995.

6. T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High performance Linpack benchmark: A fault tolerant implementation without checkpointing. In Proceedings of the International Conference on Supercomputing (ICS '11), pages 162–171. ACM, 2011.

7. Z. Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th ACM International Symposium on High Performance Distributed Computing (HPDC 2011), San Jose, CA, USA, June 8–11, 2011, pages 73–84.

8. Z. Chen and J. Dongarra. A scalable checkpoint encoding algorithm for diskless checkpointing. In 11th IEEE High Assurance Systems Engineering Symposium (HASE 2008), pages 71–79, December 2008.


Page 20: Fault Tolerant Parallel  Data-Intensive Algorithms

Thank you for listening.

Any questions?
