Review on Fault Tolerance Mechanisms Implemented in Grid Systems

Anisa SHEHU
Department of Computer Engineering
Polytechnic University of Tirana
Sheshi Nene Tereza, Tirane, Albania
Email: [email protected]

Edlira GEGA
Department of Computer Engineering
Polytechnic University of Tirana
Sheshi Nene Tereza, Tirane, Albania
Email: [email protected]

Abstract: A Grid represents an association of resources located in different administrative domains which work together on common tasks. The resources involved are characterized by heterogeneity and dynamic availability. Due to these characteristics, node failures are considered normal behavior; hence fault tolerance is crucial for computational grids and a challenging issue at the same time. In this review we present some key fault tolerance strategies implemented in grid environments.

Key words: Distributed System, Grid Computing, Fault Tolerance mechanisms and algorithms, Replication, Scheduling, Check-pointing

I. Introduction

A grid represents a set of interconnected resources, virtually and geographically distributed across several administrative domains. This virtual organization is expected to work as a whole by coordinating its components while maintaining transparency to the user. Grid computing engages these resources to work on a common task such as intensive computational calculations, complex scientific problems, weather forecasting, genome matching, a wide range of simulations etc. Due to its heterogeneity, its large scale and its dynamic nature, the grid system is prone to faults and errors that put its integrity at risk. For this reason it is crucial for this environment to implement fault tolerance and data recovery mechanisms.

II. Grid Architecture

There have been various approaches to conceptualizing the grid architecture [1]. In the five-layer approach, the first layer, called the fabric layer, provides resources, be they physical units such as storage systems, servers and catalogs, or logical units such as database systems and shared file systems.

Figure 1. Layered architecture of Grid

This layer also provides the interfaces to these resources, which commonly include functions to query the state and capabilities of the resources. Communication with these units and service delivery follow grid protocols. The connectivity layer handles communication and authentication protocols and is responsible for secure transactions across the network. For example, in many cases programs rather than human users must be authenticated, and delegating rights from users to programs is an issue to address here. The resource layer provides the protocols for initiation, negotiation, accounting, monitoring and payment for individual resources. Information protocols and management protocols are the two groups of protocols used in this layer. The resource layer builds on top of the functions provided by the connectivity layer and calls the interfaces of the fabric layer. Typically this layer provides functions to retrieve configuration information on a specific resource, read data or create processes on single resources. The collective layer deals with global resource management. Its tasks include controlling access to multiple resources and offering services for resource discovery, allocation, task scheduling etc. Finally, the application layer covers the user applications that exploit the capabilities and services offered by the grid computing environment. In the other approaches the functions of these layers are dispersed or reorganized.

Thus, in the four-layer approach, the first layer covers the grid resources: physical facilities, network resources, data storage etc. The second layer, the network layer, is responsible for reliable communication among these resources. The middleware layer is similar to the connectivity layer of the five-layer approach in terms of functionality and competence, but also includes some of the resource layer's tasks, such as monitoring. The top layer in this approach is the application layer. In the three-layer approach the network layer is dissolved, and the allocation of resources to jobs, the implementation of secure transactions and other management issues are included in the middleware layer. The three layers of this approach, bottom up, are: the resource layer, the middleware layer and the application layer.

III. Grid Characteristics

A key issue in grid computing systems is that resources from different domains and organizations are grouped to form a virtual organization, enabling people to work on common tasks and projects. From this point of view, resource sharing and coordination are two features that define the grid environment.

The single resources can vary from simple sensors, scientific instruments or display devices to large data systems and supercomputers. The heterogeneity of resources is a major characteristic of the grid computing system.

As mentioned above, many of the resources are located in different organizations, which are possibly geographically dispersed and are managed differently. This aspect uncovers another characteristic of the grid, that of multiple administration. Each organization can implement its own privacy and security policies under which its resources can be accessed. The communication among the components must nevertheless be transparent to users: the system in its entirety must appear as a single virtual computer.

Another typical attribute of the grid environment is scalability. A grid must be able to handle a number of resources ranging from a few to possibly millions. Despite the number and variety of resources, the grid must provide dependable and consistent access. The services must be delivered under agreed QoS requirements and must be built on standard protocols and interfaces, hence hiding the heterogeneity of the single components. In practice, under imperfect circumstances, communication problems can arise for numerous reasons, ranging from malicious nodes to physical failure of nodes and communication faults. The grid must offer pervasive access: independently of these faults, the system must be flexible and able to adapt to the dynamic conditions of the environment.

IV. On Fault Tolerance

Fault tolerance [2] refers to the ability to preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. It aims at the avoidance of failures in the presence of faults.

Some of the categories into which faults can be classified are:

• Physical faults: Memory, CPU and storage device faults
• Network faults: Faults that occur due to network failures, packet loss, packet corruption
• Media faults: Disk failures, crashes
• Process faults: Software bugs, resource shortage
• Processor faults: Hardware or operating system crashes
• Service expiry fault: The service time of a resource in the grid may expire while the application is using it
• Interaction faults: Protocol incompatibilities, policy problems, timing overhead
• Unconditional termination: The user might have pressed Ctrl+C
• Life-cycle faults: Legacy or version faults

Based on their occurrence, faults can be:

• Transient: faults that occur once and then disappear
• Intermittent: faults that reoccur with no certain pattern
• Permanent: faults that are persistent and continue to exist until external intervention.

Cristian [1991] and Hadzilacos and Toueg [1993] described another scheme to classify different types of failures, namely: crash failure, omission failure, timing failure, response failure and arbitrary failure.

V. Fault Tolerant Techniques and Mechanisms Implemented in Grid Environments

In order to adapt to and overcome failures, fault tolerant mechanisms need to be implemented in the grid.

Figure 2: A basic fault tolerant grid architecture

The Grid Interface or User Interface provides an end user with an interface to submit a job or data. The Allocator/Grid Scheduler (GS) is responsible for selecting the optimal resources for the job, based on the QoS requirements established by the user. The Information Server (IS) stores information about every single resource of the grid. This information can include load, available memory, CPU capacity etc. The GS queries the Information Server continuously to receive information on nodes in real time. Finally, the Fault Handler is responsible for detecting the failure of resources and updating the IS with information about the time required to handle these faults.
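To make the interplay of these components more concrete, the following minimal Python sketch models the Information Server, the Grid Scheduler and the Fault Handler of Figure 2. All class and method names (`ResourceInfo`, `InformationServer`, `GridScheduler.select`, `FaultHandler.report_failure`) and the selection rule (the least loaded live node with enough free memory) are illustrative assumptions and do not correspond to any specific grid middleware.

```python
from dataclasses import dataclass


@dataclass
class ResourceInfo:
    """Hypothetical record the Information Server keeps per resource."""
    name: str
    load: float            # current load, 0.0 (idle) to 1.0 (saturated)
    free_memory_mb: int
    alive: bool = True


class InformationServer:
    """Stores the latest known state of every resource in the grid."""

    def __init__(self):
        self._resources = {}

    def update(self, info):
        self._resources[info.name] = info

    def mark_failed(self, name):
        self._resources[name].alive = False

    def alive_resources(self):
        return [r for r in self._resources.values() if r.alive]


class FaultHandler:
    """Reports detected resource failures to the Information Server."""

    def __init__(self, server):
        self.server = server

    def report_failure(self, name):
        self.server.mark_failed(name)


class GridScheduler:
    """Selects a resource for a job from the Information Server's view."""

    def __init__(self, server):
        self.server = server

    def select(self, min_memory_mb):
        # Assumed QoS requirement: a minimum amount of free memory.
        candidates = [r for r in self.server.alive_resources()
                      if r.free_memory_mb >= min_memory_mb]
        if not candidates:
            return None
        # Assumed policy: prefer the least loaded candidate.
        return min(candidates, key=lambda r: r.load).name
```

In this sketch the scheduler never sees a node again once the fault handler has reported it, which is the minimal form of the cooperation between GS, IS and Fault Handler described above.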

Fault tolerance aims at the avoidance of failures in the presence of faults. It must be integrated as a service in the system to detect errors and recover from faults, thus avoiding the failure of the grid.

Below we briefly discuss some of the algorithms and techniques implemented in various fault tolerance recovery strategies. Besides the detection and “correction” of a system failure, an important part of grid computing is the use of prevention and resistance techniques, which rely on protective measures [3]. In the prevention stage, system faults are predicted, which is preferable to moving on to the later stages and having to take care of recovery. Otherwise, faults have to be detected through monitoring and notifications from the event manager, which lead to recovery services as described in Figure 3.

Figure 3. Stages of fault prevention and recovery in the system

Faults in a grid environment are problematic because one fault can cause a series of other faults, or several faults can occur at the same time. There are two main categories of fault tolerance techniques, depending on the stage at which they are used: proactive methods are used in the prevention stage and reactive methods are used in the reduction stage.

Figure 4. Classification of fault tolerance techniques

Table 1 gives a short comparison between proactive and reactive FT.

Table 1. Comparison of proactive and reactive fault tolerance techniques

A. Proactive FT: It prevents or avoids faults while the application is running by predicting them based on experience. Some of the mechanisms based on this technique are preemptive migration and software rejuvenation/self-healing [4]:

a. FT using rejuvenation: This method periodically cleans up the internal state of the system, which also guards against potential attackers. Self-healing is a form of software rejuvenation that improves the QoS and performance of servers.

b. FT using preemptive migration: This technique migrates parts of the system away from a component that is predicted to fail soon.

B. Reactive FT: It reduces the effect of faults on the running application after they have occurred. Mechanisms based on this technique include replication, rescheduling and checkpointing/restart:

a. Replication: refers to creating multiple copies of data/jobs and sending them to different independent entities in the grid, in order to avoid a single point of failure and to make it very likely that at least one replica delivers correctly. The main idea behind job replication or job migration is to send the original job and its copies to different CPUs to run and to wait until the first replica finishes [5]. Job replication is commonly used to enhance the availability of the grid. Replication is based on the assumption that the probability of a single resource failure is much higher than that of a simultaneous failure of multiple resources [6].
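The "send copies to different CPUs and wait for the first one to finish" idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: resources are represented by plain names, replicas run as local threads instead of remote nodes, and the function name `run_replicated` is hypothetical.

```python
import concurrent.futures as cf


def run_replicated(job, resources, *args):
    """Run `job(resource, *args)` once per resource and return the
    (resource, result) pair of the first replica that succeeds."""
    if not resources:
        raise ValueError("at least one resource is required")
    errors = {}
    with cf.ThreadPoolExecutor(max_workers=len(resources)) as pool:
        futures = {pool.submit(job, r, *args): r for r in resources}
        for fut in cf.as_completed(futures):
            resource = futures[fut]
            try:
                result = fut.result()
            except Exception as exc:
                # This replica failed; keep waiting for the others.
                errors[resource] = exc
                continue
            # First successful replica wins. Replicas that have not
            # started yet are cancelled; threads that are already
            # running cannot be interrupted and simply run to completion.
            for other in futures:
                other.cancel()
            return resource, result
    raise RuntimeError(f"all replicas failed: {errors}")
```

The job only fails as a whole when every replica fails, which is exactly the situation that replication assumes to be much less likely than a single resource failure.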

b. Rescheduling: Replication and scheduling are often coupled in fault tolerant mechanisms. In job scheduling, submitted jobs are divided into subtasks and allocated to resources. Resource scheduling matches the query for a resource to a group of available resources. A number of techniques and algorithms combine these fault tolerance recovery strategies. When tasks fail, rescheduling is used to find different resources that can accept and run the failed tasks. Abawajy [7] proposes a fault tolerant scheduling policy (DFTS) that couples job scheduling with job replication, aiming at efficient job execution. He assumes that the system is divided into sites, each of which has a scheduling manager. These scheduling managers act as backups for each other, with one scheduling manager supporting another. Each job replica is scheduled to run on a different site. This algorithm is static and uses a known number of replicas given by the user upon job submission. Comparison of this scheduler against a non-fault-tolerant scheduling policy showed that it performs reasonably in the presence of various types of failures.

However, having a fixed number of replicas leads to excessive utilization of resources and longer response times. The results yielded by static job replication algorithms are surpassed by dynamic or adaptive job replication algorithms [6].

A mechanism based on job replication is introduced by Amoon [8]. This mechanism seeks to determine the number of job replicas dynamically, based on the failure history of the system, and then schedules these replicas. The goal is to use the minimal number of indispensable replicas in order to minimize system overhead. Upon job completion by a certain replica, the other replicas of the job are terminated in order to free up the backup resources.

The Adaptive Job Replication (AJR) algorithm is used to determine the number of job replicas, followed by the Backup Resources Selection (BRS) algorithm, which schedules the replicas. While the number of job replicas is determined based on the failure tendency of the resources handling the job, the backup resources are chosen according to the current load of the resources allocated to the job.

Figure 5. AJR algorithm performance compared to DFTS

Simulation results show a good performance gain of the AJR algorithm compared to DFTS in terms of grid load, mainly because DFTS is a static algorithm.
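The difference between a fixed and an adaptive number of replicas can be illustrated with a small Python sketch. It is not the AJR or BRS algorithm of [8], whose exact formulas are not reproduced here; it only assumes that resources fail independently with a probability estimated from the failure history, and that backups are the least loaded resources. The function names, the default estimate of 0.5 for resources without history, the target of 0.01 and the cap of 5 replicas are all assumptions made for the illustration.

```python
def failure_rate(failed_jobs, total_jobs):
    """Estimate a resource's failure probability from its history."""
    if total_jobs == 0:
        return 0.5  # assumption: no history, so assume a coin flip
    return failed_jobs / total_jobs


def replicas_needed(p_fail, target=0.01, max_replicas=5):
    """Smallest number of replicas n such that the probability of all
    n replicas failing, p_fail ** n, does not exceed `target`
    (assuming independent failures), capped at `max_replicas`."""
    n = 1
    while p_fail ** n > target and n < max_replicas:
        n += 1
    return n


def select_backups(loads, n):
    """Pick the n least loaded resources; `loads` maps name -> load."""
    return sorted(loads, key=loads.get)[:n]


# A reliable resource needs fewer replicas than an unreliable one.
print(replicas_needed(failure_rate(1, 20)))   # p = 0.05
print(replicas_needed(failure_rate(10, 50)))  # p = 0.20
print(select_backups({"a": 0.9, "b": 0.1, "c": 0.5}, 2))
```

Under these assumptions the number of replicas shrinks as the failure history improves, which is the source of the reduced grid load that adaptive schemes aim for.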

c. Checkpointing/Restart: Checkpointing is a technique that takes a snapshot of the running application and saves its state in secure storage, from which the application can be recovered after a fault, so that the system remains coherent. This technique avoids restarting the application from the beginning. A critical aspect of checkpointing and recovery is the length of the checkpoint interval, since checkpointing too frequently causes overhead.
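A minimal, single-process Python sketch of checkpoint/restart is given below. It assumes that the application state is a small picklable dictionary and that a local file stands in for the secure storage; the function names and the `fail_at` parameter (used only to simulate a crash) are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file and rename it atomically,
    so that a crash during writing never corrupts the last checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if none exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, interval, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `interval`
    steps and resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If the first call crashes, a second call with the same `path` continues from the last saved step instead of step 0; only the work done since that checkpoint is repeated, which is why the interval length matters.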

A grid can be viewed as a set of hierarchically organized clusters that are connected together and share their data. From this point of view, the backup (checkpoint) takes place in four phases:

1) Initialization: An initiator sends a checkpoint request to its leader.

2) Coordination of leaders: The leader forwards the checkpoint request to the other leaders.

3) Local checkpointing: Each leader initiates a checkpoint inside its cluster.

4) Termination: When the local checkpoint is over, each leader sends an acknowledgment to the initial leader.

A consistent global checkpoint is a set of N local checkpoints, one from each process, forming a consistent system state [9].
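The four phases above can be traced with a small sequential simulation in Python. The sketch makes strong simplifying assumptions: there is no real network, no messages are in flight while the checkpoint is taken, and no leader fails during the protocol. The `Cluster` class and the `hierarchical_checkpoint` function are hypothetical names introduced for the illustration.

```python
class Cluster:
    """A cluster whose leader can checkpoint all local processes."""

    def __init__(self, name, process_states):
        self.name = name
        self.process_states = process_states  # process id -> state
        self.checkpoint = None

    def local_checkpoint(self):
        # Phase 3: the leader snapshots every process in its cluster.
        self.checkpoint = dict(self.process_states)
        # The return value plays the role of the acknowledgment.
        return self.name


def hierarchical_checkpoint(clusters, initiator_cluster):
    """Run the four phases; `clusters` maps cluster name -> Cluster."""
    events = []

    # Phase 1: an initiator sends a checkpoint request to its leader.
    events.append(("request", initiator_cluster))

    # Phase 2: that leader forwards the request to the other leaders.
    others = [name for name in clusters if name != initiator_cluster]
    for name in others:
        events.append(("forward", name))

    # Phase 3: every leader checkpoints its own cluster.
    acks = set()
    for name in [initiator_cluster] + others:
        acks.add(clusters[name].local_checkpoint())
        events.append(("local_checkpoint", name))

    # Phase 4: the initial leader waits for all acknowledgments.
    if acks != set(clusters):
        raise RuntimeError("missing acknowledgment from a leader")

    global_checkpoint = {n: c.checkpoint for n, c in clusters.items()}
    return global_checkpoint, events
```

The returned dictionary contains one local checkpoint per process, grouped by cluster; a real protocol would additionally have to handle messages in transit to guarantee that this set is consistent.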

Figure 6. Hierarchical checkpoint in grid computing

As mentioned above, the checkpoint interval is an important factor since it influences the overhead in the system. Other factors that affect the efficiency and performance of a checkpoint mechanism include checkpoint size, availability and checkpoint frequency. The works in [10] and [11] discuss whether to keep all the checkpoint data on a single node or on more than one, and point out that the size of a single checkpoint can be a problem because the storage space requirement grows with it.

Figure 7. Factors that affect checkpoint efficiency

Checkpoint/Restart is suited to large, long-running applications, where after a change in the system the application either takes a checkpoint or restarts from the most recent checkpoint.
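The trade-off behind the checkpoint interval can be made concrete with a well-known first-order rule of thumb, Young's approximation, which is not taken from the works cited in this review. It assumes that failures arrive with a known mean time between failures (MTBF), that writing a checkpoint costs a fixed time, and that on average half an interval of work is lost per failure. The function and parameter names are illustrative.

```python
import math


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order overhead per unit of useful work: the time spent
    writing checkpoints plus the expected rework after a failure
    (on average half an interval is lost per failure)."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def young_interval(checkpoint_cost_s, mtbf_s):
    """Interval that minimizes `overhead_fraction`:
    sqrt(2 * checkpoint cost * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Example: a 10 s checkpoint on a system with a 2 h MTBF.
best = young_interval(10.0, 7200.0)
print(round(best, 1), "seconds between checkpoints")
print(round(overhead_fraction(best, 10.0, 7200.0), 4), "overhead")
```

Checkpointing much more often than this wastes time writing snapshots, while checkpointing much less often wastes time recomputing lost work, so both directions increase the overhead.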

Figure 8. Classification of Check-point mechanism

This technique can be categorized by:

i. Level of application

a) Application level: The key element of this checkpoint technique is the application code, which means that the application has to manage and store checkpoints itself. In [12], a tool for application-level checkpointing is proposed, CPPC (ComPiler for Portable Checkpointing), which focuses on inserting checkpoint code into long-running message passing applications.

b) System level: This kind of checkpoint is taken while the system is executing a task. The disadvantage of this level of checkpointing is that it requires more data to save the state of the application. It is worth pointing out that replication is used to run duplicate jobs, while checkpointing is used to store the intermediate results. As stated in [10], this technique gives better results than application-level checkpointing.

ii. Method of taking: a classification that describes how checkpoints are taken.

a) Uncoordinated checkpoint: In this method each process can take a checkpoint whenever it is most convenient. However, because each process is uncoordinated with the others, this can lead to the domino effect, or to a situation in which no global checkpoint representing a consistent state of the system exists.

b) Coordinated checkpoint: In this method each process must cooperate with the other processes when checkpointing in order to create a consistent global state, which avoids the domino effect. Each process manages its own permanent checkpoint on stable storage, which decreases overhead. Despite these advantages, this method introduces a large latency in delivering results.

c) Communication-induced checkpoint: According to [18], this is the best of the three techniques: it avoids the domino effect while still allowing different processes to take some of their checkpoints independently.

iii. Types: This categorization applies to coordinated checkpoints.

a) Blocking: In this type of checkpoint, the process blocks its execution until it has sent the acknowledgment that the checkpoint has been taken.

b) Non-Blocking: In this type of checkpoint, the process continues its execution during the checkpoint phase.

Fault Tolerance using Adaptive Replication in Grid Computing (FTARG), proposed by Srinivasa et al. [13], is an adaptive replication middleware which replicates data at different sites. FTARG synchronizes data across heterogeneous databases in the grid by offering numerous synchronization modes. It uses the Totem single-ring protocol to manage the safe delivery of messages in a broadcast domain, and group membership is handled by the Internet Group Management Protocol. Experimental analysis showed that FTARG improves the performance of data management for large-scale complex grid-based applications by reducing the response time to clients compared to JGroups, in terms of the number of nodes and concurrent users.

Research on Novel Dynamic Resource Management and job scheduling in grid computing (RNDRM) [14] uses a two-layered Heap Sort Tree (HST) to model computational resources, in order to calculate their available computational power and that of the whole system. The node with the greatest computational ability is selected as the root node of the HST and is ready to receive jobs from the scheduler. The advantages of this strategy include enhanced scalability, robustness, fault tolerance, higher efficiency and dynamic status information on system nodes in an unpredictable grid environment. On the other hand, this strategy does not report job submission failures, may not handle resource submission failures, suffers from high job waiting times and falls short of providing a real-time dynamic grid environment.

Agent Based Resource Management with Alternate Solution (ABRMAS) [15] identifies an alternative resource to handle a job in case of failure in resource discovery, without affecting performance. This approach reduces the delay overhead of waiting for the unavailable resource and enhances the system's efficiency. The implementation results show that the system success rate is 30% higher with the alternate solution. This strategy is useful not only in case of failure in resource discovery but also when more than one solution proposal is offered.

A decentralized fault tolerant model for Grid Computing [16] models grid resources as nodes in a dynamic colored graph. A decentralized model avoids the limitations encountered by the centralized and hierarchical models regarding fault tolerance, scalability and autonomy. Nonetheless, this approach has a higher level of complexity and poses other problems linked to the management of distributed information, resource coordination and security. In this model, for each of the grid's nodes a set of neighbor collaborators is defined that is able to replace it in case of failure. The number of collaborators is limited by a user-defined threshold alpha, which represents a degree of tolerance. In the first step, collaborators are searched among the neighboring nodes. These nodes are classified into three categories depending on the value of the threshold alpha: stable, unstable and hyper stable nodes. In the second step, the stabilization phase, each unstable node attempts to auto-stabilize through hyper stable nodes across the graph. In the proposed model, the graph always converges to a stable state. The experimental results showed that the method based on the average number of neighbors yielded better results in terms of stabilization of vertices.

Congfeng Jiang et al. [17] introduce a Fuzzy-logic based Self-Adaptive job Replication Scheduling (FSARS) algorithm to handle the uncertainty in the number of job replicas, which is closely related to the trust factors behind grid sites or user jobs. The algorithm matches the user security requirements with the resource trust level [6]. Their experiments showed that a higher scheduling success rate and lower grid resource utilization can be achieved through FSARS.
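The idea used by RNDRM [14] of always keeping the most capable node at the root of a heap can be illustrated with Python's standard `heapq` module. This sketch is not the two-layered HST of [14]: it uses a single flat heap, and the class name `ResourceHeap`, the abstract "power" units and the rule that a dispatched job reduces a node's available power by its cost are all assumptions made for the illustration.

```python
import heapq


class ResourceHeap:
    """Keeps the node with the greatest available computational
    power at the root, so the scheduler can find it in O(1)."""

    def __init__(self):
        self._heap = []  # entries are (-available_power, node_name)

    def add(self, node, power):
        # heapq is a min-heap, so the power is negated to obtain
        # max-heap behaviour.
        heapq.heappush(self._heap, (-power, node))

    def total_power(self):
        """Available computational power of the whole system."""
        return sum(-neg_power for neg_power, _ in self._heap)

    def dispatch(self, cost):
        """Send a job to the root node and re-insert the node with
        its available power reduced by the job's cost."""
        neg_power, node = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (neg_power + cost, node))
        return node


heap = ResourceHeap()
heap.add("n1", 8)
heap.add("n2", 5)
heap.add("n3", 3)
print(heap.dispatch(4))  # n1 has the most power and becomes the root
print(heap.dispatch(2))  # n2 now has more available power than n1
```

Re-inserting the node after each dispatch keeps the ordering up to date as loads change, which is what lets the root always be the node that is ready to receive the next job.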

VI. Conclusion

In this paper we presented an overview of the grid system and fault tolerance recovery strategies. We discussed several fault classification methods, the importance of systems resistant to faults and some of the key mechanisms based on data and job replication, scheduling and check-pointing in grids. Trends in distributed computing have recently favored the cloud, and a great deal of research is directed towards developing efficient and fault tolerant techniques implemented in the cloud. As future work we consider further exploring techniques and algorithms for enhancing performance in distributed environments.

References

[1] Carmen Bratosin, Wil van der Aalst, Natalia Sidorova, and Nikola Trcka, “A Reference Model for Grid Architectures and Its Analysis”.

[2] Mrs Radhal, Dr. V. Sumathy, “A Detailed Study of Resource Scheduling and Fault Tolerance in Grid”, IJCSI, Vol. 8, Issue 6, No. 2, November 2011.

[3] R. K. Bawa and Ramandeep Singh, “Comparative Analysis of Fault Tolerance Techniques in Grid Environment”, IJCA, Volume 41, March 2012.

[4] Amritpal Singh, Supriya Kinger, “An Efficient Fault Tolerance Mechanism Based on Moving Averages Algorithm”, IJARCSSE, ISSN: 2277 128X, 2013.

[5] S. Dilli Babu, Ch. Ramesh Babu, Ch. D. V. Subba Rao, “An efficient fault-tolerance technique using check-pointing and replication in grids using Data Logs”, Vol. 04, Special Issue 01, 2013.

[6] T. Altameem, “Fault Tolerance Techniques in Grid Computing Systems”, Vol. 4(6), pp. 858-862, 2013.

[7] J. Abawajy, “Fault-Tolerant Scheduling Policy for Grid Computing Systems”, IPDPS'04, Santa Fe, New Mexico, pp. 238-244, 2004.

[8] M. Amoon, “Design of a Fault-Tolerant Scheduling System for Grid Computing”, IEEE Computer Society, pp. 104-108, 2011.

[9] Saritha G., MTech (SE), “Fault tolerant mechanisms for efficient data recovery in grid environment”, IJERA, Vol. 1, Issue 3.

[10] Yulan Yin, Yanhong Zhao, Fengna Dai, “Fault Tolerance Scheduling in Economic Grids”, IEEE, pp. 2252-2256, 2011.

[11] Jasma Balasangameshwara, Nedunchezhian Raju, “A Fault Tolerance Optimal Neighbor Load Balancing Algorithm for Grid Environment”, IEEE, pp. 428-433, 2010.

[12] Paul Townend, Jie Xu, “Fault Tolerance within a Grid Environment”, IEEE, 2009.

[13] K. Srinivasa, G. Siddesh and S. Cherian, “Fault-tolerant middleware for grid computing”, IEEE, pp. 635-640, Sep. 1-3, 2010.

[14] Fufang Li, Deyu Qi, Limin Zhang, Xianguang Zhang, and Zhili Zhang, “Research on Novel Dynamic Resource Management and Job Scheduling in Grid Computing”, IEEE IMSCCS 2006.

[15] Ms. P. Muthuchelvi, Dr. V. Ramachandran, “ABRMAS: Agent Based Resource Management with Alternate Solution”, IEEE, GCC 2007.

[16] Mohammed Rebbah, Yahya Slimani, Abdelkader Benyettou, “A decentralized Fault Tolerant model for Grid Computing”, IJCSI, Vol. 11, Issue 1, No. 2, pp. 123-130, January 2014.

[17] C. Jiang, C. Wang, X. Liu, Y. Zhao, “A Fuzzy Logic Approach for Secure and Fault Tolerant Grid Job Scheduling”, Volume 4610, pp. 549-558, July 11-13, 2007.

[18] Samir Jafar, Axel Krings, Senior Member, IEEE, and Thierry Gautier, “Flexible Rollback Recovery in Dynamic Heterogeneous Grid Computing”, IEEE, vol. 6, no. 1, 2009.