

Chiang Mai J. Sci. 2017; 44(2): 688-698
http://epg.science.cmu.ac.th/ejournal/
Contributed Paper

Design and Analysis of Peer-to-Peer Fault-Tolerance Approach in a Grid Computing System

Thagorn Tangmankhong*, Peerapon Siripongwutikorn and Tiranee Achalakul
Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi, 126 Pracha-Uthit Rd., Bangmod, Thungkhru, Bangkok, 10140, Thailand.
*Author for correspondence; e-mail: [email protected]

Received 9 June 2016
Accepted 11 August 2016

ABSTRACT
A grid computing system allows a large, complex computing task to efficiently utilize high computing resources by splitting the task into many compute processes that are distributed and executed in parallel at many grid nodes. Under such a paradigm, system fault tolerance is a major issue, as the failure of one grid node results in the failure of the whole task. Most fault tolerance techniques for a grid computing system are based on periodic saving of checkpoint data, which is used to roll the system back to the last good operating state when a failure occurs. In this paper, a fault tolerance technique based on peer-to-peer replication of checkpoint data is designed and analyzed. The idea is to allow chunks of checkpoint data to be replicated at different backup nodes to facilitate a faster recovery time in the failure recovery process. The replication time under the peer-to-peer replication procedure is analyzed to obtain proper choices of the chunk size and the backup group size. A significant reduction in the recovery time compared to the traditional client-server approach is also gained by using peer-to-peer replication.

Keywords: fault tolerance, grid computing, peer-to-peer replication, replication time, peer-to-peer fault tolerance

1. INTRODUCTION
A grid computing system is a group of heterogeneous nodes at geographically dispersed sites, which together can provide high performance computing power and large storage space to running applications. For example, an application that involves a large, complex computing task may utilize the computing power of several nodes to finish the task in a much shorter time. The original task is divided into many subtasks or compute processes. These compute processes are then distributed to a group of nodes, referred to as a working group, that collaboratively execute the subtasks to accomplish the desired goal.

Under this grid computing paradigm, node reliability is a major issue because even a single working node failure results in the application failure. The type of failure treated in this paper is the hardware failure, where the node becomes completely dead. Therefore, it is vital that a grid computing system be fault-tolerant, which can be achieved by either forward recovery [1, 2, 3] or backward recovery [4, 5, 6, 7, 8, 9]. In forward recovery, a process in a working node


is duplicated at many backup nodes, and those duplicated processes run in parallel. If the working node fails, the system can restore the process from the good state at one of the backup nodes. This approach results in a fast recovery time but requires high resource consumption as the number of compute processes increases. Therefore, forward recovery is not practical for a large number of compute processes.

Backward recovery, also known as the checkpoint/restart model [6], uses the last good state before the failure occurrence to restore the operation. In this approach, several mechanisms are needed, including Checkpointing, Replication, and Recovery [8, 10, 11, 12, 13-15, 16].

Checkpointing All running processes in the working group are periodically suspended simultaneously, and the current states of those running processes in memory, referred to as checkpoint data, are saved to local hard drives. Checkpoint data is essentially a snapshot of the operation states that can be used in the failure recovery process.

Replication For an application requiring high reliability, checkpoint data may be duplicated from the local disk to backup nodes. The replication time results from the data transfer from a working node to backup nodes over the network.

Failure Recovery When a working node failure occurs, each node rolls back to the checkpoint data to recover from the failure.

The main focus of this work is the replication technique. Traditional replication techniques in a grid computing system are based on a client-server approach, where the checkpoint data of a working node is replicated as a whole copy to one or more backup nodes, depending on the reliability requirement. When a working node fails, its operation state is restored from the checkpoint data at one of the backup nodes. The backup locations can be a local node, a shared file system, or distributed storage over the network [17]. Using a local host for backup may not produce a high tolerance to failure: if the local node fails, the backup data could be damaged, preventing the full recovery of the system states. In order to increase the level of resiliency, backup data should be stored at a central shared file storage instead of a local host. However, if multiple replications are performed, a bottleneck problem may occur at the central system. To alleviate such a problem, a shared file system along with multiple local storages can be utilized for backups. Backup files can be split into smaller pieces and then distributed to multiple nodes [17, 18]. This method is referred to as "distributed checkpointing" or distributed replication. While this method can eliminate the single point of failure, it consumes more bandwidth because sending multiple small files to multiple locations generally creates more network traffic than sending one large file to one location. Variations of the distributed replication method have been implemented on many systems, and many studies have been performed to tackle the overhead problem.

This paper proposes a peer-to-peer fault-tolerance approach for a grid computing system. We substantially extend the work in [19] by refining the protocol details, developing a mathematical analysis of the replication time, and conducting a comprehensive performance evaluation. The main concept is to create a group of compute processes that share backup data in such a way that the replication time and the recovery time become smaller. Detailed procedures for how checkpoint data are distributed among backup compute processes, as well as the operation under failure recovery, are explained. The replication time is also mathematically analyzed to obtain proper choices of the system parameters, and the result is validated against simulation.

The paper is organized as follows. Section 2 presents the system environment, notations, and assumptions on which our work is based, as well as the key components in our proposed


work. It also describes the procedures of group forming, peer-to-peer replication, and failure recovery. The replication time is analyzed in Section 3 under a star topology, the proper choices of system parameter values are identified, and the analytical result is validated against simulation. The conclusion is offered in Section 4.

2. MATERIALS AND METHODS
2.1 Grid Computing Environment
We consider a homogeneous grid computing environment that consists of many grid nodes and one front-end node, acting as the grid proxy, connected via a TCP/IP network. Figure 1 shows how the various software components and compute processes inside a grid site interact. All grid nodes and the front-end node run the Globus Toolkit [20]. The front-end node also runs Condor [21] as a grid scheduler and acts as the grid proxy to which a user submits a job. The system works as follows. After accepting a user job, the grid proxy contacts Grid Resource Allocation and Management (GRAM) within the Globus Toolkit, which is responsible for creating Compute Processes (CPs) based on the submitted job and distributing them to the grid nodes (assuming one CP per node). The CPs of a single user job are said to be in the same working group, denoted by W. Note that the working group W of each user job is determined by Condor, which handles the data dependency among CPs in the task allocation. Since all CPs in a working group belong to the same user job, no irrelevant nodes are involved, which avoids unnecessary energy consumption.

Each CP contains two components: (i) the user application thread and (ii) the peer-to-peer fault-tolerance service. The timing at which these threads execute is depicted in Figure 2. The peer-to-peer fault-tolerance service comprises the following threads:

Group Forming This thread is executed only once at the beginning to create a backup group for each CP.

Checkpointing This thread is executed regularly in every checkpoint interval. It notifies Condor at the front-end node to invoke the checkpointing process for each user application thread via the grid middleware. In each CP, the user application thread is suspended while the operating states are saved to the checkpoint data.

Replication After the checkpointing thread finishes, the replication thread replicates the checkpoint data to the other CPs in its backup group by using the proposed replication mechanism. All data transfers among CPs are carried out over TCP connections.

Fault Detection During normal process execution, the fault detection thread monitors whether one of the CPs in its backup group fails and notifies the grid proxy.

Fault Recovery The grid proxy contacts GRAM to create a new replacement CP to resume the operation of the failed CP. The replacement CP gets a list of backup CPs and retrieves the checkpoint data from them.
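As a rough sketch of how these threads fit together, the following Python skeleton wires up a per-CP fault-tolerance service. The checkpoint, replication, liveness, and failure-reporting callables are hypothetical placeholders for the Condor/GRAM and network interactions described above, not the actual middleware APIs.

```python
import threading

class FaultToleranceService:
    """Minimal sketch of the per-CP peer-to-peer fault-tolerance service."""

    def __init__(self, checkpoint_interval, checkpoint_fn, replicate_fn,
                 peers_alive_fn, report_failure_fn):
        self.checkpoint_interval = checkpoint_interval
        self.checkpoint_fn = checkpoint_fn          # suspend app thread, save its state
        self.replicate_fn = replicate_fn            # push chunks to the backup group
        self.peers_alive_fn = peers_alive_fn        # e.g. ping every backup CP
        self.report_failure_fn = report_failure_fn  # notify the grid proxy
        self._stop = threading.Event()

    def _checkpoint_loop(self):
        # Checkpointing + Replication: run once per checkpoint interval.
        while not self._stop.wait(self.checkpoint_interval):
            data = self.checkpoint_fn()
            self.replicate_fn(data)

    def _fault_detection_loop(self, probe_interval=1.0):
        # Fault detection: monitor backup-group peers between checkpoints.
        while not self._stop.wait(probe_interval):
            for peer, alive in self.peers_alive_fn().items():
                if not alive:
                    self.report_failure_fn(peer)

    def start(self):
        # Group forming is assumed to have been executed once before start().
        threading.Thread(target=self._checkpoint_loop, daemon=True).start()
        threading.Thread(target=self._fault_detection_loop, daemon=True).start()

    def stop(self):
        self._stop.set()
```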

Figure 1. Grid Computing Environment.


Figure 2. Timing diagram for the thread execution in the fault tolerance service.


2.2 Peer-to-Peer Fault Tolerance Approach
This section describes the group forming, replication, and failure recovery procedures, which are the basis of the proposed peer-to-peer fault tolerance approach. Table 1 summarizes the notation used. The set of backup CPs for a given CP is referred to as a backup group. The number of CPs in a backup group is called the backup group size, denoted by B. To balance the resources among CPs, we enforce that each CP can belong to at most B backup groups at any time. CPs in the same backup group are said to be peers. Thus, there are |W| backup groups for each working group, all of which have the same size B, with B < |W|.

Initially, every CP s creates its own backup group Bs by executing the group forming procedure. Then, in each checkpointing interval, the replication procedure is executed to distribute the checkpoint data to the backup CPs. In the context of replication, the CP of interest acts as the source of the checkpoint data, referred to as the source CP. The group forming and checkpoint data replication procedures are explained below. Because all CPs in the working group perform identical operations, both procedures are explained from the viewpoint of a single source CP.

2.2.1 Group forming procedure
The group forming procedure is listed in Figure 3. To form a backup group, a source CP sends Group-Request messages with a unique ID (its compute process ID assigned by Condor) to all other CPs and waits for responses. Since we restrict each CP to join at most B backup groups, only CPs that belong to fewer than B backup groups acknowledge the Group-Request message, together with their IDs. The source CP collects the responses and ranks the responding nodes by their response times as candidates for its backup CPs. The source CP then sends a Member-Request message to B candidate CPs and waits for the responses. The source CP includes each CP that acknowledges its Member-Request message in its backup group. For each Group-Request and Member-Request message sent, the timeout intervals Tg and Tm, respectively, are used to trigger retransmission. In the absence of an acknowledgment from a candidate backup CP after two retries, the source CP chooses the next backup CP in the candidate list and sends it the Member-Request message. Since each CP joins at most B backup groups with B < |W|, each CP always obtains B backup CPs for its backup group.

Table 1. Notations for parameters.

Parameter   Description
W           The set of CPs in the working group, with size |W|
B           Backup group size
Bs          The set of CPs in the backup group of CP s, with size |Bs| = B, B < |W|
Tg          Timeout interval for the Group-Request message
Tm          Timeout interval for the Member-Request message
Rl          Replication level
Nc          Number of chunks in the checkpoint data
D           Checkpoint data size (in MB)
∆           Chunk size (in MB)


Figure 3. Group forming procedure for source CP and backup CPs.


After the source CP receives the acknowledgments for all its Member-Request messages, it sends the backup group member list to all its backup CPs. The list is ranked by the process IDs so that each backup CP knows both the backup group members and its successor in the backup group.
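As an illustration, the group forming logic can be sketched as a small in-memory simulation in Python. The real Group-Request and Member-Request exchanges, the timeouts Tg and Tm, and the retry logic are abstracted away; the randomized response times and the function name are assumptions made only for this sketch.

```python
import random

def form_backup_group(source_id, all_cp_ids, groups_joined, B):
    """Sketch of the group forming procedure from the source CP's viewpoint.

    groups_joined maps each CP id to the number of backup groups it has
    already joined; random response times stand in for real round trips.
    """
    # Step 1: broadcast Group-Request; only CPs in fewer than B groups respond.
    candidates = [cp for cp in all_cp_ids
                  if cp != source_id and groups_joined[cp] < B]

    # Step 2: rank candidates by (simulated) response time.
    response_time = {cp: random.uniform(0.1, 5.0) for cp in candidates}
    candidates.sort(key=lambda cp: response_time[cp])

    # Step 3: send Member-Request to the B fastest candidates.  The text
    # argues that B acceptances always exist because B < |W| and each CP
    # joins at most B groups.
    backup_group = candidates[:B]
    for cp in backup_group:
        groups_joined[cp] += 1

    # Step 4: distribute the member list, ranked by process ID, so every
    # backup CP knows its successor in the group.
    return sorted(backup_group)

# Example: a working group of 4 CPs with backup group size B = 3.
W = [0, 1, 2, 3]
joined = {cp: 0 for cp in W}
print(form_backup_group(0, W, joined, B=3))   # -> [1, 2, 3]
```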

2.2.2 Replication procedure
The replication procedure immediately follows the checkpointing procedure. Each source CP divides the checkpoint data into Nc chunks, where Nc is an integral multiple of B; that is, Nc = M·B for some M ∈ Z+. Each chunk is numbered sequentially.

Denote by Rl the replication level, which is the number of copies of the checkpoint data in addition to the source copy. At the end of the replication procedure, each chunk is stored at Rl backup CPs. This means that the application thread is recoverable if the source CP fails and at most Rl − 1 backup CPs fail.

The replication procedure is illustrated by the following example. Consider a group of four working CPs with three backup CPs per backup group (|W| = 4, B = 3), as shown in Figure 4. Suppose Nc = 9 and Rl = 2. Without loss of generality, denote the source CP of interest by P0 and the backup CPs by P1, P2, and P3, ranked by their process IDs. The replication procedure at the source CP and the backup CPs works as follows:

Source CP: The source CP transfers Nc/B chunks to each of its backup CPs. The transfer is carried out sequentially for chunks sent to the same backup CP, and in parallel for chunks sent to different backup CPs. In the example, P0 sequentially transfers chunks 1, 4, 7 to P1, chunks 2, 5, 8 to P2, and chunks 3, 6, 9 to P3. In general, P0 transfers chunks i + (j − 1)B to backup CP Pi, where i ∈ {1, 2, . . . , B} and j ∈ {1, 2, . . . , M}.

Backup CP: For the chunks received from the source CP, each backup CP is responsible for distributing them to the other Rl − 1 backup CPs. For Rl = 1, a backup CP keeps its chunks to itself. For Rl = 2, a backup CP transfers each chunk to its successor. In general, each backup CP iteratively transfers a received chunk to its successor until Rl copies of the chunk exist at Rl backup CPs. Thus, each backup CP eventually holds exactly M · Rl chunks. In this example with Rl = 2, P1 replicates chunks 1, 4, 7 received from the source CP to P2, and likewise for P2 and P3. At the end of the replication procedure, P1 holds chunks 1, 3, 4, 6, 7, 9, P2 holds chunks 1, 2, 4, 5, 7, 8, and P3 holds chunks 2, 3, 5, 6, 8, 9.

The above replication procedure requires Rl ≤ B. For Rl = B, all backup CPs store the complete checkpoint data. In practice, Rl = 2 is commonly used (data is replicated at two sites), and higher values of Rl are rare.
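The chunk assignment and successor forwarding can be sketched in a few lines of Python. This is only an illustration of the scheme above; the ring-style successor relation (the last backup CP forwarding to the first) is inferred from the worked example, and the printed holdings match those listed in the text.

```python
def replicate(B, Nc, Rl):
    """Sketch of the chunk distribution in Section 2.2.2.

    Backup CPs are numbered 1..B and chunks 1..Nc, with Nc = M*B.  Returns
    the set of chunk numbers each backup CP holds after replication.
    """
    assert Nc % B == 0 and 1 <= Rl <= B
    # Source CP: chunk i + (j - 1)*B is sent to backup CP i.
    received = {i: [i + (j - 1) * B for j in range(1, Nc // B + 1)]
                for i in range(1, B + 1)}
    holdings = {i: set(received[i]) for i in range(1, B + 1)}

    # Backup CPs: forward the received chunks along the successor ring
    # until Rl copies of every chunk exist.
    for i in range(1, B + 1):
        for hop in range(1, Rl):
            successor = (i - 1 + hop) % B + 1
            holdings[successor].update(received[i])
    return holdings

# The example from the text: |W| = 4, B = 3, Nc = 9, Rl = 2.
for p, chunks in sorted(replicate(B=3, Nc=9, Rl=2).items()):
    print(p, sorted(chunks))
# 1 [1, 3, 4, 6, 7, 9]
# 2 [1, 2, 4, 5, 7, 8]
# 3 [2, 3, 5, 6, 8, 9]
```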

2.2.3 Failure recovery procedure
During the user application execution period, every CP monitors the liveliness of its backup CPs (e.g., by periodically issuing the ping command). Consider the case when a particular CP p fails. All CPs that have p as one of their backup CPs detect the failure and send a fault detection message to the grid proxy; hence, many fault detection messages may be received by the grid proxy. After the grid proxy has verified the node failure, it requests GRAM to create a replacement CP at another grid node that is not in the current working group W. Then, the grid proxy notifies the monitoring CPs of the replacement CP.

Figure 4. Replication mechanism: (a) stage 1 replication, (b) stage 2 replication, (c) stage 3 replication.


Because the replacement CP needs to obtain the backup checkpoint data for the recovery, only the monitoring CPs that are in the backup group of p are relevant. Those CPs send the backup group member list to the replacement CP, which in turn requests the chunks from the backup CPs in the list. Once the complete checkpoint data has been acquired, the replacement CP calls the Condor restore function to resume the user application execution thread.
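As a small sanity check of the recovery step, the sketch below uses the chunk holdings from the example in Section 2.2.2 and confirms that, with Rl = 2, a replacement CP can reassemble all chunks from the backup group of the failed CP even if one backup CP has also failed (at most Rl − 1 additional failures).

```python
def recoverable(holdings, Nc, failed_backups=()):
    """Can a replacement CP reassemble all Nc chunks from the surviving
    backup CPs of the failed CP?  (Sketch of the recovery step.)"""
    available = set()
    for cp, chunks in holdings.items():
        if cp not in failed_backups:
            available |= set(chunks)
    return available == set(range(1, Nc + 1))

# Chunk holdings from the example in Section 2.2.2 (B = 3, Nc = 9, Rl = 2).
holdings = {1: {1, 3, 4, 6, 7, 9},
            2: {1, 2, 4, 5, 7, 8},
            3: {2, 3, 5, 6, 8, 9}}
print(recoverable(holdings, 9))                                  # True
print(all(recoverable(holdings, 9, failed_backups=(p,))
          for p in holdings))                                    # True: any single extra failure is tolerated
```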

3. RESULTS AND DISCUSSION
3.1 Analysis of Replication Time
This section analyzes the replication time over a TCP/IP network as a function of the checkpoint data size, the chunk size, and the backup group size. The major difference between our proposed Peer-to-Peer replication approach and the client-server replication approach lies in the replication time, which is the time taken for the checkpoint data of a source CP to be completely replicated at all of its backup CPs. In client-server replication, the whole checkpoint data is transferred from the source CP to the backup CPs, while in Peer-to-Peer replication, different chunks from the source CP are transferred to different backup CPs, and those chunks are also replicated among the backup CPs. Because chunks are transferred over the network, the replication time strongly depends on the network topology of the grid nodes and the flow pattern, i.e., how many flows are on each network link at a given time. To enable tractable analysis, we assume that the grid nodes are interconnected in a star topology as shown in Figure 5. For a more complex topology, the replication time could be determined by manually constructing the flow pattern in the network. The performance on more complex topologies is left for future work.

Figure 5. Network topology used for the analysis of the replication time.

We denote the replication time by T, the checkpoint data size by D (in MB), the chunk size by ∆ (in MB), and the backup group size by n_b (= B). The replication level Rl = 2 is assumed. Recall that in the replication procedure, the checkpoint data is divided into chunks, and individual chunks are then transferred among backup peers over TCP connections. Consequently, the replication time can be calculated based on the TCP connection throughput and on how chunks are transferred among the peers. To analyze the replication time, consider the scenario of one source CP (P0) with three backup CPs (P1, P2, and P3) in Figure 4, with one CP per grid node. The corresponding network topology for this scenario is shown in Figure 5. The source CP divides the checkpoint data into nine chunks: chunks 1, 4, 7 for P1, chunks 2, 5, 8 for P2, and chunks 3, 6, 9 for P3. The replication time can be analyzed in three stages as follows:

1st stage: The source CP transfers the first chunk for each backup CP simultaneously over three TCP flows, as shown in Figure 6(a), which takes a time T_1 to finish. For n_b backup CPs in general, n_b parallel flows exist on the link between the source CP and the switch, and one flow on each of the other links. Because the transfer time is dictated by the link with the largest number of flows, we have

T_1 = ∆ / R(n_b),

where R(n_b) is the per-flow TCP throughput (in Mbps) over a single link carrying n_b flows.

2nd stage: In the second stage, the source CP continues to send the remaining chunks to the backup CPs, while each backup CP starts replicating its completely received chunks to its successor.


As shown in Figure 6(b), while receiving chunk 4 from the source CP, P1 shares chunk 1 with P2. Then, it shares chunk 4 with P2 while receiving chunk 7, and so on. This stage lasts for the amount of time the source CP takes to transfer the remaining chunks to the individual backup CPs. Denote the time taken in the second stage by T_2. With n_b CPs per backup group, the amount of data left to be sent from the source CP to each backup CP is D/n_b − ∆, because the first chunk for each backup CP has already been transferred in the first stage. For each CP i, let N_i = {n_out, n_in} be a tuple representing the number of outgoing parallel sessions to the switch and the number of incoming parallel sessions from the switch. From Figure 6(b), the source CP has N_s = {3, 0}, and all the backup CPs have N_b = {1, 2}. In general, the source CP has N_s = {n_b, 0}, while each backup CP has N_b = {1, 2}, because it sends chunks to its successor while receiving one chunk from the source CP and one chunk from its predecessor. Since there are |W| CPs in the working group, there are |W| backup groups, one for each CP, and each CP belongs to n_b backup groups. Accounting for all backup groups operating simultaneously, the number of parallel flows to and from the switch for each CP i is

N_i = {n_b + Σ_{j=1}^{n_b} 1, Σ_{j=1}^{n_b} 2} = {2n_b, 2n_b}.

Therefore, there exist 2n_b parallel flows between each grid node and the switch in each direction, and it follows that

T_2 = (D/n_b − ∆) / R(2n_b).

3rd stage: The third stage starts when the source CP finishes transferring the remaining chunks to its backup CPs, and each backup CP replicates its last chunk to its successor. As shown in Figure 6(c), the source CP has N_s = {0, 0}, while each backup CP has N_b = {1, 1}, because it sends its last chunk to its successor while receiving one chunk from its predecessor. Therefore, there exist n_b parallel flows between each node and the switch in each direction, and the time taken in this third stage is

T_3 = ∆ / R(n_b).

Combining the replication times of the three stages above, we have

T(D, ∆, n_b) = T_1 + T_2 + T_3
= ∆/R(n_b) + (D/n_b − ∆)/R(2n_b) + ∆/R(n_b)
= 2∆/R(n_b) + (D/n_b − ∆)/R(2n_b),    (1)

where 1 ≤ n_b ≤ min(|W|, ⌊D/∆⌋), so that ∆ ≤ D/n_b.

Figure 6. Flow patterns in different stages of the replication procedure: (a) stage 1 replication, (b) stage 2 replication, (c) stage 3 replication.
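For reference, (1) translates directly into code. The per-flow throughput model R(·) is left as a parameter; the example call uses a purely illustrative equal-sharing model rather than the fitted model derived in Section 3.2.

```python
def replication_time(D, delta, n_b, R):
    """Replication time from Eq. (1): 2*delta/R(n_b) + (D/n_b - delta)/R(2*n_b).

    D is the checkpoint data size, delta the chunk size, n_b the backup group
    size, and R(n) the per-flow TCP throughput on a link carrying n flows.
    Units are assumed consistent (e.g. sizes in Mb and throughput in Mbps).
    """
    return 2 * delta / R(n_b) + (D / n_b - delta) / R(2 * n_b)

# Purely illustrative: a saturated 1 Gbps link shared equally by n flows.
equal_share = lambda n: 1000.0 / n
# 1000 MB checkpoint and 1 MB chunks, expressed in Mb to match Mbps.
print(replication_time(D=8000, delta=8, n_b=4, R=equal_share))  # -> 16.0 (seconds)
```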

3.2 Choice of Backup Group Size
From (1), our goal is to find appropriate values of n_b and ∆ that minimize the replication time for a given checkpoint data size D. To accomplish this, we first determine an expression for the TCP per-flow throughput R(·). The key behavior is that TCP uses an end-to-end congestion control mechanism that keeps increasing the window size every round-trip time until the maximum window size is reached, and flows sharing the same bottleneck link get approximately equal throughputs. The TCP throughput is


proportional to the window size divided by the round-trip time. So, only after the window size reaches its maximum (65,535 bytes) do we obtain the maximum TCP throughput. If the amount of data transferred is too small, the TCP session may finish before reaching its maximum throughput.

The behavior of the TCP per-flow throughput over a single 1 Gbps link is investigated using ns-2 simulation. Figure 7(a) plots the TCP per-flow throughput against the data transfer size (i.e., the chunk size) at different numbers of TCP flows (n_b). For a given number of flows, the TCP per-flow throughput increases with the data transfer size and starts to converge to a constant once the data size goes beyond 1 MB, regardless of the number of parallel sessions. The chunk size ∆ should thus be at least 1 MB to achieve the maximum TCP throughput in the transfer, at which point the TCP throughput becomes independent of the data transfer size. To derive the form of the per-session TCP throughput R(·) as a function of the number of flows, we simulate the TCP throughput per session under a 1 MB data size, as shown in Figure 7(b). The fitted regression model of the throughput curve is

R(n) = α / (β + n) Mbps,    (2)

with α = 776.15 and β = 0.92.

Substituting (2) into (1), for given D and ∆ we have

T(n_b) = c_1 + c_2/n_b,    (3)

where c_1 = (β∆ + 2D)/α and c_2 = Dβ/α.

Another important observation from Figure 7(b) is that the aggregate throughput on the link increases with the number of flows. For example, for two flows the aggregate throughput is about 560 Mbps, while for three flows it is about 645 Mbps. This behavior benefits the recovery time under Peer-to-Peer replication, which is discussed further in Section 3.3.

Figure 7. TCP per-flow throughput under a single 1 Gbps link: (a) per-flow throughput vs. transfer data size at different numbers of flows; (b) per-flow throughput as a function of the number of flows.
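A quick evaluation of the fitted model (2) illustrates this trend. The aggregate values below come from the regression rather than the simulation, so they only approximate the simulated figures of about 560 and 645 Mbps quoted above.

```python
alpha, beta = 776.15, 0.92  # fitted parameters from Eq. (2)

def R(n):
    """Fitted per-flow TCP throughput (Mbps) on a link carrying n flows."""
    return alpha / (beta + n)

for n in range(1, 6):
    print(n, round(R(n), 1), round(n * R(n), 1))  # per-flow and aggregate throughput
# The aggregate n*R(n) grows with n: roughly 532 Mbps for two flows and
# 594 Mbps for three under the fitted model, in the same ballpark as the
# simulated ~560 and ~645 Mbps reported above.
```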

From (3), we see that the replication time is a decreasing function of n_b that reaches 2∆/R(n_b) at n_b = ⌊D/∆⌋. Because a larger backup group size means more resources, we select the value of n_b beyond which T(n_b) = c_1 + c_2/n_b no longer decreases significantly, i.e., the relative reduction falls to some small ratio δ. Therefore, our goal is to find n* such that

[T(n*) − T(n* + 1)] / T(n*) = c_2 (1/n* − 1/(n* + 1)) / (c_1 + c_2/n*) = δ,

which can be rearranged as

δ c_1 (n*)² + δ(c_1 + c_2) n* − (1 − δ) c_2 = 0.    (4)

Note that (4) has two roots with opposite signs. It follows that n* is the positive root of (4), which is given by

n* = [−δ(c_1 + c_2) + √(δ²(c_1 + c_2)² + 4δ(1 − δ) c_1 c_2)] / (2δ c_1),    (5)

and the choice of backup group size is


B = min(⌈n*⌉, |W|, ⌊D/∆⌋).    (6)

As an example, with 1-GB checkpoint data, a 1-MB chunk size (D = 1000 MB and ∆ = 1 MB), and a large working group, we have c_1 = (β∆ + 2D)/α = 2.5780 and c_2 = Dβ/α = 1.1853. For δ = 0.02, the backup group size calculated from (6) is B = 4. Observe from (3) that if ∆ ≪ D, both c_1 and c_2 are approximately proportional to D, so the solution n* of (5) becomes independent of D and ∆. Therefore, the choice of backup group size can be set to four.
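The numbers in this example can be reproduced with a few lines of Python. This is only a numeric check of (3)-(5) under the stated values; the variable names are ours, not from the paper.

```python
import math

alpha, beta = 776.15, 0.92           # fitted TCP throughput parameters, Eq. (2)
D, chunk, ratio = 1000.0, 1.0, 0.02  # D and chunk size in MB, ratio = delta in Eq. (4)

c1 = (beta * chunk + 2 * D) / alpha  # ~2.5780
c2 = D * beta / alpha                # ~1.1853

def T(n):
    """Closed-form replication time from Eq. (3)."""
    return c1 + c2 / n

# Relative reduction in T when the backup group grows from n to n + 1.
for n in range(1, 7):
    print(n, round(T(n), 3), round((T(n) - T(n + 1)) / T(n), 4))

# Positive root of Eq. (4): the point where the marginal reduction equals `ratio`.
a, b, c = ratio * c1, ratio * (c1 + c2), -(1 - ratio) * c2
n_star = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
print(round(n_star, 2))  # about 4.1: gains beyond a backup group size of ~4 are marginal
```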

3.3 Simulation Results
This section evaluates the replication time obtained from the analysis in Section 3.2 against ns-2 simulation. The recovery times under Peer-to-Peer replication and client-server replication are also compared. We simulated a grid environment with ten grid nodes under the star network topology in Figure 5, with 1-Gbps links and an end-to-end propagation delay of 200 microseconds.

Figure 8 shows the replication time as a function of the backup group size (B). The results from the analysis and the simulation are consistent and clearly validate the analysis: as the backup group size increases beyond four, the replication time starts to converge.

For the recovery process, the time to recover the checkpoint data from the backup CPs to a replacement CP is observed after a node failure. Because the replacement CP must first be created by GRAM, the recovery overhead considered here is the time for the backup CPs of the failed CP to send their backup chunks to the replacement CP; we therefore measure the recovery time from the start of the backup transfer. Figure 9 shows the recovery times under Peer-to-Peer replication compared with client-server replication at different backup group sizes. As expected, the recovery time increases linearly with the checkpoint data size. However, the recovery time under Peer-to-Peer replication is always smaller because the aggregate throughput on the link increases with the number of flows, as shown earlier in Figure 7(b). The results suggest that our proposed Peer-to-Peer fault tolerance approach can reduce both the replication time and the recovery time compared with the traditional client-server approach.

Figure 8. Replication time as a function of the backup group size at different checkpoint data sizes (∆ = 1 MB, |W| = 10 nodes).

Figure 9. Comparison of the recovery time.

4. CONCLUSIONS
A peer-to-peer fault tolerance approach for a grid computing system is proposed to reduce the replication time and the recovery time in the backup process. The backup process is performed by the nodes in a working group, which replicate checkpoint data in parallel. The main procedures, including group forming, replication, and failure recovery, are described, and the replication time is analyzed under a basic star topology to obtain suitable parameter values. Our mathematical analysis, validated against simulation, reveals that only a negligible reduction in the replication time is gained once the backup group size exceeds four. Furthermore, the recovery time under the peer-to-peer approach is always better than that of the client-server approach due to the simultaneous transfers of data among compute processes. Our replication time analysis assumes a fixed redundancy level of two and a simple star topology. However, a higher redundancy level may be needed in some cases, and the system may have a more complicated network topology. The study of how our peer-to-peer fault tolerance approach performs under such more generic circumstances is left for future work.

ACKNOWLEDGEMENTS
This work is supported by the Thailand Research Fund and King Mongkut's University of Technology Thonburi through the Royal Golden Jubilee Ph.D. Program under Grant No. PHD/0247/2549.

REFERENCES
[1] Agbaria A. and Friedman R., J. Clust. Comput., 1999; 6: 167-176.
[2] Charoenchaiamornkij P. and Achalakul T., Proceeding of Technology and Innovation for Sustainable Development Conference 2008, 2008.
[3] Lee J., Chapin S. and Taylor S., J. Qual. Reliab. Eng. Int., 2002; 18.
[4] Abewajy J., Proceeding of International Conference on Computational Science and Its Applications, 2004; 107-115.
[5] Budhiraja N., Marzullo K., Schneider F.B. and Toueg S., Proceeding of the 6th International Workshop on Distributed Algorithms, 1993.
[6] Gabriel E., Fagg G., Bukovsky A., Angskun T. and Dongarra J., Proceeding of the 17th Annual ACM International Conference on Supercomputing, 2003.
[7] Ouyang J. and Maheshwari P., Proceedings of the IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, 1996.
[8] Ozaki T., Dohi T. and Okamura H., IEEE T. Depend. Secure Comput., 2006; 2: 130-140.
[9] Zhang X., Zagorodnov D., Hiltunen M., Marzullo K. and Schlichting R., Proceeding of IEEE International Conference on Cluster Computing, 2004; 105-114.
[10] Garg R. and Singh A., J. Comput. Sci. Eng. Survey, 2011; DOI 10.5121/ijcses.2011.2107.
[11] Gottumukkala N., Failure Analysis and Reliability-Aware Resource Allocation of Parallel Applications in High Performance Computing Systems, Ph.D. Thesis, Louisiana Tech University, USA, 2008.
[12] Gottumukkala N., Liu Y., Leangsuksun C., Nassar R. and Scott S., Proceeding of High Availability and Performance Workshop 2006, in conjunction with the Los Alamos Computer Science Institute Symposium, 2006.
[13] Luckow A., J. Future Gener. Comp. Sy., 2008; 24: 142-152.
[14] Naksinehaboon N., Paun M., Nassar R., Leangsuksun C. and Scott S., Int. J. Comput. Commun. Control, 2009; 4: 386-400.
[15] Nandagopal M. and Uthariaraj V., Int. J. Eng. Sci. Technol., 2010; 2: 4361-4372.
[16] Oliner A., Rudolph L. and Sahoo, Proceeding of the 20th Annual International Conference on Supercomputing, 2006; 14-23.
[17] Al-Kiswany S., Ripeanu M., Vazhkudai S. and Gharaibeh A., Proceeding of The 28th International Conference on Distributed Computing Systems 2008, 2008; 613-624.
[18] Song U., Gil J. and Hong S., Checkpoint Sharing-Based Replication Scheme in Desktop Grid Computing, Embedded and Multimedia Computing Technology and Service, Lecture Notes in Electrical Engineering, 2012; 477-484.
[19] Tangmankhong T., Siripongwutikorn P. and Achalakul T., Proceeding of the 9th International Joint Conference on Computer Science and Software Engineering (JCSSE 2012), Bangkok, Thailand, 30 May 2012 - 1 Jun 2012.
[20] The Globus Alliance, About the Globus Toolkit, Available at: http://www.globus.org/.
[21] UW-Madison Research, HTCondor, Available at: http://research.cs.wisc.edu/htcondor/.