
Scientific Programming 10 (2002) 319–328, IOS Press

Dynamic load balancing of SAMR applications on distributed systems¹

Zhiling Lan (a), Valerie E. Taylor (b) and Greg Bryan (c)

(a) Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616, USA. E-mail: [email protected]
(b) Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL 60208, USA. E-mail: [email protected]
(c) Nuclear and Astrophysics Laboratory, Oxford University, Oxford, OX1 3RH, UK. E-mail: [email protected]

Abstract. Dynamic load balancing (DLB) for parallel systems has been studied extensively; however, DLB for distributed systems is relatively new. To efficiently utilize computing resources provided by distributed systems, an underlying DLB scheme must address both heterogeneous and dynamic features of distributed systems. In this paper, we propose a DLB scheme for Structured Adaptive Mesh Refinement (SAMR) applications on distributed systems. While the proposed scheme can take into consideration (1) the heterogeneity of processors and (2) the heterogeneity and dynamic load of the networks, the focus of this paper is on the latter. The load-balancing processes are divided into two phases: global load balancing and local load balancing. We also provide a heuristic method to evaluate the computational gain and redistribution cost for global redistribution. Experiments show that by using our distributed DLB scheme, the execution time can be reduced by 9%–46% as compared to using a parallel DLB scheme, which does not consider the heterogeneous and dynamic features of distributed systems.

Keywords: Dynamic load balancing, distributed systems, adaptive mesh refinement, heterogeneity, dynamic network loads

1. Introduction

Structured Adaptive Mesh Refinement (SAMR) is a type of multiscale algorithm that dynamically achieves high resolution in localized regions of multidimensional numerical simulations. It shows incredible potential as a means of expanding the tractability of a variety of numerical experiments and has been successfully applied to model multiscale phenomena in a range of disciplines, such as computational fluid dynamics, computational astrophysics, meteorological simulations, structural dynamics, and magnetic and thermal dynamics. A typical SAMR application may require a large amount of computing power. For example, simulation of the first star requires a few days to execute on four processors of an SGI Origin2000 machine; even so, the simulation is not sufficiently resolved, and capturing the remaining unresolved scales would result in an even longer execution time [3]. Simulation of galaxy formation requires more than one day to execute on 128 processors of the SGI Origin2000 and requires more than 10 GB of memory. Distributed systems provide an economical alternative to traditional parallel systems. By using distributed systems, researchers are no longer limited by the computing power of a single site, and are able to execute SAMR applications that require vast computing power (e.g., beyond that available at any single site). A number of national technology grids are being developed to provide access to many compute resources regardless of the location, e.g., the GUSTO

¹ Zhiling Lan is supported by a grant from the National Computational Science Alliance (ACI-9619019), Valerie Taylor is supported in part by an NSF NGS grant (EIA-9974960), and Greg Bryan is supported in part by a NASA Hubble Fellowship grant (HF-01104.01-98A).

ISSN 1058-9244/02/$8.00 © 2002, ACM. Reprinted with permission from Proceedings of ACM Supercomputing 2001, 10–16 November, Denver, CO, USA. ACM portal: www.acm.org.


320 Z. Lan et al. / Dynamic load balancing of SAMR applications on distributed systems

testbed [11], NASA's Information Power Grid [13], and the National Technology Grid [25]. Several research projects, such as Globus [11] and Legion [12], are developing software infrastructures for ease of use of distributed systems.

Execution of SAMR applications on distributed systems involves dynamically distributing the workload among the systems at runtime. A distributed system may consist of heterogeneous machines connected with heterogeneous networks, and the networks may be shared. Therefore, to efficiently utilize the computing resources provided by distributed systems, the underlying dynamic load balancing (DLB) scheme must take into consideration the heterogeneous and dynamic features of distributed systems. DLB schemes have been researched extensively, resulting in a number of proposed approaches [14,7,8,17,21,22,24,26,27]. However, most of these approaches are inadequate for distributed systems. For example, some schemes assume the multiprocessor system to be homogeneous (e.g., all the processors have the same performance and the underlying networks are dedicated and have the same performance). Some schemes consider the system to be heterogeneous in a limited way (e.g., the processors may have different performance but the networks are dedicated). To address the heterogeneity of processors, a widely used mechanism is to assign to each processor a relative weight which measures its relative performance. For example, Elsasser et al. [9] generalize existing diffusive schemes for heterogeneous systems. Their scheme considers the heterogeneity of processors, but does not address the heterogeneity and dynamicity of networks. In [5], a parallel partitioning tool, ParaPART, for distributed systems is proposed. ParaPART takes into consideration the heterogeneity of both processors and networks; however, it is a static scheme and does not address the dynamic features of the networks or the application. Similar to PLUM [21], our scheme also uses some evaluation strategies; however, PLUM addresses issues related to homogeneous systems while our work focuses on heterogeneous systems. KeLP [14] is a system that provides block-structured domain decomposition for SPMD applications. Currently, the focus of KeLP is on distributed-memory parallel computers, with future focus on distributed systems.

In this paper, we propose a dynamic load balancing scheme for distributed systems. This scheme takes into consideration (1) the heterogeneity of processors and (2) the heterogeneity and dynamic load of the networks. Our DLB scheme addresses the heterogeneity of processors by generating a relative performance weight for each processor. When distributing workload among processors, the load is balanced proportionally to these weights. To deal with the heterogeneity of the networks, our scheme divides the load balancing process into a global load balancing phase and a local load balancing phase. The primary objective is to minimize remote communication as well as to efficiently balance the load on the processors. One of the key issues for global redistribution is to decide when such an action should be performed and whether it is advantageous to do so. This decision-making process must be fast and hence based on simple models. In this paper, a heuristic method is proposed to evaluate the computational gain and the redistribution cost for global redistributions. The scheme addresses the dynamic features of networks by adaptively choosing an appropriate action based on the current observation of the traffic on the networks.

While our DLB scheme takes into consideration these two features, the experiments presented in this paper focus on the heterogeneity and dynamic load of the networks, due to the limited availability of distributed-system testbeds. The compute nodes used in the experiments are dedicated to a single application and have the same performance. Experiments show that by using this distributed DLB scheme, the total execution time can be reduced by 9%–46%, and the average improvement is more than 26%, as compared with using a parallel DLB scheme which does not consider the heterogeneous and dynamic features of distributed systems. While the distributed DLB scheme is proposed for SAMR applications, the techniques can be easily extended to other applications executed on distributed systems.

The remainder of this paper is organized as follows. Section 2 introduces the SAMR algorithm and its parallel implementation, the ENZO code. Section 3 identifies the critical issues of executing SAMR applications on distributed systems. Section 4 describes our proposed dynamic load balancing scheme for distributed systems. Section 5 presents the experimental results comparing the performance of this distributed DLB scheme with a parallel DLB scheme which does not consider the heterogeneous and dynamic features of distributed systems. Finally, Section 6 summarizes the paper and identifies our future work.

2. Structured adaptive mesh refinement applications

This section gives an overview of the SAMR method, developed by M. Berger et al., and ENZO, a parallel



Fig. 1. SAMR grid hierarchy (overall structure; hierarchy levels 0–3).

implementation of this method for astrophysical and cosmological applications. Additional details about ENZO and the SAMR method can be found in [2,1,19,3,20].

2.1. Layout of grid hierarchy

SAMR represents the grid hierarchy as a tree of grids at any instant of time. The number of levels, the number of grids, and the locations of the grids change with each adaptation. That is, a uniform mesh covers the entire computational volume, and in regions that require higher resolution, a finer subgrid is added. If a region needs still more resolution, an even finer subgrid is added. This process repeats recursively, with each adaptation resulting in a tree of grids like that shown in Fig. 1 [19]. The top graph in this figure shows the overall structure after several adaptations. The remainder of the figure shows the grid hierarchy for the overall structure, with the dotted regions corresponding to those that underwent further refinement. In this grid hierarchy, there are four levels of grids, going from level 0 to level 3. Throughout execution of an SAMR application, the grid hierarchy changes with each adaptation.
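The tree-of-grids hierarchy described above can be sketched with a minimal data structure. The following is a hypothetical Python illustration (not ENZO's actual implementation); the class and method names are our own:

```python
class Grid:
    """One node in a SAMR grid hierarchy: refining a region of this
    grid creates a child grid at the next finer level."""

    def __init__(self, level, bounds):
        self.level = level          # 0 is the coarsest (root) level
        self.bounds = bounds        # e.g. ((x0, x1), (y0, y1))
        self.children = []          # finer subgrids added by adaptation

    def refine(self, bounds):
        """Add a finer subgrid covering a region that needs more resolution."""
        child = Grid(self.level + 1, bounds)
        self.children.append(child)
        return child

    def num_levels(self):
        """Depth of the hierarchy rooted at this grid (Fig. 1 has four levels)."""
        if not self.children:
            return 1
        return 1 + max(c.num_levels() for c in self.children)

# After each adaptation the tree changes: the number of levels, the
# number of grids, and the grid locations are all dynamic.
root = Grid(0, ((0.0, 1.0), (0.0, 1.0)))
sub = root.refine(((0.2, 0.5), (0.2, 0.5)))
sub.refine(((0.3, 0.4), (0.3, 0.4)))
```

Rebuilding or pruning subtrees as regions gain or lose resolution is what makes the workload dynamic at runtime.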

2.2. Integration execution order

The SAMR integration algorithm goes through the various adaptation levels, advancing each level by an appropriate time step, then recursively advancing to the next finer level at a smaller time step until it reaches the same physical time as that of the current level. Figure 2 illustrates the execution sequence for an application with four levels and a refinement factor of 2. First we start with the first grid on level 0 with time step dt. Then the integration continues with one of the subgrids, found on level 1, with time step dt/2. Next, the integration continues with one of the subgrids on level 2, with time step dt/4, followed by the analysis of the subgrids on level 3 with time step dt/8. The figure illustrates the order in which the subgrids are analyzed with the integration algorithm.
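The recursive order above can be reproduced with a short sketch (hypothetical Python, with level numbers standing in for the actual solver calls). For four levels and a refinement factor of 2 it yields exactly the 15-step sequence of Fig. 2:

```python
def integration_order(level, max_level, refine_factor=2, order=None):
    """Advance `level` by one of its time steps, then recursively take
    `refine_factor` smaller steps at each finer level until that level
    catches up to the current level's physical time."""
    if order is None:
        order = []
    order.append(level)           # one step of size dt / refine_factor**level
    if level < max_level:
        for _ in range(refine_factor):
            integration_order(level + 1, max_level, refine_factor, order)
    return order

# Four levels (0..3), refinement factor 2 -> the order shown in Fig. 2:
# [0, 1, 2, 3, 3, 2, 3, 3, 1, 2, 3, 3, 2, 3, 3]
```

Between two consecutive level 0 steps there are 2^i steps at level i, which is why (as Section 4 exploits) global rebalancing is naturally hooked at level 0 and local rebalancing at the finer levels.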

2.3. ENZO: A parallel implementation of SAMR

Although the SAMR strategy shows incredible potential as a means for simulating multiscale phenomena and has been available for over two decades, it is still not widely used due to the difficulty of implementation. The algorithm is complicated because of the dynamic nature of memory usage, the interactions between different subgrids, and the algorithm itself. ENZO [3] is one



Fig. 2. Integrated execution order (refinement factor = 2).

of the successful parallel implementations of SAMR, which is primarily intended for use in astrophysics and cosmology. It is written in C++ with Fortran routines for computationally intensive sections and MPI functions for message passing among processors. ENZO was developed as a community code and is currently in use on at least six different sites.

In [15,16], a DLB scheme was proposed for SAMR on parallel systems. It was designed for efficient execution on homogeneous systems by considering some unique adaptive characteristics of SAMR applications. In the remainder of this paper, we denote this scheme as the parallel DLB scheme.

3. Issues and motivations

In this section we compare the performance of ENZO executed on a parallel machine with that executed on a distributed system. It is well known that the distributed system will have a larger execution time than the parallel system with the same number of processors because of the performance of the WANs used to interconnect the machines in a distributed system. WANs generally have much larger latency than the interconnects found in parallel machines. However, the comparison is given in this paper to illustrate the amount of overhead introduced by the WAN in the distributed system, which is the focus of this paper. Our DLB scheme attempts to reduce this overhead to make distributed systems more efficient. The experiments use small numbers of processors to illustrate the concepts, but it is assumed that in practice the distributed system will have a large number of processors to provide the needed compute power, beyond that of any single available parallel machine.

The experiment used for the comparison uses the parallel DLB scheme on both the parallel and distributed systems. The parallel system consists of a 250 MHz R10000 SGI Origin2000 machine at Argonne National Lab (ANL); the parallel executable was compiled with SGI's MPI implementation. The distributed system consists of two geographically distributed 250 MHz R10000 SGI Origin2000 machines: one located at ANL and the other located at the National Center for Supercomputing Applications (NCSA). The machines are connected by the MREN network, consisting of ATM OC-3 networks. The distributed executable was compiled with the grid-enabled MPI implementation MPICH-G2 provided by ANL [18]. We used the Globus toolkit [11] to run the application on the distributed system. The experiments used the dataset ShockPool3D, which is described in detail in Section 5.

Five configurations (1 + 1, 2 + 2, 4 + 4, 6 + 6, and 8 + 8) are tested. For the distributed system, the configuration 4 + 4 implies four processors at ANL and four processors at NCSA; for the parallel system this configuration implies eight processors at ANL. The results are given in Fig. 3. For all the configurations, the times for parallel computation and distributed computation are similar, as expected, since the ANL and NCSA processors have the same performance. However, since the distributed system consists of two remotely connected machines and the connection is a shared network, the times for distributed communication are much larger than those for parallel communication. Therefore, in order to get higher performance from distributed systems, the key issues are how to reduce remote communication and how to adaptively adjust to the dynamic features of networks. These results motivate us to design a distributed DLB scheme that considers the heterogeneity of processors and the heterogeneity and dynamic load of the networks.



Fig. 3. Comparison of parallel and distributed execution.

4. Distributed dynamic load balancing scheme

In this section, we present a DLB scheme for SAMR applications on distributed systems. To address the heterogeneity of processors, each processor is assigned a relative weight. To deal with the heterogeneity of networks, the scheme divides the load balancing process into two steps: a global load balancing phase and a local load balancing phase. Further, the proposed scheme addresses the dynamic features of networks by adaptively choosing an appropriate action according to the traffic on them. The details are given in the following subsections.

4.1. Description

First, we define a “group” as a set of processors which have the same performance and share an intraconnected network; a group is a homogeneous system. A group can be a shared-memory parallel computer, a distributed-memory parallel computer, or a cluster of workstations. Communications within a group are referred to as local communications, and those between different groups are remote communications. A distributed system is composed of two or more groups. This terminology is consistent with that given in [6].

Our distributed DLB scheme entails two steps to redistribute the workload: a global load balancing phase and a local load balancing phase, which are described in detail below.

– Global load balancing phase. After each time-step at level 0 only, the scheme evaluates the load distribution among the groups by considering both the heterogeneous and dynamic features of the system. If imbalance is detected, the heuristic method described in the following subsections is invoked to calculate the computational gain of removing the imbalance and the overhead of performing such a load redistribution among groups. If the computational gain is larger than the redistribution overhead, this step is invoked. All the processors are involved in this process, and both global and local communications are considered. Workload is redistributed by considering the number of processors and the processor performance of each group.

– Local load balancing phase. After each time-step at the finer levels, each group performs a balancing process within the group. The parallel DLB scheme mentioned in Section 2.3 is invoked; that is, the workload of each group is evenly and equally distributed among its processors. However, load balancing is only allowed within the group: an overloaded processor can migrate its workload to an underloaded processor of the same group only. In this manner, children grids are always located in the same group as their parent grids; thus no remote communication is needed between parent and children grids. There may be some boundary-information exchange between sibling grids, which usually is very small. During this step, load imbalance may be detected among groups at the finer levels; however, the global balancing process will not be invoked until the next time-step at level 0.
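As a rough illustration of the local phase, the following sketch distributes one group's grids over that group's own processors with a greedy heuristic. This is hypothetical Python of our own devising, not the actual parallel DLB scheme of [15,16], which exploits SAMR-specific characteristics:

```python
def local_balance(grid_loads, n_procs):
    """Greedy sketch of the local phase: assign each grid (heaviest
    first) to the currently least-loaded processor, considering only
    the n_procs processors of ONE group, so grids never leave the
    group and children stay co-located with their parents."""
    procs = [[] for _ in range(n_procs)]   # grids assigned to each processor
    totals = [0.0] * n_procs               # running load per processor
    for load in sorted(grid_loads, reverse=True):
        i = totals.index(min(totals))      # least-loaded processor in the group
        procs[i].append(load)
        totals[i] += load
    return procs, totals
```

For example, grids with loads [5, 4, 3, 2, 1, 1] on a two-processor group end up with a total load of 8 on each processor.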

The flow control of this scheme is shown in Fig. 4. Here, time(i) denotes the iterative time for level i, and dt(i) is the time-step for level i. The left part of the



figure illustrates the global load balancing step, while the right part represents the local load balancing step. Following each time-step dt(0) at level 0, there are several smaller time-steps dt(i) at a finer level i until the finer level i reaches the same physical time as that of level 0. Figure 5 illustrates the executing points of the global balancing and local balancing processes in terms of the execution order given in Fig. 2. It is shown that the local balancing process may be invoked after each smaller time-step, while the global balancing process may be invoked only after each time-step of the top level. Therefore, there are fewer global balancing processes during the run-time as compared to local balancing processes.

4.2. Cost evaluation

To determine whether a global redistribution is invoked, an efficient evaluation model is required to calculate the redistribution cost and the computational gain. The evaluation should be very fast to minimize the overhead imposed by the DLB. Basically, the redistribution cost consists of both communication and computation overhead. The communication overhead includes the time to migrate workload among processors. The computation overhead includes the time to partition the grids at the top level, rebuild the internal data structures, and update boundary conditions.

We propose a heuristic method to evaluate the redistribution cost as follows. First, the scheme checks the load distribution of the system. If imbalance exists, the scheme calculates the amount of load that needs to migrate between groups. In order to adaptively calculate the communication cost, the network performance is modeled by the conventional model, that is, Tcomm = α + β × L. Here Tcomm is the communication time, α is the communication latency, β is the per-byte transfer cost, and L is the data size in bytes. The scheme then sends two messages between groups and calculates the network performance parameters α and β. If the amount of workload that needs to be redistributed is W, the communication cost is α + β × W. This communication model is very simple, so little overhead is introduced.

To estimate the computational cost, the scheme uses history information, that is, it records the computational overhead of the previous iteration. We denote this portion of the cost as δ. Therefore, the total cost for redistribution is:

Cost = (α + β × W) + δ    (1)
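The cost model can be sketched as follows. This is a hypothetical Python illustration: the two-point fit stands in for the paper's two probe messages, and the function names are our own:

```python
def fit_network_model(t1, l1, t2, l2):
    """Fit Tcomm = alpha + beta * L from two timed probe messages
    of sizes l1 < l2 bytes, with measured times t1 and t2 seconds."""
    beta = (t2 - t1) / (l2 - l1)   # per-byte transfer cost
    alpha = t1 - beta * l1         # communication latency
    return alpha, beta

def redistribution_cost(alpha, beta, W, delta):
    """Eq. (1): Cost = (alpha + beta * W) + delta, where W is the
    amount of workload (bytes) to migrate between groups and delta
    is the recorded computational overhead of the previous iteration."""
    return (alpha + beta * W) + delta
```

Because the parameters are re-fitted from fresh probe messages before each decision, the cost estimate tracks the current (shared, dynamic) network conditions.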

4.3. Gain evaluation

SAMR allows local refining and coarsening of the grid hierarchy during the execution to capture the phenomena of interest, resulting in a dynamically changing workload. The total amount of workload at time t may not be the same as that at time t + dt. However, the difference between time steps is usually not very large. Thus, the scheme predicts the computational gain by the following heuristic method. Between two iterations at level 0, the scheme records several pieces of performance data, such as the amount of load each processor has for all levels, the number of iterations for each finer level, and the execution time for one time-step at level 0. Note that there are several iterative steps for each finer level between two iterations at level 0 (see Fig. 5). For each group, the total workload (including all the levels) for one time-step at level 0 is calculated using this recorded data. Then the difference in total workload between groups is estimated. Lastly, the computational gain is estimated by using the difference in total workload and the recorded execution time of one iteration at the top level. The detailed formulas are as follows:

W^i_group(t) = Σ_{proc ∈ group} w^i_proc(t)    (2)

W_group(t) = Σ_{0 ≤ i ≤ maxlevel} W^i_group(t) × N^i_iter(t)    (3)

Gain = T(t) × (max(W_group(t)) − min(W_group(t))) / (Number_Groups × max(W_group(t)))    (4)

Here, Gain denotes the estimated computational gain for global load balancing at time t; w^i_proc(t) is the workload of processor proc at level i at time t; W^i_group(t) is the total amount of load of a group at level i at time t; N^i_iter(t) is the number of iterative steps of level i from time t to t + dt; W_group(t) denotes the total workload of a group at time t; and T(t) is the execution time from time (t − dt) to t, i.e., one time step. Hence, the gain provides a very conservative estimate of the decrease in execution time that will result from the redistribution of load performed by the DLB.
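Given the quantities above, the gain estimate of Eq. (4) reduces to a few lines. In this hypothetical Python sketch, `group_loads` holds W_group(t) for each group, already aggregated over levels and iteration counts per Eqs. (2)–(3), and `T` is the measured time of one level-0 step:

```python
def estimated_gain(group_loads, T):
    """Eq. (4): Gain = T(t) * (max(W) - min(W)) / (Number_Groups * max(W)).
    A conservative estimate of the execution-time decrease that a
    global redistribution would produce."""
    w_max = max(group_loads)
    w_min = min(group_loads)
    return T * (w_max - w_min) / (len(group_loads) * w_max)
```

For instance, with two groups carrying loads 300 and 100 and a 10-second level-0 step, the estimated gain is 10 × 200 / (2 × 300), i.e. roughly 3.3 seconds.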

4.4. Global load redistribution

The global load redistribution is invoked when the computational gain is larger than some factor times the



Fig. 4. Flowchart of distributed DLB: after each level-0 step, global load balancing transfers grids at level 0 between groups when gain > γ × cost; after each finer-level step, local load balancing transfers grids evenly within each group.

Fig. 5. Integrated execution order showing points of load balancing (global balancing after the level 0 step; local balancing after each finer-level step).

redistribution cost, that is, when Gain > γ × Cost. Here, γ is a user-defined parameter (default is 2.0) which identifies how much larger than the cost the computational gain must be for the redistribution to be invoked. A detailed sensitivity analysis of this parameter will be included in our future work. During the global redistribution step, the scheme redistributes the workload by considering the heterogeneity of processors. For example, suppose the total workload is W, which needs to be partitioned between two groups. Group A consists of nA processors, each with performance pA; group B consists of nB processors, each with performance pB. Then the global balancing process will partition the workload into two portions: W × (nA × pA) / (nA × pA + nB × pB) for group A and W × (nB × pB) / (nA × pA + nB × pB) for group B. Basically, this step entails moving the groups' boundaries slightly from underloaded groups to overloaded groups so as to balance the system. Further, only the grids at level 0 are involved in this process; the finer grids do not need to be redistributed. The reason is that the finer grids will be reconstructed completely from the grids at level 0 during the following smaller time-steps.
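The threshold test and the weighted split can be sketched together in a few lines of hypothetical Python (the function names are our own; γ is the paper's user-defined parameter with default 2.0):

```python
GAMMA = 2.0  # default threshold from the paper

def should_rebalance(gain, cost, gamma=GAMMA):
    """Invoke global redistribution only when Gain > gamma * Cost."""
    return gain > gamma * cost

def weighted_partition(W, n_a, p_a, n_b, p_b):
    """Split total level-0 workload W between groups A and B in
    proportion to each group's aggregate compute power n * p."""
    total = n_a * p_a + n_b * p_b
    return W * n_a * p_a / total, W * n_b * p_b / total
```

For example, 120 units of work split between a group of four processors of performance 2.0 and a group of four processors of performance 1.0 go 80 to the first group and 40 to the second.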

Figure 6 shows an example of global redistribution. The top graph shows the overall grid hierarchy at time t. Group A is overloaded because more refinements occur in its local regions. After calculating the gain and the cost of a global redistribution by using the heuristic methods proposed in the previous subsections, the scheme determines that performing a global load



Fig. 6. An example of global redistribution: (a) overall grid hierarchy at time t; (b) level 0 grids at time t; (c) level 0 grids after global redistribution, with workload redistributed from group A to group B.

balancing is advantageous; thus, a global redistribution is invoked. Figure 6(b) shows the level 0 grids at time t, and Fig. 6(c) represents the level 0 grids after the global redistribution process. The boundary between the two groups is moved slightly toward the overloaded group. The amount of level 0 grids in the shaded region, about ((W_A(t) − W_B(t)) / (2 × W_A(t))) × W^0_A(t), is redistributed from the overloaded group A to the underloaded group B.

5. Experimental results

In this section, we present the experimental results on two real SAMR datasets, comparing the proposed distributed DLB with the parallel DLB scheme on distributed systems. Both executables were compiled with the MPICH-G2 library [18], and the Globus toolkit was used to launch the experiments. Since the processors in the systems (described below) have the same performance, the difference in performance between the two DLB schemes reflects the advantage of taking into consideration the heterogeneity and dynamic load of the networks.

One dataset is AMR64 and the other is ShockPool3D. ShockPool3D solves a purely hyperbolic equation, while AMR64 uses a hyperbolic (fluid) equation and an elliptic (Poisson's) equation as well as a set of ordinary differential equations for the particle trajectories. The two datasets have different adaptive behaviors. AMR64 is designed to simulate the formation of a cluster of galaxies, so many grids are randomly distributed across the whole computational domain; ShockPool3D is designed to simulate the movement of a shock wave (i.e., a plane) that is slightly tilted with respect to the edges of the computational domain, so more and more grids are created along the moving shock wave plane.

Two distributed systems are tested in our experiments: one consists of two machines at ANL connected by a local area network (fiber-based Gigabit Ethernet); the other connects machines at ANL and NCSA by a wide area network, MREN, which uses ATM OC-3 networks. Note that both networks are shared and all the machines are 250 MHz SGI Origin2000 systems. AMR64 is tested on the LAN-connected system, and ShockPool3D is tested on the WAN-connected system. For each configuration, the distributed scheme was executed immediately following the parallel scheme so that the two executions would experience similar network environments (e.g., periods of high or low traffic due to sharing of the networks); thus, the performance


Fig. 7. Execution time for AMR64 and ShockPool3D.

Fig. 8. Efficiency for AMR64 and ShockPool3D.

difference shown below is attributed to the difference in the DLB schemes.

Figure 7 compares the total execution times under varying configurations for both datasets. The total execution time using the proposed distributed DLB is greatly reduced compared to the parallel DLB, especially as the number of processors increases. The relative improvements are as follows: for AMR64, the improvement ranges from 9.0% to 45.9% with an average of 29.7%; for ShockPool3D, it ranges from 2.6% to 44.2% with an average of 23.7%.
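The relative improvements quoted above are computed in the standard way, as the fractional reduction in execution time; a minimal sketch (names are illustrative):

```python
def relative_improvement(t_parallel, t_distributed):
    """Fractional reduction in total execution time of the
    distributed DLB scheme versus the parallel DLB scheme."""
    return (t_parallel - t_distributed) / t_parallel


# e.g., a run taking 100 s under parallel DLB and 70 s under
# distributed DLB is a 30% relative improvement.
print(relative_improvement(100.0, 70.0))  # → 0.3
```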

Figure 8 compares the efficiency under varying configurations. Here, efficiency is defined as efficiency = E(1) / (E × P), where E(1) is the sequential execution time on one processor, E is the execution time on the distributed system, and P is the sum of each processor's performance relative to that of the processor used for the sequential execution [4]. In this experiment, all the processors have the same performance, so P is equal to the number of processors. As we can observe, the efficiency achieved with the distributed DLB is significantly higher: for AMR64, the efficiency is improved by 9.9%–84.8%; for ShockPool3D, it is increased by 2.6%–79.4%.
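This efficiency definition translates directly into code (a sketch with illustrative names; the relative performance weights generalize P beyond a simple processor count):

```python
def efficiency(seq_time, dist_time, weights):
    """Efficiency = E(1) / (E * P), where P is the sum of each
    processor's performance weight relative to the processor used
    for the sequential run. With identical processors the weights
    are all 1 and P is just the processor count."""
    p = sum(weights)
    return seq_time / (dist_time * p)


# Four identical processors, sequential run of 100 s, distributed
# run of 25 s: perfect (ideal) efficiency of 1.0.
print(efficiency(100.0, 25.0, [1, 1, 1, 1]))  # → 1.0
```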

6. Summary and future work

In this paper, we proposed a dynamic load balancing scheme for distributed systems. This scheme takes into consideration (1) the heterogeneity of processors and (2) the heterogeneity and dynamic load of networks. To address the heterogeneity of processors, each processor is assigned a relative performance weight; when workload is distributed among processors, the load is apportioned in proportion to these weights. To deal with the heterogeneity of networks, the scheme divides the load balancing process into a global load balancing phase and a local load balancing phase. Further, the scheme addresses the dynamic behavior of networks by adaptively choosing an appropriate action based on observation of network traffic. For global redistribution, a heuristic method was proposed to evaluate the computational gain and the redistribution cost. The experiments illustrate the advantages of our DLB scheme in handling the heterogeneity and dynamic load of the networks: the total execution time can be reduced by 9%–46%, with an average improvement of more than 26%, as compared to using the parallel


DLB scheme, which does not consider the heterogeneous and dynamic features of distributed systems.
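The weight-proportional distribution summarized above can be sketched as follows (a minimal illustration; function and variable names are ours, not from the paper's implementation):

```python
def distribute_load(total_work, weights):
    """Split total_work among processors in proportion to their
    relative performance weights -- the processor-heterogeneity
    part of the scheme. Returns one share per processor."""
    total_weight = sum(weights)
    return [total_work * w / total_weight for w in weights]


# Three processors, the third twice as fast as the others:
# it receives twice the share of the workload.
print(distribute_load(100.0, [1, 1, 2]))  # → [25.0, 25.0, 50.0]
```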

Our future work will focus on including more heterogeneous machines and larger real datasets in our experiments. Further, we will connect the proposed DLB scheme with tools such as the NWS service [28] to obtain a more accurate evaluation of the underlying networks. Lastly, a detailed sensitivity analysis of the parameters used in this distributed DLB scheme will be completed.

References

[1] S. Baden, N. Chrisochoides, M. Norman and D. Gannon, Structured Adaptive Mesh Refinement (SAMR) Grid Methods, IMA Volumes in Mathematics and its Applications, vol. 117, Springer-Verlag, NY, 1999.

[2] M. Berger and P. Colella, Local adaptive mesh refinement for shock hydrodynamics, Journal of Computational Physics 82(1) (1989), 64–84.

[3] G. Bryan, Fluids in the universe: Adaptive mesh refinement in cosmology, Computing in Science and Engineering 1(2) (March/April, 1999), 46–53.

[4] J. Chen, Mesh Partitioning for Distributed Systems, Ph.D. Thesis, Northwestern University, 1999.

[5] J. Chen and V. Taylor, ParaPART: Parallel Mesh Partitioning Tool for Distributed Systems, Concurrency: Practice and Experience 12 (2000), 111–123.

[6] J. Chen and V. Taylor, PART: Mesh Partitioning for Efficient Use of Distributed Systems, to appear in IEEE Transactions on Parallel and Distributed Systems.

[7] G. Cybenko, Dynamic load balancing for distributed memory multiprocessors, Journal of Parallel and Distributed Computing 7 (October, 1989), 279–301.

[8] K. Dragon and J. Gustafson, A low-cost hypercube load balance algorithm, Proc. Fourth Conf. Hypercubes, Concurrent Comput. and Appl., 1989, pp. 583–590.

[9] R. Elsasser, B. Monien and R. Preis, Diffusive Load Balancing Schemes for Heterogeneous Networks, Proc. SPAA'2000, Maine, 2000.

[10] I. Foster and C. Kesselman, The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, California, 1999.

[11] Globus Project Team, Globus Project, World Wide Web, http://www.globus.org, 1996.

[12] A. Grimshaw and W. Wulf, The Legion Vision of a Worldwide Virtual Computer, Communications of the ACM 40(1) (January, 1997).

[13] W. Johnston, D. Gannon and B. Nitzberg, Grids as Production Computing Environments: The Engineering Aspects of NASA's Information Power Grid, IEEE Computer Society Press, 1999.

[14] The KeLP Programming System, World Wide Web, http://www-cse.ucsd.edu/groups/hpcl/scg/kelp.html.

[15] Z. Lan, V. Taylor and G. Bryan, Dynamic Load Balancing for Structured Adaptive Mesh Refinement Applications, Proc. of ICPP'2001, Valencia, Spain, 2001.

[16] Z. Lan, V. Taylor and G. Bryan, Dynamic Load Balancing for Adaptive Mesh Refinement Applications: Improvements and Sensitivity Analysis, Proc. of IASTED PDCS'2001, Anaheim, CA, 2001.

[17] F. Lin and R. Keller, The gradient model load balancing methods, IEEE Transactions on Software Engineering 13(1) (January, 1987), 8–12.

[18] MPICH Project Team, World Wide Web, http://www.niu.edu/mpi/.

[19] H. Neeman, Autonomous Hierarchical Adaptive Mesh Refinement for Multiscale Simulations, Ph.D. Thesis, UIUC, 1996.

[20] M. Norman and G. Bryan, Cosmological adaptive mesh refinement, Computational Astrophysics, 1998.

[21] L. Oliker and R. Biswas, PLUM: Parallel load balancing for adaptive refined meshes, Journal of Parallel and Distributed Computing 47(2) (1997), 109–124.

[22] K. Schloegel, G. Karypis and V. Kumar, Multilevel diffusion schemes for repartitioning of adaptive meshes, Journal of Parallel and Distributed Computing 47(2) (1997), 109–124.

[23] S. Sinha and M. Parashar, Adaptive Runtime Partitioning of AMR Applications on Heterogeneous Clusters, submitted to the 3rd IEEE International Conference on Cluster Computing, Newport Beach, CA, March, 2001.

[24] A. Sohn and H. Simon, JOVE: A dynamic load balancing framework for adaptive computations on an SP-2 distributed multiprocessor, NJIT CIS Technical Report, New Jersey, 1994.

[25] R. Stevens, P. Woodward, T. DeFanti and C. Catlett, From the I-WAY to the National Technology Grid, Communications of the ACM 40(11) (1997), 50–61.

[26] C. Walshaw, JOSTLE: Partitioning of unstructured meshes for massively parallel machines, Proc. Parallel CFD'94, 1994.

[27] M. Willebeek-LeMair and A. Reeves, Strategies for dynamic load balancing on highly parallel computers, IEEE Transactions on Parallel and Distributed Systems 4(9) (September, 1993), 979–993.

[28] R. Wolski, Dynamically Forecasting Network Performance Using the Network Weather Service, Technical Report CS-96-494, U.C. San Diego, 1996.
