Journal of Systems Architecture 52 (2006) 105–116
www.elsevier.com/locate/sysarc
The master–slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping
Francisco Almeida, Daniel González, Luz Marina Moreno *
Dpto. Estadística, I.O. y Computación, La Laguna University, La Laguna, Tenerife 38271, Spain
Received 30 June 2004; accepted 20 October 2004
Available online 23 March 2005
Abstract
We study the master–slave paradigm on heterogeneous systems. Based on an analytical model, we develop a dynamic programming algorithm that solves the optimal mapping for this paradigm. Our proposal considers heterogeneity due to both computation and communication. The optimization strategy used also yields the set of processors for an optimal computation. The computational results show that also considering heterogeneity in the communication increases the performance of the parallel algorithm.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Master–slave; Heterogeneous system; Optimal mapping; Dynamic programming
1. Introduction
Cluster systems are playing an important role in High Performance Computing due to the low cost and availability of commodity PC networks. A typical resource pool comprises heterogeneous commodity PCs and complex servers. The standard parallel libraries allow easy portability of parallel programs to the cluster, which also contributes to the increasing adoption of these architectural solutions. However, a disappointing handicap appears in the behavior of the real running time of the applications. Most programs have been developed under the assumption of a homogeneous target architecture, and this assumption has also been followed by most of the analytical and performance models [1,2]. These models cannot be directly applied to heterogeneous environments.

doi:10.1016/j.sysarc.2004.10.005. This work has been partially supported by the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03-03). * Corresponding author. Fax: +34 922 31 92 02. E-mail addresses: [email protected] (F. Almeida), [email protected] (D. Gonzalez), [email protected] (L.M. Moreno).

In this paper we revisit the mapping problem for master–slave algorithms in this new heterogeneous context. Some authors [3] broach the same problem but consider heterogeneity due only
to the difference in processing speeds of workstations. However, it is a stated fact that the heterogeneity in the speed of workstations can have a significant impact on the communication send/recv overhead [4], even when the communication capabilities of the processors are the same. Our proposals consider heterogeneity due to both computation and communication, which provides a very general overview of the system and a very wide range of applicability.
The master–slave paradigm has been extensively studied and modeled on homogeneous architectures [5–8]. However, the modeling of this paradigm on heterogeneous platforms is still an open question. A very good work that studies the scheduling of parallel loops at compile time in heterogeneous systems was presented in [9]. They cover both homogeneous and heterogeneous systems and regular and irregular computations, giving optimal or near-optimal solutions. An analytical model and a strategy for an optimal distribution of the tasks in a master–slave paradigm have been presented in [10]. However, the accuracy of the predictions with their model is still to be proved. In [3] the authors provide effective models in the framework of heterogeneous computing resources for the master–slave paradigm. They formulate the problem and approach different variants, but they consider heterogeneity due only to differences in the processing speed of the workstations. In [11] a very elegant structural model has been presented to predict the performance of master–slave algorithms on heterogeneous systems. The model is quite general and covers most of the situations appearing in practical problems. The structural model can be viewed as a skeleton where the models for master–slave applications with different implementations or environments must be instantiated. The main drawback of the approach is that the instantiation of the model for the particular cases is not a trivial task, and it does not find an optimal distribution of work among the processors. In [12] we presented an analytical model that predicts the execution time of master–slave algorithms on heterogeneous systems. The model includes heterogeneity due to both computation and communication. Our proposal covers some of the variants studied in [3,11]. The computational results prove the efficiency of our scheme and the accuracy of the predictions.
The scheduling problem for master–slave computations has been extensively studied in many of the aforementioned works; however, most of these works consider the problem of the distribution of work over a fixed set of processors. We claim that, to solve the optimal scheduling, the set of processors must also be considered as a parameter in the minimization problem. The predictive ability of the model that we proposed in [12] constitutes the basis for further studies of the master–slave paradigm. The main contribution of this work is the development of an optimal mapping strategy for master–slave computations based on the analytical model introduced in [12]. In particular, this analytical model allows us to formulate the problem as a resource optimization problem that has been studied in depth [13]. According to this formulation we derive strategies for the optimal assignment of tasks to processors. Our proposal improves on the classical approach of assigning tasks of size proportional to the processing power of each processor, and includes the elements to obtain the set of processors for an optimal computation.
The paper is organized as follows. Section 2 introduces the heterogeneous network model and some notation used throughout the paper. Section 3 specifies the problem and describes an inductive analytical model that predicts the running time of a master–slave program on a heterogeneous cluster of PCs for a given distribution of work. This analytical model is validated in Section 4. The problem of finding the optimal distribution of work among the processors (the mapping problem) is studied in Section 5. Based on the analytical model presented in Section 3, we propose a general approach that formulates the mapping problem as a particular instance of the Resource Allocation Problem (RAP). We solve this optimization problem using a dynamic programming algorithm. The dynamic programming approach also allows us to solve the problem of finding the optimal number of processors. The computational results presented in Section 6 prove that considering heterogeneity in the communication capabilities increases the quality of the predictions.
The paper finishes in Section 7 with some concluding remarks and future lines of work.
2. Heterogeneous network model
A Heterogeneous Network (HN) can be abstracted as a connected graph HN(P, IN) [14] where:

• P = {P1, . . . , Pp} is the set of heterogeneous processors (p is the number of processors). The computation capacity of each workstation is determined by the power of its CPU, I/O and memory access speed.
• IN is a standard interconnection network for the processors, where the interconnection links between any pair of processors may have different bandwidth.
Let M be the problem size, expressed as the number of basic operations needed to solve the problem; let Tsi be the response time of the serial algorithm on processor i; and let Ti be the time elapsed between the beginning and the end of the parallel execution on processor i. The parallel run time Tpar = max_{i=1..p} Ti is the time that elapses from the instant that a parallel computation starts until the last processor finishes the execution [15].
The average Computational Power (CPi) of processor i in a heterogeneous system can be defined [16] as the amount of work finished during a unit time span on that processor. CP depends on the node's physical features but also on the particular algorithm implementation that is actually being processed. The average Computational Power for processor i under a specific workload M can be expressed as CPi = M / Tsi. From a practical point of view, the average Computational Power of a set of processors can be calculated by executing a sequential version of the algorithm under consideration on each of them, using a problem size large enough to prevent estimation errors due to cache memory effects.
Several authors provide definitions for the speedup of a heterogeneous network [17]. We follow the definition Sp = min_{i=1..p} Tsi / Tp, where the speedup is calculated using the fastest processor of the network as the reference processor [18].
The cost of a communication in the target architecture is an important factor to be considered. For the sake of simplicity, it is commonly assumed that the cost of a communication can be modeled by the classical linear model β + τw, where β and τ stand for the latency and the per-word transfer time respectively, and w represents the number of bytes to be sent. This communication cost involves the time for a processor to send data to a neighbor and for it to be received. Although according to the LogP cost model some other parameters should be considered, it is also a commonly accepted fact that, due to current routing techniques and to the small diameter of the network, the inter-processor distance can be neglected [19]. In a heterogeneous network, where the communication capabilities may differ among processors, the parameters β and τ must be computed independently for each pair of processors.
3. The master–slave paradigm
Under the master–slave paradigm it is assumed that the work W, of size m, can be divided into a set of p independent tasks work1, . . . , workp, of arbitrary sizes m1, . . . , mp, with Σ_{i=1..p} mi = m, that can be processed in parallel by the slave processors 1, . . . , p. We abstract the master–slave paradigm by the following code:
/* Master processor */
for (proc = 1; proc <= p; proc++)
    send(proc, work[proc]);
for (proc = 1; proc <= p; proc++)
    receive(proc, result[proc]);

/* Slave processor */
receive(master, work);
compute(work, result);
send(master, result);
The total amount of work W is assumed to be initially located at the master processor, processor p0. The master, according to an assignment policy, divides W into p tasks and distributes worki to
processor i. After the distribution phase, the master processor collects the results from the processors. Processor i computes worki and returns the solution to the master. The master processor and the p slaves are connected through a network, typically a bus, that can be accessed only in exclusive mode. At a given time step, at most one processor can communicate with the master, either to receive data from the master or to send solutions back to it. The results are returned from the slaves in a synchronous round robin.
We will denote by Ci the computing time of task i on processor i. In the context of heterogeneous systems, Ci depends on the size of the task and on the Computational Power of the processor it is assigned to.
Since the heterogeneity is due to differences in speed, both in computation and communication, not only the computing times differ but also the transmission times from/to the master processor may differ across slaves. The transmission time from the master processor to slave processor i will be denoted by Ri = βi + τi wi, where wi stands for the size in bytes associated with worki, and βi and τi represent the latency and the per-word transfer time respectively. This communication cost involves the time for the master to send data to a slave and for it to be received. The latency and transfer time may be different for every master–slave pair and must be calculated separately. The transmission time of the solution from the slave to the master is given by Si = βi + τi si, where si denotes the size in bytes of the solution associated with task worki. Again, this communication cost involves the time for a slave to send the results to the master and for them to be received. The definitions of Ri and Si allow different sizes for the tasks and the results generated. We will denote by Ti the time needed by processor i to finish its processing. Ti includes the time to compute the task, the time to receive it (Ri), the time to send the solution back (Si), and the times for processors j = 1, . . . , i − 1 to receive their tasks, R1, . . . , Ri−1.

Fig. 1. Timing diagram: Ci + Si = Ri+1 + Ci+1. Left—a perfectly load-balanced computation. Right—wrong timing diagram: the computation is well balanced but in the dark area Si cannot overlap Ri.
The master–slave paradigm has been extensively studied and many scheduling techniques have been proposed, both for irregular computations and for heterogeneous architectures. A very good work on this topic can be found in [9]. They cover most of the situations on homogeneous and heterogeneous platforms and regular and irregular computations, but they do not consider scheduling for contention avoidance on heterogeneous platforms. In [10] a scheduling strategy is presented to solve this case; it can be considered a natural extension of the work developed in [9]. They consider that the optimal scheduling is obtained when Ci + Si = Ri+1 + Ci+1 (Fig. 1—Left). The underlying ideas are the overlapping of computation and communication and load balance among the processors. The static schedule is obtained through the resolution of a linear system of p − 1 equations like the former.
However, when the number of processors is large enough, the former strategy may generate non-optimal mappings. In Fig. 1 (Right) we can see that a solution to the former system of equations yields a non-optimal mapping due to communication contention. The former solution assumes that the first processor will finish its computation after the last processor has started to work. Fig. 2 (Left) shows the corrected timing diagram and a corrected mapping where contention avoidance is considered. The wall clock time is reduced using five processors and a different distribution of work (Fig. 2—Right). However, Fig. 3 represents the optimal mapping for the considered instance; it shows that using the best distribution on five processors is slower than the optimal schedule with four processors.

Fig. 2. Left—corrected timing diagram. Right—improved schedule.
Fig. 3. The optimal schedule uses 4 processors.
We claim that to obtain the optimal mapping, not only the size of the problem should be considered; the number of processors is also an important minimization parameter. In [12] we presented an analytical model for heterogeneous systems that considers both parameters to generate the optimal mapping. This analytical model can be considered a particular instantiation of the structural model described in [11]. According to our model, the running time for the whole computation can be obtained as Tpar = Tp, which can be inductively calculated as
    T_p = T_{p−1} + S_p + max(0, d_p)
        = T_{p−2} + S_{p−1} + S_p + max(0, d_{p−1}) + max(0, d_p)
        = T_1 + Σ_{i=2..p} S_i + Σ_{i=2..p} max(0, d_i)
        = R_1 + C_1 + Σ_{i=1..p} S_i + Σ_{i=1..p} max(0, d_i)

where d_1 = 0 and, for i > 1,

    d_i = g_i              if d_{i−1} ≥ 0
    d_i = g_i − |d_{i−1}|  otherwise

with

    g_i = (R_i + C_i) − (C_{i−1} + S_{i−1}).
di holds the accumulated delay between the first i − 1 processors and processor i, and gi stands for the gap between processor i − 1 and processor i. Fig. 4 illustrates the situation with three processors: g3 = (R3 + C3) − (C2 + S2) denotes the gap between slaves 2 and 3; d3 = g3 if d2 ≥ 0, and d3 = g3 − |d2| otherwise, is the accumulated delay; and the total running time is T3 = T2 + S3 + max(0, d3).
The next constraint of the model ensures that no overlap between the task sending and the data collection is produced:

    Σ_{i=2..p} R_i ≤ C_1
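A direct transcription of the inductive model into code might look as follows. The function names and the R, C, S values used in the usage note are illustrative, and the sketch assumes d_1 = 0, matching the derivation above.

```python
# Sketch of the inductive running-time model; R, C, S hold per-slave
# receive, compute and send-back times in the order the slaves are served.
def parallel_time(R, C, S):
    """T_p = R_1 + C_1 + sum_i S_i + sum_i max(0, d_i), built inductively."""
    d = 0.0                              # accumulated delay d_i (d_1 = 0)
    T = R[0] + C[0] + S[0]               # T_1 = R_1 + C_1 + S_1
    for i in range(1, len(R)):
        g = (R[i] + C[i]) - (C[i - 1] + S[i - 1])   # gap g_i
        d = g if d >= 0 else g - abs(d)             # delay recurrence
        T += S[i] + max(0.0, d)          # T_i = T_{i-1} + S_i + max(0, d_i)
    return T

def no_overlap(R, C):
    """Model constraint: sum_{i=2..p} R_i <= C_1."""
    return sum(R[1:]) <= C[0]
```

For two identical slaves with R = [1, 1], C = [3, 3], S = [1, 1], the gap g_2 is zero and the function returns 6, matching a hand-drawn timing diagram for this balanced case.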
Fig. 4. Timing diagram: heterogeneous master–slave, p = 3.
Up—the slowest processor is the second one. Down—the
slowest processor is the last one.
The following sections validate and prove the effectiveness of our proposal. The optimal distribution of work among the set of processors should be such that the analytical complexity function, Tpar = Tp, is minimized. An analytical approach to minimizing this function becomes extremely difficult. Instead of an analytical approach, we make an algorithmic proposal to obtain the minimum.

Table 1
Accuracy of the model: m = 950, m = 1500

p  Configuration        mFast  mSlow  RealTime  Model−Predicted  Error
m = 950
3  F F S                475    475    214.99    221.15           −0.029
3  F F S                634    316    143.02    147.28           −0.0298
5  F F F S S            238    237    109.94    111.18           −0.011
5  F F F S S            317    158    73.53     74.98            −0.020
9  S F F F F S S S S    113    112    45.39     47.45            −0.045
9  S F F F F S S S S    150    75     30.44     32.71            −0.075
m = 1500
3  F F S                750    750    852.29    884.90           −0.038
3  F F S                1000   500    568.41    592.73           −0.043
5  F F F S S            375    375    442.76    445.09           −0.005
5  F F F S S            500    250    294.88    299.47           −0.016
9  S F F F F S S S S    188    187    220.33    224.25           −0.018
9  S F F F F S S S S    250    125    147.64    153.37           −0.039

p is the number of processors and Configuration shows the configuration of the processors selected. mFast represents the work assigned to the fast processors, mSlow represents the work assigned to the slow processors, RealTime shows the experimental time in seconds, and Model−Predicted shows the values obtained with the analytical model for the selected distribution.

4. Validating the analytical model

In this section we develop a computational experiment to validate our analytical complexity function. We consider as test problem a typical academic example used throughout the literature, the master–slave approach to solving a matrix product. The sequential algorithm implemented is the basic O(m^3) one. For the parallel matrix product A · B, we have considered square matrices of size m; the matrix B has been previously broadcast to the whole set of slaves, and mi rows of matrix A are sent to processor i, with Σ_{i=1..p} mi = m.
The computational experiment has been developed on a cluster of PCs. The cluster has two groups of processors. The first is composed of four AMD Duron(tm) 800 MHz PCs with 256 MB of memory; these machines are referred to as fast nodes (F). The second group, the slow nodes (S), comprises six AMD-K6(tm) 500 MHz PCs with 256 MB of memory.
Table 2
Sequential running time ratio Slow/Fast (S/F)
Size 100 150 200 250 300 1800 1900
S/F 2.37 1.92 2.07 1.75 1.89 2.17 1.79
The sequential algorithm has been executed for square matrices of different sizes.
Table 3
Running times of the sequential algorithm on the faster
machine
m 900 1000 1500
Sequential time 71.65 98.78 616.10
All the machines run the Debian Linux operating system. They are connected through a 100 Mbit/s Fast Ethernet switch.
To characterize the Computational Power of the machines, we executed the sequential algorithm on the different nodes. Tables 2 and 3 show the ratios among these executions for several instances. A ratio of 2 is taken as the average value. The heterogeneity of the cluster is clearly stated through the Computational Power. This value is employed in proportional data distribution layouts.
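A proportional layout of this kind can be sketched as follows; the largest-remainder rounding used to keep the sum exactly equal to m is one reasonable choice, not necessarily the one used in the paper.

```python
# Baseline layout: m_i proportional to the Computational Power CP_i,
# with largest-remainder rounding so that the m_i sum exactly to m.
def proportional_distribution(m, powers):
    total = sum(powers)
    shares = [m * cp / total for cp in powers]
    dist = [int(s) for s in shares]          # round every share down
    # hand the leftover units to the largest fractional remainders
    order = sorted(range(len(powers)),
                   key=lambda i: shares[i] - dist[i], reverse=True)
    for i in order[: m - sum(dist)]:
        dist[i] += 1
    return dist
```

For example, with a fast/slow power ratio of 2:1, a workload of 900 rows over one fast and one slow node splits as 600 and 300.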
The analytical model presented in Section 3 uses architecture-dependent parameters such as the computing time (Ci for processor i) and the parameters related to the communication network (latency and bandwidth).
Since Ci depends on the Computational Power of the processors, in order to validate our proposal, a wider execution of the serial matrix product algorithm on the different types of processors was performed. Fig. 5 plots these running times when varying the size of the matrices. Ci is estimated from the sequential execution for several sizes.

Fig. 5. The computing time Ci.
We have also characterized point-to-point communications among the various types of processors in the heterogeneous network. The blocking MPI_Send and MPI_Recv calls have been used in a ping–pong experiment. The linear fit of the sending overhead is plotted as a function of the message size for combinations of the slow and fast nodes of our testbed architecture. Since the parameters are measured using a ping–pong, the values obtained for the Fast–Slow and Slow–Fast combinations are considered equal. Fig. 6 shows the asymptotic effect of the parameters β and τ.
Table 1 collects the computational results of the experiment developed to validate our model. We measure the real time in seconds (RealTime) obtained by an execution of the parallel algorithm, the time predicted by the model (Model−Predicted) and the relative error made by the prediction, for different data distribution layouts using various configurations of fast/slow processors. We have considered configurations involving 2, 4 and 8 slaves. The column labeled Configuration in the table shows the configuration of processors selected: F stands for a fast processor and S for a slow processor. The first processor of the configuration is the master processor. In Table 1, mFast denotes the number of rows assigned to a fast processor and mSlow denotes the number of rows

Fig. 6. Linear fit of the transfer time for point-to-point communication using different types of processors. S = slow processor, F = fast processor.
assigned to a slow processor. Two series of experiments were developed. The first one considers mFast = mSlow. The second one assumes that mi should be proportional to the processing speed and assigns mFast ≈ 2·mSlow, since we have experimentally verified that, for the problem considered, this is the ratio between the processing speeds of the fast and slow processors. The accuracy of the model is confirmed by the very low error made in most of the cases. The relative error is not greater than 7.5%, obtained using eight slaves on the proportional distribution of the problem of size m = 950.
5. An exact method for the optimal mapping
The predictive capabilities of the analytical model presented in Section 3 can be used to predict the sizes of the parameters for an optimal execution, i.e., to predict the optimal amount of work to be assigned to each slave processor to provide the minimum execution time. To be more precise, we have to find sizes m1, . . . , mp for the tasks work1, . . . , workp such that the analytical complexity function Tp reaches its minimum. This section describes an exact method that solves this general optimization problem.
We now formulate the problem of finding the optimal distribution of work as a Resource Allocation Problem, and we solve this resource allocation problem through a dynamic programming algorithm with complexity O(p·m^2).
The classical Resource Allocation Problem [20,13] asks to allocate limited resources to a set of activities so as to maximize their effectiveness. The simplest form of this problem involves only one type of discrete resource. Assume that we are given M units of an indivisible resource and N activities, and that for each activity i the benefit (or cost) function fi(x) gives the income (or penalty) obtained when a quantity x of the resource is allocated to activity i. Mathematically, it can be formulated as

    min  Σ_{i=1..N} f_i(x_i)
    s.t. Σ_{i=1..N} x_i = M  and  a ≤ x_i

where a is the minimum number of resource units that can be allocated to any activity.
The schedule of a master–slave computation can be managed as a Single Resource Allocation Problem. We consider that the units of resource in the former problem are now the sizes of the subproblems, mi, that can be assigned to a processor i, and the activities are represented by the processors, i.e., the number of activities N is equal to the number p of processors, N = p. The assignment of one unit of work to a processor means assigning a task work of size one. The cost functions can be expressed in terms of the analytical complexity function, f1 = T1 and fi = Si + max(0, di), so that Σ_{i=1..p} fi = Tp. The optimization problem then turns into

    min  Tpar = Tp
    s.t. Σ_{i=1..p} mi = m  and  0 ≤ mi
That is, we propose to minimize the total running time of the parallel algorithm, Tpar, on the p-processor machine, ensuring that the total amount of work, Σ_{i=1..p} mi = m, has been distributed among the set of processors. A solution to the former optimization problem will provide an optimal schedule for the master–slave computation. The assignment of a value mi = 0 to processor i implies that including processor i does not reduce the running time of the computation.
Now the dynamic programming strategy to solve the optimization problem is presented. Dynamic programming is an important technique used to solve combinatorial optimization problems. The underlying idea of dynamic programming [20] is to avoid calculating the same thing twice, usually by keeping a table of known results that fills up as the subproblems are solved. Dynamic programming is a bottom-up technique: it usually starts with the smallest, and hence the simplest, subproblems. By combining their solutions, the answers to subproblems of increasing size are obtained until finally the solution of the original instance is computed. Dynamic programming is an algorithm design method that can be used when the solution to a problem can be viewed as the result of a sequence of decisions; the method enumerates all decision sequences and picks out the best.
Let us denote by G[i][x] the minimum running time obtained when using the first i processors to solve a subproblem of size at most x, for i = 1, . . . , p and x = 1, . . . , m. The optimal value for the general master–slave problem is represented by G[p][m], i.e., solving the problem of size m using p processors. The principle of optimality holds for this problem [20,21], and the following recurrence supplies the values for G[i][x]:

    G[i][x] = min{ G[i−1][x−j] + f_i(j) : 0 ≤ j ≤ x },  i = 2, . . . , p
    G[1][x] = f_1(x),  0 < x ≤ m
    G[i][0] = 0,  i = 1, . . . , p
The problem can be solved by a very simple multistage dynamic programming algorithm in time O(p·m^2), which can be easily parallelized following the approach presented in [21], with a complexity O(p + m^2) using p processors. Note that no particular assumption has been made about the type of processor considered. The formulation obtained is general, and the complexity of the dynamic programming algorithm is the same if the heterogeneous system is composed of processors with completely different computational capabilities. To obtain the optimal mapping for a given problem and environment, the user just needs to run this algorithm using as input parameters the set of processors to be considered and the architectural and problem-dependent constants.
An important advantage of the dynamic programming approach is that sensitivity analysis can be applied to the mapping problem, and the effect of allocating a unit of resource to a processor can be analyzed in more depth. The optimal number of processors to be used in an optimal execution can be obtained as a direct application of sensitivity analysis.
The set of processors to be used in the computation is determined by the amount of work assigned to them. Processors receiving an assignment of 0 rows to compute are excluded from the computation. The optimal number of processors, poptimal, can be detected as the number of processors in this set. Assuming that the faster processors should be the ones receiving the highest amount of work, if the set of processors is arranged according to their computational power (a non-increasing arrangement), poptimal is obtained as the first processor such that processor poptimal + 1 does not receive any resource to compute. Note that in this heterogeneous context, to develop a realistic sensitivity analysis, the total set of processors must be considered when computing the dynamic programming table; see Table 6 in Section 6.
6. Computing the optimal mapping
In this section we develop a computational experiment to verify that the assignment strategies proposed by our technique introduce a significant improvement. We contrast the running time obtained when using the distribution of work provided by our dynamic programming procedure against distributions of work that only consider the computational power of the processors.
We use again as test problem the master–slave approach to solving a matrix product. Square matrices of sizes m = 900, 1000, 1500 have been considered. The speedups have been calculated using the basic O(m^3) algorithm executed on the faster machine, even when slow processors are introduced in the parallel computation. The speedup has been calculated as the ratio of the serial run time on the fastest machine to the time taken by the parallel algorithm on the set of processors considered.
A schedule that does not consider heterogeneity
due to differences on the communication capabili-
ties, would consider a distribution of work propor-
tional to the processing speeds of the processors. In
Section 4 has been stated that for thematrix productproblem in our cluster, the Computational Power of
a fast machine is approximately a factor of two
times the Computational Power a slow machine.
Table 4 shows the running time of the parallel
algorithm for the proportional and optimal distri-
bution (columns labeled PTime and OTime respec-
tively) and the execution time of the program that
computed the optimal distribution (column la-beled Schedule Time). Table 5 shows the sizes of
the subproblems assigned to the fast processors
(mFast) and to the slow processors (mSlow) for the
![Page 10: The master–slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping](https://reader035.vdocuments.us/reader035/viewer/2022080102/575020611a28ab877e9a7978/html5/thumbnails/10.jpg)
Table 6
Sensitivity analysis for a small problem m = 100
Processors Distribution Predicted Time
10 Fast mFast = 10 0.03891
10 Fast mFast = 10 0.03891
1 Slow mSlow = 0
10 Fast mFast = 9 0.03869
10 Slow mSlow1¼ mSlow2
¼ 5
mSlowi ¼ 0, i = 3, . . . , 10
Fig. 7. Relative improvements obtained for different sizes of
problems. Data have been extracted from column Relative
Improvement in Table 4.
Table 4
Running times of the parallel algorithm
m PTime OTime Relative
Improvement (%)
Schedule
Time
p = 3 Configuration: F F S
900 116.45 79.27 31.93 0.87
1000 161.43 105.01 34.95 1.08
1500 576.02 448.09 22.21 2.49
p = 5 Configuration: F F F S S
900 60.21 46.81 22.26 2.62
1000 85.54 63.43 25.85 3.23
1500 287.89 217.19 24.56 7.37
p = 7 Configuration: F F F F S S S
900 40.64 34.12 16.04 4.37
1000 57.63 44.81 22.26 5.40
1500 179.38 148.01 17.49 12.32
p = 9 Configuration: S F F F F S S S S
900 29.85 26.69 10.59 6.11
1000 43.43 35.73 17.73 7.55
1500 146.32 109.92 24.88 17.20
114 F. Almeida et al. / Journal of Systems Architecture 52 (2006) 105–116
proportional and optimal distribution (columns la-
beled PSize and OSize respectively).
The percentage of relative improvement
achieved is also presented (Relative Improvement
in Table 4 and Fig. 7) and has been calculated asðPTimeÞ�ðOTimeÞ
ðPTimeÞ � 100. For all the sizes considered,
obviously, the optimal distribution always presents
the better behavior. A relative improvement of
34.95% is observed for the problem m = 1000 using
Table 5
Predicting the parameters for an optimal execution
m PSize OSize
p = 3 Configuration: F F S
900 mFast = 600 mFast = 701
mSlow = 300 mSlow = 199
1000 mFast = 666 mFast = 794
mSlow = 334 mSlow = 206
1500 mFast = 1000 mFast = 1191
mSlow = 500 mSlow = 309
p = 7 Configuration: F F F F S S S
900 mFast = 200 mFast = 216
mSlow = 100 mSlow = 84
1000 mFast = 222 mFast = 216
mSlow = 111 mSlow = 88�88�86
1500 mFast = 334 mFast = 399
mSlow = 166 mSlow = 101
p = 3 processors. Fig. 8 shows the speedup ob-
tained using the two distributions with m =
1000, 1500 for all the configurations of processors.
Note that as an heterogeneous environment is
PSize OSize
p = 5 Configuration: F F F S S
mFast = 300 mFast = 333
mSlow = 150 mSlow = 117
mFast = 334 mFast = 370
mSlow = 166 mSlow = 130
mFast = 500 mFast = 585
mSlow = 250 mSlow = 165
p = 9 Configuration: S F F F F S S S S
mFast = 150 mFast = 158
mSlow = 75 mSlow = 67
mFast = 166 mFast = 181
mSlow = 84 mSlow = 69
mFast = 250 mFast = 295
mSlow = 125 mSlow = 80
![Page 11: The master–slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping](https://reader035.vdocuments.us/reader035/viewer/2022080102/575020611a28ab877e9a7978/html5/thumbnails/11.jpg)
Fig. 8. Speedups obtained with the parallel algorithm for
different sizes of problems and different configurations of
processors.
F. Almeida et al. / Journal of Systems Architecture 52 (2006) 105–116 115
under consideration, on every set of processors fast
and slow machines have been included. This fact
explains the speedups observed, not so impressive
as they use to be for matrix multiplication algo-
rithms in homogeneous architectures. However,the illustration reflects an important increase of
the speedup for the scheduling strategy provided
for the Dynamic Programming algorithm.
Table 6 allows to develop a sensitivity analysis
to determine the optimal number of processors
for a small problem, m = 100. Let us assume that
we have been provided with a set of 10 fast proces-
sors and 10 slow processors. The first row of thetable shows the optimal distribution considering
just the first 10 fast processors, the second row
considers 10 fast processors and 1 slow processor
and the third one considers the whole set of
processors. The table also provides the execution
time predicted by the model for that distributions.
The first situation reflects the homogeneous
case where 10 identical processors are considered,the optimal distribution obviously comes from a
balanced distribution. The second situation con-
siders the fact that to introduce just one slow
processor does not improves the computation,
the slow processor would not receive any amount
of work at all. When considering the total set
of processors provided (the third row) the opti-
mal distribution assigns work just to two slowprocessors.
We have also applied our proposals to the bidi-
mensional Fast Fourier Transform (FFT-2D).
This is a problem that plays an important role in
many scientific and engineering applications. The
FFT-2D exhibits a ratio computation to commu-
nication quite different to the matrix multiplication
problem. The complexity order of the data to be
sent in this case, is the same as the complexity
order of the data to be computed. Using our ap-
proach we tune a master–slave FFT-2D algorithmto be efficiently executed in a heterogeneous cluster
of PCs.
7. Concluding remarks and future work
We have proposed an algorithmic approach to
solve the optimal mapping problem of master–slave algorithms on heterogeneous systems. The
algorithmic approach considers an exact dynamic
programming algorithm based on an inductive
analytical model. The dynamic programming algo-
rithm solves the general resource allocation opti-
mization problem involved into a master–slave
computation. Through the sensitive analysis, the
optimal set of processors can also be determined.There are several challenges and open questions
to broach in the incoming future. Although the
class of master–slave problems considered repre-
sents a wide range of problems, the master–slave
programs where the results are accepted from the
slave as soon as they are available, are not well
modeled with this model. We plan to develop the
same methodology with the aim of covering thisclass of problems also. This methodology can be
introduced into a tool to provide the optimal map-
ping of master–slave algorithms over heteroge-
neous platforms following the algorithmic
strategies proposed.
References
[1] S. Browne, J. Dongarra, N. Garner, G. Ho, P. Mucci, A
portable programming interface for performance evalua-
tion on modern processors, The International Journal of
High Performance Computing Applications 1 (2000) 189–
204.
[2] J. Gonzalez, C. Rodrıguez, G. Rodrıguez, F. de Sande, M.
Printista, A tool for performance modeling of parallel
programs, Scientific Computing (2003) 191–198.
[3] O. Beaumont, A. Legrand, Y. Robert, The master–slave
paradigm with heterogeneous processors, Rapport de
recherche de l� INRIA-Rhone-Alpes RR-4156, 2001, 21 p.
![Page 12: The master–slave paradigm on heterogeneous systems: A dynamic programming approach for the optimal mapping](https://reader035.vdocuments.us/reader035/viewer/2022080102/575020611a28ab877e9a7978/html5/thumbnails/12.jpg)
116 F. Almeida et al. / Journal of Systems Architecture 52 (2006) 105–116
[4] M. Banikazemi, J. Sampathkumar, S. Prabhu, D.K.
Panda, P. Sadayappan, Communication modeling of
heterogeneous networks of workstations for performance
characterization of collective operations, in: International
Workshop on Heterogeneous Computing (HCW�99), in
Conjunction with IPPS�99, 1999, pp. 159–165.[5] G. Jones, M. Goldsmith, Programming in Occam 2,
Prentice Hall, 1988.
[6] P. Hansen, Studies in Computational Science, Prentice
Hall, 1995.
[7] J. Roda, C. Rodrıguez, F. Almeida, D. Morales, Prediction
of parallel algorithms performance on bus based networks
using PVM, in: Sixth Euromicro Workshop on Parallel and
Distributed Processing, 1998, pp. 57–63.
[8] D. Turgay, Y. Paker, Optimal scheduling algorithms for
communication constrained parallel processing, in: Euro-
par2002Lecture Notes in Computer Science, vol. 2400,
Springer-Verlag, 2002, pp. 197–211.
[9] M. Cierniak, M. Zaki, W. Li, Compile-time scheduling
algorithms for heterogeneous network of workstations,
The Computer Journal 40 (60) (1997) 236–239.
[10] M. Drozdowski, P. Wolnievicz, Experiments with sched-
uling divisible tasks in clusters of workstations, in: Euro-
par2000Lecture Notes in Computer Science, vol. 1900,
Springer-Verlag, 2000, pp. 311–319.
[11] J. Schopf, Structural prediction models for high-perfor-
mance distributed applications, in: Cluster Computing
Conference, 1997.
[12] F. Almeida, D. Gonzalez, L. Moreno, C. Rodriguez, J.
Toledo, On the prediction of master–slave algorithms over
heterogeneous clusters, in: Eleventh Euromicro Conference
on Parallel, Distributed and Network-Based Processing,
2003, pp. 433–437.
[13] T. Ibaraki, N. Katoh, Resource Allocation Problems.
Algorithmic Approaches, The MIT Press, 1988.
[14] X. Zhang, Y. Yan, Modeling and characterizing parallel
computing performance on heterogeneous networks of
workstations, in: 7th IEEE Symposium on Parallel and
Distributed Processing (SPDP�95), 1995, pp. 25–34.[15] V. Kumar, A. Grama, A. Gupta, G. Karypis, Introduction
to Parallel Computing Design and Analysis of Algorithms,
The Benjamin/Cummings Publishing Company, 1994.
[16] L. Pastor, J. Bosque, An efficiency and scalability model
for heterogeneous clusters, in: 2001 IEEE International
Conference on Cluster Computing (CLUSTER 2001),
2001, pp. 427–434.
[17] Y. Yan, X. Zhang, Y. Song, An effective and practical
performance prediction model for parallel computing on
nondedicated heterogeneous NOW, Journal of Parallel and
Distributed Computing 41 (2) (1997) 63–80.
[18] V. Donaldson, F. Berman, R. Paturi, Program speedup in
a heterogeneous computing network, Journal of Parallel
and Distributed Computing 21 (3) (1994) 316–322.
[19] D. Culler, K. Richard, D. Patterson, A. Sahay, K.
Schauser, E. Santos, R. Subramonian, T. von Eicken,
LogP: towards a realistic model of parallel computation,
in: 4th ACM SIGPLAN, Sym. Principles and Practice of
Parallel Programming, 1993.
[20] T. Ibaraki, Enumerative approaches to combinatorial
optimization, part ii, Annals of the Operational Research
11.
[21] D. Gonzalez, F. Almeida, J. Roda, C. Rodrıguez, From
the theory to the tools: parallel dynamic programming,
Concurrency: Practice and Experience (2000) 21–34.
Francisco Almeida is an Assistant Pro-
fessor of Computer Science at the
University of La Laguna in Tenerife
(Spain). His research interests lie
primarily in the areas of parallel com-
puting, parallel algorithms for optimi-
zation problems, parallel systems
performance analysis and prediction,
and skeleton tools for parallel pro-
gramming. Almeida received the BS
degree in mathematics and the MS and
PhD degrees in computer sciences from the University of La
Laguna.
Luz Marina Moreno is currently a PhD
student in Computer Science at the
Statistics, Operational Research and
Computer Science department at Uni-
versity of La Laguna. She now is
Assistant Professor at the same depart-
ment. She is mainly interested in heter-
ogeneous systems and in scheduling.