
PARALLEL PROCESSING

From BSP to a virtual von Neumann machine

The bulk synchronous parallel architecture incorporates a scalable and transparent communication model. The task-level synchronisation mechanism of the machine, however, is not transparent to the user and can be inefficient when applied to the co-ordination of irregular parallelism. This article presents a discussion of an alternative memory-level scheme which offers the prospect of achieving both efficient and transparent synchronisation. The scheme, based on a discrete event simulation paradigm, supports a sequential style of programming and, coupled with the BSP communication model, leads to the emergence of a virtual von Neumann parallel computer.

by N. Kalantery, S. C. Winter and D. R. Wilson

A scalable parallel computer consists of a set of computing nodes. Each node (Ni) incorporates a processor-memory pair (Pi, Mi) and can perform processor and/or memory functions. Inter-node communication is achieved through a message passing network (Fig. 1).

In the transition from the von Neumann model of computing to a scalable parallel computing model, two fundamental problems are encountered. First, the partitioning of the physical machine into multiple computing nodes (processor-memory pairs) leads to latency in non-local memory access operations. Latency refers to the time during which a processor is forced to idle as it waits for a response from a remote node of the machine.

Second, the partitioning of the logical machine, i.e. the program, into multiple logical segments leads to the problem of synchronisation. Synchronisation refers to the temporal co-ordination of related activities, such that the intended order of operations on the shared data is maintained.

A general purpose parallel computing model is characterised by the type of solutions it provides to these two basic questions.

Bulk synchronous parallel model

The BSP (bulk synchronous parallel) model¹,² has been put forward as a candidate for a general purpose parallel computer model. BSP provides an automated solution for the first set of problems. The programmer is relieved of the consideration of data locality issues. Data is randomised across the memory nodes to avoid the occurrence of hot spots in the communication network. The router, combined with Valiant's hashing techniques and the use of parallel slack,¹ presents an efficient single address space to the programmer.
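For illustration only, the sketch below shows the kind of randomised placement this implies: a global address is mapped to a home memory node with a hash, so that bursts of accesses to contiguous addresses are spread over the machine. The node count, the fixed hash function and the function names are assumptions of the example, not the BSP machine's actual router.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_NODES 64u                 /* assumed machine size */

    /* A fixed multiplicative hash stands in for the randomly chosen
       universal hash of Valiant's scheme. */
    static uint32_t home_node(uint32_t addr)
    {
        return (addr * 2654435761u) % NUM_NODES;
    }

    int main(void)
    {
        /* Consecutive global addresses scatter across the memory nodes,
           so no single node becomes a hot spot. */
        for (uint32_t a = 0; a < 8; a++)
            printf("global address %u -> memory node %u\n", a, home_node(a));
        return 0;
    }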

However, BSP's solution to the second set of problems does not offer the same automatic facilities at the programming level. The programmer is required to keep close control on the distributed execution of the program.²

Fig. 1 A scalable parallel computer: computing nodes N1, N2, N3, ... connected by a message routing network


Provision of automatic bulk synchronous operation may reduce the burden on the programmer, but she still needs to ensure that, during a given superstep (i.e. between two global synchronisation points), only one program segment is writing to a given variable. The same variable can be read concurrently by many processes, but only in another superstep. This is similar to the PRAM-CREW (concurrent read, exclusive write) formalism. On the other hand, the imposition of synchronous operation on a more powerful and flexible asynchronous machine has a number of performance drawbacks and in certain cases, e.g. computations with dynamic and irregular granularity, will lead to inefficient execution.
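The superstep discipline can be sketched with ordinary threads standing in for the program segments; the barrier plays the role of the global synchronisation point, and the exclusive-write/concurrent-read rule is exactly what the programmer must enforce by hand. This is an illustrative sketch only (POSIX threads and our own names), not the BSP machine's run-time system.

    #include <pthread.h>
    #include <stdio.h>

    #define P 4

    static pthread_barrier_t step_end;    /* the global synchronisation point */
    static int x;                         /* a variable shared by all segments */

    static void *segment(void *arg)
    {
        int id = *(int *)arg;

        /* Superstep 1: only one segment may write x (exclusive write). */
        if (id == 0)
            x = 42;
        pthread_barrier_wait(&step_end);

        /* Superstep 2: every segment may now read x (concurrent read). */
        printf("segment %d reads x = %d\n", id, x);
        pthread_barrier_wait(&step_end);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        int id[P];

        pthread_barrier_init(&step_end, NULL, P);
        for (int i = 0; i < P; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, segment, &id[i]);
        }
        for (int i = 0; i < P; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&step_end);
        return 0;
    }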

Hybrid data flow (HDF) model

Hybrid data flow is another major candidate for a general purpose parallel computing model. The early (pure) data flow model achieved implicit synchronisation by routing the data through an instruction match-and-fetch unit. Computation granularity was fixed at instruction level and could not be increased to reduce communication/synchronisation overheads. Execution of inherently sequential segments of the code was particularly inefficient, as the cost of data-flow synchronisation was incurred without the payback of parallelism. The data flow model required a radical shift from existing parallel computer architectures, which are generally built by interconnection of von Neumann type processor-memory pairs. The data flow programming model also called for a radical departure from established methods. The programmer had to adhere to single assignment rules. When multiple assignments to the same variable were required, considerable programming effort was needed to realise that effect within the constraints of the single assignment model.

Modern hybrid models are built on the principle of incorporating data-driven synchronisation into conventional scalable parallel architectures. The use of I-structure memory in these systems may be regarded as an implicit recognition of the active nature of the memory in scalable parallel computing. Unfortunately, the I-structure memory can only cope with single assignment variables and therefore the programmer is still faced with the same restriction as in the pure data flow model, except that, in the hybrid model, task-level barrier synchronisation can be used to implement multiple assignments.³ However, the use of task-level barriers negates the advantages of implicit and more efficient memory-level synchronisation.

Memory-level synchronisation versus task-level synchronisation

In distributed memory multiprocessors, remote memory is always coupled with a processor. This is necessary to allow the memory to receive, buffer, interpret and respond to messages. Therefore, the physical facilities for the creation of an active memory are already in place. A memory node is no longer a passive storage device. It is often an exact replica of a processor node. Therefore, there is no reason why it could not be programmed to maintain its own coherency. Nevertheless, task-level synchronisation treats remote memory as a passive entity. Processor nodes are required to use execution barriers to maintain coherency of the data at the memory nodes. Apart from the explicit synchronisation associated with this method, task-level techniques also have a number of performance drawbacks:

(i) Memory-level synchronisation allows localised per-datum synchronisation, as compared to the wholesale sequencing of operations imposed by task-level methods. The latter, except in the special case of regular data parallel applications, is bound to over-constrain available parallelism.

(ii) When synchronisation is enforced at the remote memory itself, tasks can asynchronously dispatch READ/WRITE messages, yielding a higher level of concurrent communication. In contrast, task-level barriers solve the ordering problem by preventing tasks from sending a READ/WRITE message until an appropriate synchronisation event is received. This tends to hold back message transmission.

(iii) Context switching represents a significant overhead in multi-threaded execution. In task-level synchronisation, context switching has to be performed at barriers as well as at remote memory READ points, whereas memory-level synchronisation requires a context switch per remote memory READ only. Thus task-level synchronisation results in a higher number of context switches.

(iv) Synchronisation overheads at the memory nodes may increase the latency of remote memory access operations, and hence an extra amount of parallel slack may be required to hide the increased latency (due to synchronisation plus communication latencies). To reduce the ordering workload of the memory nodes and to allow faster response times, data can be distributed amongst a larger number of memory nodes, but a certain increase in communication latency will then result. A more effective solution is to maintain a lower bound on the mean task granularity. This will tend to keep the workload of the remote memory down. A case can be made to show that, by maintaining an appropriate lower bound on mean granularity, one may hide the synchronisation costs altogether.

Virtual von Neumann model

In this article we introduce a memory-level synchronisation scheme, adopted from parallel discrete event simulation (PDES) systems, which promises to provide an efficient and completely automated program control mechanism for parallel and distributed computations. Replacement of the synchronous operation of the BSP with the proposed automatic partial ordering mechanism leads to the emergence of a virtual serial computer (the virtual von Neumann (VvN) machine⁴)*. The VvN model provides a transparent parallel execution environment for sequential programs.

*International patent rights reserved for the University of Westminster.

The programming model, as is customary in conventional sequential computing, requires correct sequential ordering amongst program statements. The sequential style of programming is in essence a very powerful and yet intuitive way of synchronisation specification. Sequential program structures imply a total ordering relation amongst program statements. But, by using per-datum memory-level synchronisation, the total ordering of execution is automatically loosened and localised to the mutual ordering of only those steps which become observable to a common data item. An execution step becomes observable to a data item only when it tries to access or modify that data item. Hence maximum available parallelism is released while correct ordering of interdependent activities is preserved.

The distinguishing feature of the programming model of the virtual serial machine compared to a single processor machine is that the programmer must now indicate how the machine is to partition the program. Otherwise, all aspects of organising communication and synchronisation amongst program segments are carried out by the VvN automatically.

Automatic synchronisation amongst parallel segments is achieved during the execution by adopting a discrete event simulation (DES) paradigm. At the core of the DES paradigm lies the idea of simulated time. The paradigm consists of three main elements: (i) logical clocks, (ii) time stamps of messages and (iii) the active order-preserving memory. Each task process (a thread of computation) owns its own private clock which is initially set and defined at compile time. Clocks are updated throughout the execution of the task. READ/WRITE messages sent by the tasks are 'stamped' with the up-to-date value of the task clock. The self-ordering memory uses the time stamp of the messages to maintain a coherency condition.

The coherency condition requires that the value returned on a READ message must be the value given by the 'latest' WRITE message for that memory location.⁵ Temporal relations such as 'latest', 'before', 'after' etc. are defined in terms of the logical time.

An application program prepared for execution in the proposed environment consists of a set of sub-programs (i.e. code blocks, procedures, sub-routines etc.). A sub-program may have its own private data, which is placed in the local passive memory. Data common to two or more sub-programs is placed in the active memory. Although all sub-programs will be executed concurrently, there is no need for explicit synchronisation. The self-ordering memory guarantees its own coherency by ensuring that operations on a given datum occur in the increasing order of message time stamps. Within the DES context, this requirement is known as the causality constraint, where the time stamps are used to maintain correct ordering amongst simulation events.
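A minimal sketch of such a self-ordering memory cell is given below, assuming integer values and a simple in-core history; message transport, blocking on not-yet-arrived earlier WRITEs and the conservative/optimistic machinery discussed later are all omitted, and the type and function names are ours.

    #include <stdio.h>

    #define MAX_WRITES 64

    /* One active-memory location: a history of time-stamped WRITEs. */
    struct cell {
        long stamp[MAX_WRITES];
        int  value[MAX_WRITES];
        int  n;
    };

    /* Record a WRITE, keeping the history ordered by time stamp
       (WRITE messages may arrive out of time-stamp order). */
    static void write_msg(struct cell *c, long ts, int v)
    {
        int i = c->n++;
        while (i > 0 && c->stamp[i - 1] > ts) {
            c->stamp[i] = c->stamp[i - 1];
            c->value[i] = c->value[i - 1];
            i--;
        }
        c->stamp[i] = ts;
        c->value[i] = v;
    }

    /* Answer a READ stamped ts with the value of the latest WRITE whose
       stamp is earlier than ts -- the coherency condition stated above. */
    static int read_msg(const struct cell *c, long ts)
    {
        int v = 0;                                /* assumed initial value */
        for (int i = 0; i < c->n && c->stamp[i] < ts; i++)
            v = c->value[i];
        return v;
    }

    int main(void)
    {
        struct cell c = { .n = 0 };
        write_msg(&c, 30, 7);                     /* a later iteration's WRITE arrives first */
        write_msg(&c, 10, 3);
        printf("READ stamped 20 -> %d\n", read_msg(&c, 20));   /* sees the WRITE @10 */
        printf("READ stamped 40 -> %d\n", read_msg(&c, 40));   /* sees the WRITE @30 */
        return 0;
    }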

However, utilisation of a PDES paradigm for (non-DES) general purpose computing requires, first of all, a logical time system. Simulated time used in PDES is explicitly provided by the application, i.e. the simulation model. In contrast, general purpose programs have no explicit notion of time. The concept of time is a derivative of the more basic concept of the sequential order in which events occur.⁶ That order is readily given by the sequential style of programming. In the next section we use a model sequential program structure to illustrate how a logical clock system for the given model can be extracted at the compilation stage. The clock system is then used to mark the progress of logical time as execution proceeds.

Parallelisation of the order preservation mechanism of DES programs has led to the development of distributed partial ordering algorithms. Two main strategies for order preservation, i.e. conservative and optimistic approaches, have evolved. A comprehensive survey of PDES algorithms and their performance characteristics is available in Reference 7.


In the section on 'Programming the memory' we offer a brief discussion of the way in which the PDES paradigm is efficiently adopted to achieve an order-preserving active memory for conventional applications.

Extracting the clock system

Programs using a significant amount of computer time are generally constructed in the form of loops, and loops often contain the greatest amount of parallelism within a program. Besides, any non-looping section of the code can be viewed as a loop with only a single iteration. Therefore a general model of sequential programs can be defined in terms of consecutive or nested loops alone.

We can further simplify our presentation by specifying that all our loops are counter loops of the C 'for loop' type. A conditional loop can easily be converted to a counter loop by setting the bound value of the counter to infinity and transferring the convergence condition into the body of the loop, as in Fig. 2. In extracting the clock system, we only need the loop control structures of the program and the inter-relationship amongst the loops. Therefore, to avoid irrelevant detail, we represent a loop by means of a graphical notation. Using this notation, the interrelationship amongst the multiple loops constituting a sequential program can be described.

Fig. 2 A conditional loop converted to a counter loop
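As a concrete illustration of the Fig. 2 transformation, the sketch below rewrites a convergence-controlled loop as a counter loop whose bound is pushed to 'infinity'; the Newton-iteration body and the names step, error and tol are ours, added only to make the fragment self-contained.

    #include <stdio.h>
    #include <math.h>
    #include <limits.h>

    /* Newton iteration for sqrt(2), used only to give the loop a body. */
    static double step(double x)  { return 0.5 * (x + 2.0 / x); }
    static double error(double x) { return fabs(x * x - 2.0); }

    int main(void)
    {
        double x = 1.0, tol = 1e-12;

        /* Conditional form:   while (error(x) > tol) x = step(x);
         * Counter-loop form:  the bound becomes "infinity" and the
         * convergence test becomes a break inside the body, so that the
         * counter I is available to drive the loop's logical clock. */
        long I;
        for (I = 0; I < LONG_MAX; I++) {
            if (error(x) <= tol)
                break;
            x = step(x);
        }
        printf("converged to %.12f after %ld iterations\n", x, I);
        return 0;
    }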

Fig. 3 illustrates an example program structure and the corresponding system of clocks which has been extracted. As the figure indicates, loop I0 encloses all other loops. Loops I2 and I3 are nested in loop I1 but are not nested with respect to each other: we say I3 is consecutive with respect to I2. Loop I4 is enclosed only by loop I0. Loop I5 is in the same nesting level as loops I1 and I4 and is consecutive with respect to them.

Fig. 3 Graphical representation of a model sequential program and corresponding clock structures. In the figure, a node In represents a counter loop of the form: for (In = base; In < bound; In += step) { if (condition) break; loop_body_statements; }, where base, bound and step are integer values

The loop clocks presented in the same figure illustrate a simple way of building and initialising the clock system to reflect the sequential execution order of individual iterations of each loop. In this method, each loop is represented by a distinct clock (in effect each additional loop introduces a further timing axis and hence calls for the creation of a new clock) and the nesting of the loops is reflected in the creation of a hierarchical system of clocks, such that the outermost nesting level takes the most significant position and the innermost nesting level takes the least significant position in the timing system. Note that I0, I1, I2, etc. are loop-counter variables and would assume numerical values once the execution starts.

A clock is represented by two fields. The first field (N) is static and will remain the same throughout the execution. It indicates the precedence relations of consecutive loops within the same nesting level. The second field (C), which is dynamic, represents the value of the loop counter. The advantage of this scheme is that it directly mirrors the program structure. However, it carries redundant information and may benefit from further manipulation, so that the same information could be compressed into a smaller space.

The value of the field N of a given clock is determined at compile time, to indicate the intended order of execution of the associated loop with respect to other loops within the same nesting level. The value of the field C will emerge at run time. As and when program execution causes a loop counter update, the logical clock of the loop will accordingly get updated, and hence the sequential order of the current iteration will be reflected in the value of its clock and in the time stamp of messages resulting from the related computations of the iteration.
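One possible concrete layout of these clocks is sketched below: each nesting level contributes an (N, C) pair, and time stamps are compared lexicographically from the outermost (most significant) level inwards. The struct and function names are ours, and the tie-break that a shorter stamp (an outer-level event) precedes a longer one extending it is an assumption of the sketch rather than something spelt out in the article.

    #include <stdio.h>

    #define MAX_NEST 8

    /* One clock per loop: N is fixed at compile time (order amongst
       consecutive loops at the same nesting level); C is the run-time
       value of the loop counter. */
    struct clock { int N; long C; };

    /* A time stamp is the chain of clocks from the outermost enclosing
       loop (most significant) down to the innermost one. */
    struct stamp { int depth; struct clock c[MAX_NEST]; };

    /* Lexicographic comparison: outer levels dominate inner ones. */
    static int stamp_cmp(const struct stamp *a, const struct stamp *b)
    {
        int d = a->depth < b->depth ? a->depth : b->depth;
        for (int i = 0; i < d; i++) {
            if (a->c[i].N != b->c[i].N) return a->c[i].N < b->c[i].N ? -1 : 1;
            if (a->c[i].C != b->c[i].C) return a->c[i].C < b->c[i].C ? -1 : 1;
        }
        /* Assumption: an outer-level event precedes events of a loop
           nested within that same iteration. */
        return (a->depth > b->depth) - (a->depth < b->depth);
    }

    int main(void)
    {
        struct stamp outer = { 1, { { 0, 3 } } };            /* iteration 3 of an outer loop */
        struct stamp inner = { 2, { { 0, 3 }, { 0, 1 } } };  /* iteration 1 of a loop nested in it */
        printf("%d\n", stamp_cmp(&outer, &inner));           /* prints -1 */
        return 0;
    }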

Once the temporal co-ordinate of the program is extracted, the program can be partitioned and transformed into a multiplicity of parallel processes.


Fig. 4 (a) A single sequential loop (I: 0 to N) spread across (b) p equivalent interleaving loops

A possible way of loop distribution is to assign interleaving iterations to consecutive processes. For example, given p processes, the nth process is assigned iterations n, n+p, n+2p, n+3p and so on. Thus a loop of the form illustrated in Fig. 4a, after distribution across p processes, may be transformed into p loops of the form given in Fig. 4b.
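In code, the Fig. 4 transformation amounts to giving each process a stride-p copy of the loop. In the sketch below the process identifier me and count p would in practice come from the run-time system, and the loop body is a placeholder of ours.

    #include <stdio.h>

    #define N 10

    static void body(long i) { printf("iteration %ld\n", i); }

    /* Original sequential loop (Fig. 4a):
           for (long i = 0; i <= N; i++) body(i);
       Copy executed by process 'me' of 'p' processes (Fig. 4b): it claims
       iterations me, me+p, me+2p, ..., and the local value of i still gives
       the iteration's position in sequential time, i.e. the C field of the
       loop's logical clock. */
    static void my_share(int me, int p)
    {
        for (long i = me; i <= N; i += p)
            body(i);
    }

    int main(void)
    {
        /* The two processes' shares, run here one after the other for illustration. */
        my_share(0, 2);   /* i = 0, 2, 4, 6, 8, 10 */
        my_share(1, 2);   /* i = 1, 3, 5, 7, 9     */
        return 0;
    }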

The end result of this process is transformation of loops into independent processes, where each process carries its simple or nested logical clock initialised as described in the previous paragraphs.

In the simple example of Fig. 5, a sequential code comprising loop I has been partitioned and transformed into two parallel processes P1 and P2. In the original loop, counter I would take all the values from 0 to 10 (i.e. 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10). However, because of the initial values of I and its step values in P1 and P2, the counter values in P1 and P2 will be:

at P1: I = 0, 2, 4, 6, 8, 10
at P2: I = 1, 3, 5, 7, 9

The initial format and value of the logical clocks at both processes will be <0,I>. A message sent by the first iteration at P1 will carry time stamp <0,0>, while a message from the first iteration of P2 will be stamped <0,1>.

Fig. 5 Clock values in two interleaving loops

Programming the memory: preliminary considerations

Discrete event simulation programs are characterised by their dynamic precedence relations. For example, in a DES program, the precedence relation amongst the computations produced within a loop may not be given by the order implied by the loop control construct. Instead, the execution order of a computation is determined by its occurrence time, which may be totally independent from the serial order implied by the program structure. Also, time in DES systems evolves stochastically and the occurrence time of a computation (e.g. a procedure call) cannot be determined a priori. Therefore, current distributed order preservation strategies implemented for PDES systems are tuned to situations where precedence relations cannot be determined until after they have emerged during the execution of the program. Therefore either a continuous run-time scanning of the simulated time (by exchange of null messages) is used to ensure that an ordering error will never happen (the conservative approach) or, instead, to save on the scanning costs, ordering errors are allowed to happen but provisions are made to detect and rectify them as and when they do (the optimistic approach).

The need for a continuous search of the time domain in conservative approaches implies that efficient execution is possible only if the search space is kept small. The notion of lookahead ratio provides a quantitative measure of the search space. Lookahead refers to the lower bound on the time span between two consecutive computational activities. If after each lookahead period a computational activity is found, then we say the lookahead ratio is 1 to 1. But, since the lookahead is the lower bound on the inter-activity period, it may happen that many lookahead periods are checked without encountering a computational activity. Thus we arrive at the notion of the lookahead ratio, which is the ratio of the lookahead to the mean inter-activity period. Conservative synchronisation of an application exhibiting a 1 to 1 lookahead ratio would be achieved with no extra cost at all (compared to the sequential simulation). But as the lookahead ratio is reduced, the synchronisation cost increases, because now more and more search and test operations have to be made before a program activity can be found. A lookahead ratio of zero leads to complete deadlock, because the mechanism can never proceed to the next test point in the temporal search space.
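As a worked illustration (the numbers are ours, not from the article): if the guaranteed lookahead is 2 time units but computational activities arrive on average only every 10 time units, the lookahead ratio is 2/10 = 0.2, so on average five lookahead-sized probes of the time domain (e.g. null-message exchanges) are made for every useful activity found. At a ratio of 1 to 1 every probe finds work; as the ratio approaches zero the number of fruitless probes grows without bound, which is the deadlock limit described above.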

Optimistic synchronisation offers a more efficient and general purpose synchronisation for stochastic time systems. The optimistic methods charge a deterministic state-saving overhead which is generally well compensated for by parallel execution.

Stochastic time systems and dynamic precedence relations, however, seem to be unique to discrete event simulation programs. Most other sequential programs have static precedence relations (a static temporal graph) amongst their computation steps, which can be determined at compile time. This knowledge eliminates the need for the exhaustive search or state-saving costs which are inevitable when parallel execution involves dynamic precedence relations.

Despite their static temporal graph, sequential programs often exhibit dynamic dependency relations. Dependency relations are of a spatial nature. The dependency graph of a program shows which memory location (i.e. a spatial point) will be accessed by a given computation step. The precedence graph, or serial order, of a program shows which computation is to happen before or after another one. A conditional statement within a program may or may not happen, but we can determine a priori, if it happens, in which sequential step this would be. However, dynamic dependency relations imply that we do not know, at compilation time, which memory cell is going to be affected by a given program segment and which program segment is going to be affected by a given memory cell. It is in such cases that compile-time techniques fail to parallelise the program into multiple independent segments.⁸ Parallel realisation of dynamic dependency graphs is in general a non-trivial activity and is at odds with both the single assignment and PRAM-CREW programming models. The main challenge in developing a general purpose order-preserving memory was to achieve efficient run-time resolution of the dependencies without constraining available parallelism.

A common feature of most dynamic dependency relations is that the dependencies are characterised by a known space-time correlation. For instance, in many numerical algorithms, array indices (i.e. space) are a function of the loop iteration counter (i.e. time). This correlation is either characterised by a computable function or is given as a table of some form (usually another array structure). At some point during the execution, the coefficients required for the solution of the function (or the actual values of the table) become available. Dynamic dependencies are resolved, by using this knowledge, at the memory. Hence the exhaustive search and/or state-saving costs associated with PDES programs are avoided.
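A small sketch of the kind of loop in question (the array size, indirection table and names are ours): the table idx[] is only known at run time, so a compiler cannot tell which iterations write to the same element of a[], yet each access carries its iteration's time stamp, which lets the active memory apply conflicting accesses in sequential loop order.

    #include <stdio.h>

    #define N 8

    int main(void)
    {
        double a[N] = { 0.0 };
        /* Indirection table: in a real program it might be read from a file
           or computed earlier, so compile-time analysis cannot resolve it. */
        int idx[N] = { 2, 5, 2, 7, 1, 5, 0, 2 };

        for (int i = 0; i < N; i++) {
            /* Under the proposed scheme each of these accesses would be sent
               to the active memory stamped <0,i>; the memory applies them to
               a[idx[i]] in increasing stamp order, so the sequential result
               is preserved even if the iterations run in parallel. */
            a[idx[i]] += i;
        }

        for (int i = 0; i < N; i++)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }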

Conclusion and future work

Any general purpose parallel computing model needs to address two fundamental questions: (i) latency in remote memory access operations, and (ii) logical ordering of these operations. The BSP computer offers an efficient and automated solution for the first problem. However, the synchronisation mechanism suggested by the BSP model requires close control at the programming level and, when applied to irregular parallelism, can be inefficient.

In this article we have introduced an alternative synchronisation mechanism. We have shown how the sequential structure of a program is used to generate a logical time system which is analogous to simulated time in discrete event simulation.

We have discussed how the active nature of the memory in scalable parallel architectures can be exploited to provide, in much the same way as in parallel discrete event simulation systems, a distributed order-preserving memory. Here READ/WRITE messages constitute the event set of the computation.

We have discussed that space-time correlations in conventional serial programs are well defined and hence the ordering of memory access events can be achieved without the relatively high cost of ordering dynamic stochastic event times in PDES systems. This implies that a reasonable lower bound on computation granularity will hide the costs of order preservation in the memory altogether.

Finally, the article has shown that a combination of the communication model of the BSP computer and the proposed program control mechanism leads to the development of a virtual serial parallel computer (the virtual von Neumann machine). Patents covering the principles of the system architecture are held by the University of Westminster, and interested parties who wish to develop the product range are invited to contact the authors at the address given below.

References

1 VALIANT, L. G.: 'A bridging model for parallel computation', Communications of the ACM, 1990, 33, (8), pp. 103-111
2 McCOLL, W. F.: 'Bulk synchronous parallel computing', 2nd Abstract Machine Workshop, University of Leeds, UK, April 1993
3 NIKHIL, R. S., PAPADOPOULOS, G. M., and ARVIND: '*T: a multithreaded massively parallel architecture'. Proc. 19th Annual International Symposium on Computer Architecture, Gold Coast, Australia, May 1992
4 UK Patent Publication No. GB-2276742-A, Applicant: University of Westminster, Inventor: Nasser Kalantery, Agent: Marks & Clerk, Date of publication: 5th October 1994
5 CENSIER, L. M., and FEAUTRIER, P.: 'A new solution to the coherence problem in multicache systems', IEEE Trans. on Computers, 1978, C-27, (12), pp. 1112-1118
6 LAMPORT, L.: 'Time, clocks and the ordering of events in a distributed system', CACM, 1978, 21, (7), pp. 558-565
7 FUJIMOTO, R. M.: 'Parallel discrete event simulation', CACM, 1990, 33, (10), pp. 30-53
8 BLUME, W., et al.: 'Automatic detection of parallelism: a grand challenge for high-performance computing', IEEE Parallel & Distributed Technology, Fall 1994, pp. 37-47

© IEE 1995. The authors are with the Centre for Parallel Computing, School of Computer Science & Information Systems Engineering, University of Westminster, 115 New Cavendish Street, London W1M 8JS, UK. E-mail: [email protected]; Fax: 0171-911 5143. Prof. Wilson is an IEE Fellow and Dr. Winter is an IEE Member.
