
UNIVERSITY OF CALIFORNIA, IRVINE

Service Address Routing for Concurrent Computing

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Information and Computer Science

by

Daniel Jesus Valencia Sanchez

Dissertation Committee:

Professor Isaac D. Scherson, Chair

Professor Alexander V. Viedenbaum

Professor Tony Givargis

2007


© 2007 Daniel Jesus Valencia Sanchez


Dedication

First and foremost, to God. Through Whom all things are made.

To my family, for all the support. Especially my parents, who brought me up to this point, and my wife and son, who are the inspiration and the motivation that drive me. Mil gracias.


Table of Contents

LIST OF FIGURES

LIST OF TABLES

ACKNOWLEDGMENTS

CURRICULUM VITAE

ABSTRACT OF THE DISSERTATION

1 INTRODUCTION
   1.1 Motivation
   1.2 Application Domain
   1.3 Problem Statement
   1.4 Related Work
       1.4.1 Field History and Current Trend
       1.4.2 Jini
       1.4.3 CORBA
       1.4.4 Carmen
       1.4.5 Content-Addressable Networks
       1.4.6 SMiLE
       1.4.7 Cluster Libraries and Daemons
       1.4.8 Distributed Operating Systems
   1.5 Hypotheses
       1.5.1 Requirements
       1.5.2 Proposed Novel Approach

2 SERVICE ADDRESS ROUTING
   2.1 Definition
       2.1.1 SAR and the OSI Model
       2.1.2 Resource Management
   2.2 Proposed Implementation
       2.2.1 A Service-Address Routing Switch
       2.2.2 A Hierarchical SAR Switch using Least Common Ancestor Networks
       2.2.3 Protocols
       2.2.4 Overhead Analysis
       2.2.5 Routing Algorithms
       2.2.6 Knowledge Distribution Techniques
   2.3 A Programming Paradigm for SAR Applications
       2.3.1 A Service-Oriented Programming Model
       2.3.2 SAR in Concurrent Computing
       2.3.3 Example: Matrix Multiplication
       2.3.4 Example: 2D Fast Fourier Transform
       2.3.5 Example: LU-Decomposition
       2.3.6 Service Address Routing Programming Interface
   2.4 Intelligent Networks on Chip
       2.4.1 Overview
       2.4.2 The Service Table (ST)
       2.4.3 The Link Interface (LI)
       2.4.4 The Crossbar

3 EXPERIMENTS AND ANALYSES
   3.1 Rationale
       3.1.1 A Fully-Distributed Service-Addressed Architecture
       3.1.2 SAR Implementation Within the Network Fabric
       3.1.3 The Proposed SAR Programming Paradigm
       3.1.4 The Proposed SAR Switch Implementation
       3.1.5 The Use of Fixed-Length Service Table Caches
       3.1.6 The Use of Limited Queues
       3.1.7 The Use of an LCAN Implementation of Big Switches
   3.2 Analysis of SAR's Programming Model
       3.2.1 TCP Complexity Analysis
       3.2.2 Implementing a Parallel Algorithm Using SAR
       3.2.3 UNIX RPC Implementation
       3.2.4 Java RMI Implementation
       3.2.5 Jini Implementation
       3.2.6 MPI Implementation
       3.2.7 Discussion
   3.3 Experimental Analysis
       3.3.1 Event-Driven SAR Hierarchical Network Simulator
       3.3.2 Methodology
       3.3.3 Results and Discussion
   3.4 Cost/Performance Analysis of SAR LCANs

4 CONCLUSIONS
   4.1 Contributions
   4.2 Future Work

5 PUBLICATIONS THAT RESULTED FROM THIS WORK

BIBLIOGRAPHY


List of Figures

1.1  Distributed Systems Design Space
1.2  A Networked System
1.3  A Distributed System
2.1  Comparison of SAR and Routing based on Node Address
2.2  A two-level 3x2 LCAN
2.3  Overview of a SAR Switch Design
2.4  SAR Protocol Packet Header Fields
2.5  Process of adapting a solution to SAR
2.6  Skeletons of a Remote Process (left) and a Stub (right)
2.7  Matrix Multiplication
2.8  LU-Decomposition
2.9  Correspondence between UNIX sockets and SAR
2.10 Overview of an INoC Switch
2.11 The Service Table
2.12 The Address Generator
2.13 The Link Interface
2.14 The Service Table Interface
2.15 The Switch Interface
2.16 The Link Interface Logic
2.17 The Link Interface Put Together
2.18 The INoC LCAN Switch's Mutilated Crossbar
3.1  The Simulator's Core
3.2  The Simulator's Modular Structure
3.3  The Simulator Implementation's Class Hierarchy
3.4  Results: Caches
3.5  Results: Limited Queues
3.6  Results: Composite
3.7  Results: Cluster Size
3.8  Results: Unbalanced
3.9  Implementation of a full Crossbar
3.10 The INoC LCAN Switch's Mutilated Crossbar


List of Tables

1.1 Feature comparison between different distributed systems
2.1 Type of Packet (ToP) meaningful values
2.2 Fields in a Service Table
3.1 Experiment Description: Caches
3.2 Experiment Description: Composite
3.3 Experiment Description: Cluster Size with Light Workload
3.4 Experiment Description: Unbalanced with Light Workload


Acknowledgements

I would like to give special thanks to my advisor, who has helped me in more ways than I could cover on this page. He has also been my friend, psychologist, teammate and judge, among other roles. I could not have chosen a better one.

Also, I want to acknowledge all the help received from my other groupmates, both those who have left and those who still remain. It is important that I mention John Zapisek, who coined the term "Service Address Routing"; Enrique Cauich, who helped me with debugging and proposed the first initializer module for the simulator; Romain Blavier, who led the way to the service discovery algorithms; Perrine Geib, who provided invaluable help in the development of the SAR Programming Interface; and last, but never least, Richert Wang, who also worked with me during the early stages of this work.


Curriculum Vitae

Education:

• [Ph.D.] University of California, Irvine; Irvine, CA, USA; 2003-2007. GPA: 3.8

Field: Information and Computer Sciences.

Thesis: Service Address Routing for Concurrent Computing.

• [M.S.] University of California, Irvine; Irvine, CA, USA; September 2005. GPA: 3.7

Field: Information and Computer Sciences.

• [M.S.] CICESE; Ensenada, BC, Mexico; 2000-2002. GPA: 87

Field: Computational Sciences.

Thesis: Parallel Rank Processor.

• [B.S.] Kino University; Hermosillo, Sonora, Mexico; 1996-2000. Magna cum laude. GPA: 93.4

Field: Computer Systems Engineering.

Publications:

• Isaac D. Scherson, Daniel S. Valencia. Service Address Routing: A Network Architecture for Tightly Coupled Distributed Computing Systems. Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN), Las Vegas, Nevada, U.S.A., November 2005.

• Isaac D. Scherson, Daniel S. Valencia, Enrique Cauich. Service Discovery for GRID Computing Using LCAN-mapped Hierarchical Directories. Special Issue on "Grid Technology" of The Journal of Supercomputing, in press.

• Isaac D. Scherson, Daniel S. Valencia, Enrique Cauich. Service Address Routing: A Network-Embedded Resource Management Layer for Cluster Computing. Parallel Computing Journal, Elsevier, in press.

• Isaac D. Scherson, Daniel Valencia, Enrique Cauich, John Duselis, Richert Wang. Federated GRID Clusters using Service Address Routed Optical Networks. Future Generation Computer Systems, Elsevier, in press.

• Richert Wang, Enrique Cauich, Daniel Valencia, and Isaac D. Scherson. High Performance Clusters Using NEOS. IEEE Cluster 2007, Austin, TX, U.S.A., September 2007.

Pending Publications:

• Daniel S. Valencia, Isaac D. Scherson. Cost-Performance Analysis of Service-Address-Routed Least-Common-Ancestor Networks. Submitted to The Journal of Interconnection Networks.

• Daniel S. Valencia, Isaac D. Scherson. Intelligent Networks on-Chip Based on Service-Address Routing. Submitted to IEEE Computer Architecture Letters.

• Daniel S. Valencia, Isaac D. Scherson. Performance Analysis of Service-Address-Routed, Tightly-Coupled Computing Clusters. Submitted to Elsevier's Journal on Parallel and Distributed Computing.

• Daniel S. Valencia, Isaac D. Scherson. Programming Service-Address-Routed Cluster Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.

Employment History

• [Summer 2006] Graduate Student Lecturer at UC Irvine: Irvine, California, USA.

Course Taught: Project in Operating Systems.

• [2004-2007] Grader and Teaching Assistant at UC Irvine: Irvine, California, USA.

Various courses

• [2003] Associate Professor at Sonora University: Hermosillo, Sonora, Mexico.

Courses Taught: Distributed and Parallel Computing, Supercomputing, Computer Graphics

• [2003] Associate Professor at Kino University: Hermosillo, Sonora, Mexico.

Course Taught: Computer Graphics

• [1996-1998] Programmer and systems integrator at Megatech: Hermosillo, Sonora, Mexico.

• [1994-2000] Computer systems consultant and programmer: Hermosillo, Sonora, Mexico.


Abstract of the Dissertation

Service Address Routing for Concurrent Computing

By

Daniel Jesus Valencia Sanchez

Doctor of Philosophy in Information and Computer Science

University of California, Irvine, 2007

Professor Isaac D. Scherson, Chair

A novel Interconnection Network Architecture is proposed for integrating tightly coupled Distributed Systems. It allows the implementation of Service Address Routing (SAR): network communications between components are based on the "name" of requested services. It is a client-server communications model, and optionally includes the traditional physical addressing schemes. This enhanced network architecture registers the capabilities of the components attached to its ports and directs service requests to appropriate server nodes. The Network can perform Resource Allocation by placing services in nodes that can support them. Load Distribution is achieved by binding requests to adequate servers with enough free resources. Furthermore, SAR effects a simple, yet efficient and transparent resource discovery mechanism by means of a distributed content-addressable search at each network port. When migration of services is allowed, SAR permits the network to realize more resource management functions, such as load balancing, normally ascribed to higher-level SSI Distributed Operating Systems. As part of this work, SAR is justified, two implementations are proposed at different integration levels, a programming paradigm and an interface based on it are proposed and analyzed, an event-driven hierarchical network simulator is developed and used to analyze and compare the paradigm's behavior, a network protocol set is devised, and Least-Common-Ancestor Networks are used and analyzed as a means to create scalable hierarchical systems.


CHAPTER 1

INTRODUCTION

The main focus of this work is a novel interconnection network architecture for tightly coupled distributed systems. In such systems, components are usually digital data processing nodes that communicate with each other through an interconnection network [7]. The network implements all the necessary data flow for nodes to exchange information as needed by applications. Depending on node complexity, node functionality, speed of the network and the number of nodes that compose the distributed system, a distributed system design space can be defined as a 4D Cartesian space whose dimensions represent these properties (Figure 1.1). Thus, tightly coupled systems are found near the origin, while loosely coupled systems reside in the region further away.

In networked systems, applications are responsible for requesting and managing needed resources [35, 9]. MPI [38, 39] and PVM [4] are examples of layers between the system and the applications. In distributed systems, some of the resource management is made transparent to applications by including a coherent Single System Image (SSI [28]) as part of the system itself. The type of management includes resource allocation, service requests, resource/service discovery, sharing, selection, etc. The trend has been to push that functionality as far down as possible, away from the applications. In later sections we present the application domain and work related to this piece of research, as well as the claims analyzed in the rest of this document. Once the path has been laid down, Chapter 2 defines and studies the proposed solutions, and Chapter 3 presents analytical and experimental analyses.


Figure 1.1: Distributed Systems Design Space

Figure 1.2: A Networked System


1.1 MOTIVATION

In Figure 1.2, we illustrate a traditional networked system as a block diagram, with two nodes shown. As a reference, we include the seven OSI [8] layers next to the components that traditionally perform the associated tasks. The model is divided into three parts, representing the main levels of abstraction for network activities: interconnection, routing and application.

In this model, applications share resources by sending messages amongst themselves. An application's knowledge consists of which physical node is executing it and which formats it needs to use to make communication effective. The OS provides access to the hardware, which is used only to move packets. For all practical purposes, a distributed system is created at the application level; this means that applications have to perform all of the activities related to discovery, sharing and selection of resources. One attempt to extract the underlying abstraction layers from the applications, i.e. separating what the application does from what the system should provide, is the creation of a middleware layer. This layer has evolved into a Distributed Operating System. However, in essence, both are the same. Both consist of a kernel running in each node, and a layer providing a Single System Image (SSI) to the Virtual Machines that support user applications.

Figure 1.3 depicts how a Distributed Operating System, similar to the one we just described, is organized. It also shows how the corresponding layers fit in the OSI model. As is evident in the diagram, no change is made in what the kernel performs (routing) and what is required from the network (to move packets). So, although it is feasible to abstract the presentation and session layers from the applications as well, one fact does not change: all of the management functionality that was abstracted from the applications is still being run by the same processor, consuming cycles that could be used for productive calculations instead.

Thus far, we can distinguish the following OS responsibilities:


Figure 1.3: A Distributed System

1. To create an SSI by providing a platform that transparently spans all nodes.

2. To ensure the system works properly.

3. To address malfunctions.

4. To channel requests to the appropriate node (including self).

5. To manage local hardware properly.

6. To balance the work and storage load among the nodes.

In turn, the Interconnection Network’s responsibilities that can be identified are:

1. To provide simple routing (finding routes for packets).

2. To move packets.

To accomplish effective communications between related processes in distributed systems, interconnection networks have been optimized to handle packet routing at very high speeds and with very low latency in a number of interconnection topologies. As technology advances, networks can handle more than just the traffic control required by the basic function of moving packets between nodes. This additional functionality of the distributed system's interconnection network is the main contribution of this work. A novel routing strategy that permits the implementation of a number of resource management functions using a very simple but elegant paradigm is introduced. The idea is to "hide" from the application the physical addresses of related nodes, while allowing service requests to reach their destinations without application intervention. By doing this, a novel approach to Parallel and Distributed Operating Systems development is proposed. It is based on embedding complex resource management operations in the cluster switch and reducing the delays caused by current TCP/IP-based routing mechanisms. This system will enable more efficient usage of the computing power residing in high performance computing clusters. The basic ideas presented here can be seen in other large systems such as CORBA, and implemented to some extent in Jini and Content-Addressable Networks. In this work, they have been enhanced and applied to High Performance, Tightly Coupled Clusters of Computers instead of the more loosely coupled Grid. The contribution of this work is a novel routing mechanism for cluster interconnection networks (switches) dubbed Service Address Routing, or SAR for short. It allows communications between modules of a distributed system in a manner that is location independent, which in turn allows efficient and fast resource management to be performed within the switch itself rather than in the node's middleware.

Putting SAR at the network and transport layers would automatically set TCP/IP aside, so that it can be used only by the applications that really need it. In this manner, nodes can make use of their valuable CPU cycles for what they were meant for: processing scientific applications.
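
To make the contrast concrete, the short sketch below compares a conventional node-addressed request with a service-addressed one. The names SarSocket and requestService are hypothetical placeholders (the actual SAR programming interface and its correspondence to UNIX sockets are presented in Chapter 2); the only point illustrated is that the second client names a service, never a host.

    import java.io.ObjectOutputStream;
    import java.net.Socket;

    // Illustrative only: contrasts node addressing with service addressing.
    // "SarSocket" and "requestService" are hypothetical names standing in for
    // the SAR programming interface described later in this dissertation.
    public class AddressingContrast {

        // Traditional node-addressed request: the client must know which host
        // runs the server and on which port it listens.
        static void nodeAddressedRequest(double[][] a, double[][] b) throws Exception {
            try (Socket s = new Socket("compute-node-17.cluster", 5000);
                 ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream())) {
                out.writeObject("matrix-multiply");
                out.writeObject(a);
                out.writeObject(b);
            }
        }

        // Service-addressed request: the client only names the service; the
        // network (the SAR switch) binds it to a suitable, sufficiently idle node.
        static void serviceAddressedRequest(double[][] a, double[][] b) {
            SarSocket s = new SarSocket();                 // hypothetical API
            s.requestService("matrix-multiply", a, b);     // no host, no port
        }

        // Placeholder for the (hypothetical) SAR client interface.
        static class SarSocket {
            void requestService(String serviceName, Object... args) {
                // In a real SAR system this would hand the request to the
                // network fabric, which resolves the service name to a node.
                throw new UnsupportedOperationException("illustrative stub");
            }
        }
    }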

1.2 APPLICATION DOMAIN

Referring back to Figure 1.1, one can notice a number of different implementations of one very simple idea: to put together a set of components in a way that they can talk to and understand each other. The basic principle of distributed systems is to do it in such a way that the user will only see a single, coherent system [2]. The medium used by those components to communicate is the Interconnection Network. Regardless of the system scale, and of the components chosen for its integration, there is an Interconnection Network in charge of relaying communications between nodes.

A new routing paradigm is proposed and studied in this work which allows all components of a distributed system, with the exception of the Interconnection Network, to be oblivious of the resource management tasks involved in the proper operation of the system. Taking into account that the Interconnection Network is a vital component of a distributed system of any scale, it is natural to understand that the application domain of this paradigm includes all distributed systems. Chapter 2 shows examples of systems of different scales and how they can benefit from this new paradigm. Though this list is not meant to be exhaustive, it serves to exemplify this fact.

It is also worth mentioning that a routing paradigm is a set of guidelines used by an Interconnection Network to make communications reach their corresponding destinations. Thus, it belongs to an abstraction level beyond implementation. For this reason, it is fair to state that what is proposed in this work does not necessarily belong to a specific OSI layer; instead, it can be implemented at any level of the OSI stack.

1.3 PROBLEM STATEMENT

What we have seen is that the nodes are performing tasks that do not help the completion (or correctness, for that matter) of the user processes that are taking place, and which are the purpose of the System. Instead, they are "wasting" time and power by performing distributed resource management tasks. This becomes evident as we parse the diagram of any distributed system and identify the piece of hardware that performs these tasks. Not surprisingly, anything from the network layer upwards is performed by the main processor(s) in each node.

It is irrelevant which scale we are talking about; every system is designed in that way: the interconnection network moves packets or creates circuits, and the nodes do all processing related to user and system tasks. For all practical purposes, we are going to speak of the OS when referring to the entity that is in charge of managing resources, whether implemented in software or as part of the hardware. Traditional OS responsibilities include:

• Managing local hardware properly.

• Creating an adequate application environment.

This is a somewhat shorter list than the one presented in section 1.1. The nodes are thus overloaded, which should be avoided, since processing cycles are precious and as many as possible should be devoted to application tasks. What we want is a streamlined, optimized kernel that will not selfishly use resources that should be dedicated to the applications, to problem-solving activities. This overhead can be categorized in several different ways. For instance, we can use the following:

• Constant overhead; e.g., deciding if a service is available locally, migrating a process;

• Scalable overhead; e.g., looking for a remote service, assuring cache/copy coherence.

This categorization helps us identify drawbacks of traditional distributed system design: distributed resource management tasks being performed on every node means the overhead is effectively multiplied by the number of nodes in the system, and the extra functionality implies a larger code base, which in turn implies a more error-prone system.

                        1        2        3        4        5        6        7        8
    Load Dist.          Yes      Yes      Yes      No       No       Some     Yes      Yes
    Dyn. Load Bal.      No       No       No       No       No       Some     Some     Yes
    Dyn. Svc. Alloc.    No       No       No       No       No       Some     No       Yes
    OSI Layer           7        7        7        7        2+       7        4+       3+
    Grid Support        Yes      Yes      Yes      No       No       No       No       Yes
    I2N                 No       No       No       No       No       No       No       Yes
    Node State          Partial  Partial  Partial  Partial  Partial  Varies   Varies   None
    SSI                 No       No       No       Yes      Yes      Some     Yes      Yes
    Prog. Language      Java     None     Java     None     None     Varies   Varies   None

Table 1.1: Feature comparison between different types of distributed systems. The columns are, in order: 1) Jini, 2) CORBA, 3) Carmen, 4) CANs, 5) SMiLE, 6) Cluster Libraries, 7) Distributed Operating Systems, and 8) SAR.

The desired solution to such problems would be a system in which the OS manages local resources and grants access to them to the applications according to its own policies, and those issues that cannot be handled locally are relayed to the network. In such a system, service discovery and load allocation should be transparent to the user applications, even to requesting nodes, and the OS should incur less overhead, since remote resource management tasks, and hence their corresponding overhead, are reduced to just forwarding a service request to the network.
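
A minimal sketch of this division of labor follows, assuming a SAR-style network primitive: the node's OS answers only for services it hosts itself and relays everything else to the network. All type and method names here are illustrative, not part of any existing interface.

    import java.util.Set;

    // Illustrative sketch only: local-first service dispatch, with everything
    // else relayed to the network, as argued for in the text above.
    public class LocalFirstDispatch {

        // Assumed SAR-style primitive: the fabric resolves the service name itself.
        interface NetworkFabric {
            void forward(String serviceName, Object request);
        }

        private final Set<String> localServices;   // services this node can serve itself
        private final NetworkFabric network;

        LocalFirstDispatch(Set<String> localServices, NetworkFabric network) {
            this.localServices = localServices;
            this.network = network;
        }

        void dispatch(String serviceName, Object request) {
            if (localServices.contains(serviceName)) {
                serveLocally(serviceName, request);     // handled under local OS policy
            } else {
                network.forward(serviceName, request);  // remote management left to the network
            }
        }

        private void serveLocally(String serviceName, Object request) {
            // Local policy decides how the resource is granted to the application.
        }
    }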

1.4 RELATED WORK

A new routing paradigm is proposed in this work. As mentioned in section 1.2, distributed systems from the whole spectrum categorized in Figure 1.1 are possible application domains. Although SAR can be seen as a close cousin of other approaches found in Internet-based systems that use service addresses of some sort to locate functionality or resources, it differs in a number of key points. The main one is that Internet-based, GRID-like computing systems that use naming services are all designed for very loosely coupled systems and use the Internet as their transport medium. Setting aside the fact that SAR is at a different abstraction layer than Internet- and GRID-based resource discovery and management systems, there are still some conceptual differences worth emphasizing. Table 1.1 shows a feature comparison chart between SAR and its most similar relatives: CORBA [40, 26], Jini [41], Carmen [25] and Content-Addressable Networks (CANs [31, 15, 41]). These systems were selected because they represent families with similar characteristics, of which each system is the most important representative. This section is devoted to giving a short history of Distributed Systems and identifying the current trend of events, as well as comparing the system presented in this work to other similar systems. As will be seen in Chapter 2, SAR provides a natural way to implement load distribution and balancing, as well as fault tolerance, at any OSI layer.

1.4.1 FIELD HISTORY AND CURRENT TREND

There has been much work on distributed computer systems; it can be tracked over 30 years into the past, where the idea started with distributed databases [6], formerly shared databases [34]. This apparently ever-new subject actually started almost half a century ago, in 1959, when Leiner et al. [23] built PILOT. It was considered a Multiple Computer System, rather than a single, multi-module computer. Distributed Systems were not much of a subject after that for a couple of decades, until Leslie Lamport published the paper [21] in 1978 that would redefine the concept of Time within a Distributed System.

Quickly after that, the term Distributed System, which had been seldom used, started to take hold, producing Distributed File Systems [17, 36]. Eventually, the trend landed on the term Distributed Operating System with LOCUS [29, 42]. Thence started the Distributed Operating System hype that lasted the whole decade. Systems were implementing management of remote resources as part of the system calls, inside the OS. The model was clean, transparent and convenient for the applications, but incorporating all the logic inside operating systems made them bulky, inefficient and difficult to create, maintain and enhance.

The trend continued with the rise of libraries such as PVM [4] and MPI [38], eventually portable, that would link to applications to provide them with a dependable Network Transparency layer, otherwise known as a Single System Image (SSI). After the relatively stable start of the nineties, the emergence of the Internet as a globally available transport-layer network with reliable communications, standard protocols and a standard programming interface triggered a new wave of Distributed Systems: the Globally Distributed Systems, such as SETI@home [43], the GRID [11] and a plethora of other systems.

Service-Oriented Programming (SOP) is not new; in fact, it is a natural evolution from Object-Oriented Programming (OOP), just as the latter is from Structured Programming. Shared libraries publish services that applications use and profit from. However, when it comes to distributed systems, SOP has evolved slowly. First, UNIX's Remote Procedure Call (RPC [35]) allowed invoking services on remote machines by directly specifying the machine address. Naming services such as the Internet's Domain Name Service (DNS [22]) allowed for the flexibility of chasing nodes through address changes, but the caller was still bound to knowing which node held the wanted service. Evolution took these systems to broker-based distributed systems such as CORBA [26], but it wasn't until Jini [41] that a node-free architecture was available to the applications, at least from their own point of view. However, this apparent service-addressed architecture is bound to centralized directory servers and service-address to node-address translations, so communications are effectively carried out by addressing nodes instead of services.

At this point in time, distributed systems are left with technologies that have evolved at different paces and in different directions, depending on the scale and on how tightly or loosely coupled the systems are. For instance, distributed systems on chip are taking advantage of the high integration abilities of the industry to implement full SMP systems on chip (4-way expected for 2007). Single-machine multiprocessors, formerly SMP systems, are evolving into Non-Uniform Memory Access (NUMA) systems with Distributed Shared Memory (DSM), where every node houses an SMP system. Clusters of Workstations make use of MPI or MOSIX for resource sharing and management. Lastly, massively-parallel systems have evolved in two directions: MPP supercomputers using high-performance dedicated interconnection networks, and massively distributed systems using the loosely-coupled, high-jitter Internet for transport, and Service-Oriented architectures for application programming.


1.4.2 JINI

Jini is a service-oriented distributed system architecture. Every Jini service is described by a Java programming language interface. The basic components that build up a Jini system are (1) Clients, (2) Servers, and (3) Lookup Services. The Jini system execution is composed of different algorithms:

• At startup, all servers and clients discover the lookup services using multicast or unicast messages.

• All servers register their capabilities with the lookup services by sending ServiceItems. These are made up of three components: a Service ID, a Proxy object and a set of attribute objects.

• When a client wants to make use of a service, it sends the request to the lookup service, and in return it receives the Proxy object.

Since Jini's functionality is based on mobile code, there is no need to preinstall any proxies in the clients, as they will always be downloaded from the Lookup Service when needed. Once the client has the Proxy, it can use it to gain access to the service directly.

The Jini architecture incorporates the notion of a lease to maintain up-to-date databases in the Lookup Services as to which servers are actually usable. When a server registers with the Lookup Service, it receives such a lease, which describes the time during which the server will be accessible. If the server does not renew the lease before it expires, the Lookup Service will assume the service is no longer present, or that it is otherwise faulty. The concept of a lease can be extended to client-server communications to allow servers to stop helping clients that do not renew their leases, on the same basis of assuming that if a client does not renew the lease, it is either faulty, not present, or no longer cares about the service.
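
The sketch below models the registration/lookup/lease cycle just described. It deliberately does not use the real Jini (net.jini.*) classes; the small types here are illustrative stand-ins that only mirror the behavior discussed above.

    import java.util.ArrayList;
    import java.util.List;

    // Highly simplified model of Jini-style registration, lookup and leasing.
    public class JiniStyleLookup {

        static class ServiceItem {                 // Service ID, proxy, attributes
            final String serviceId;
            final Object proxy;
            final List<String> attributes;
            long leaseExpiresAt;                   // lease granted by the lookup service

            ServiceItem(String id, Object proxy, List<String> attrs) {
                this.serviceId = id; this.proxy = proxy; this.attributes = attrs;
            }
        }

        static class LookupService {
            private final List<ServiceItem> registry = new ArrayList<>();
            static final long LEASE_MS = 30_000;

            // Server side: register capabilities and receive a lease.
            void register(ServiceItem item) {
                item.leaseExpiresAt = System.currentTimeMillis() + LEASE_MS;
                registry.add(item);
            }

            // Servers must renew before expiry, or the entry is assumed dead.
            void renew(ServiceItem item) {
                item.leaseExpiresAt = System.currentTimeMillis() + LEASE_MS;
            }

            // Client side: ask for a service and get back the (mobile) proxy.
            Object lookup(String serviceId) {
                long now = System.currentTimeMillis();
                registry.removeIf(i -> i.leaseExpiresAt < now);   // prune expired leases
                for (ServiceItem i : registry)
                    if (i.serviceId.equals(serviceId)) return i.proxy;
                return null;                                      // not currently available
            }
        }
    }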

There are several [15] initiatives for using Jini on the Grid [12, 3, 16, 18, 37]. The most remarkable uses a Grid Access Point (GAP) server and goes under the name JGrid [18], formerly JM [19]. GAP Servers are to be used by Jini servers and clients in place of the Lookup Service, and they provide access to other remote lookup services (GAPs as well).

The Lookup Service is a centralized entity. Although several can exist, each is a centralized information repository for its own group. In SAR, when a hierarchical structure exists, each switching element is a module of a distributed equivalent of such an entity. However, a SAR network has several other capabilities, which include the following:

• Immediate notice of client failures or disconnections: Instead of waiting for the lease to expire, the network knows immediately when a link is broken.

• Load distribution capabilities: Although Jini can perform load distribution in a way, it cannot prevent a client from making use of a service several times before the lease expires, which makes it difficult for a Lookup Service to account for node workloads effectively. It is a characteristic of SAR that each service request is routed independently, and hence the network can control the placement of every service request in finer detail.

• Load balancing capabilities: The Jini Lookup Service entirely lacks the ability to stop running jobs and redistribute them amongst the servers to better balance load. However, it is an inherent property of SAR that the network is aware not only of the nodes' loads, but also of the precise tasks they are running and the capabilities they possess. Thus, it is simple to implement a mechanism to balance load amongst the nodes, especially if the services are checkpoint-aware.

• Homogeneous interface: Instead of receiving a stub or proxy through which to communicate, or specifying an interface through an Interface Description Language (IDL), nodes only request services and send arguments through a homogeneous interface, from the start.

As is evident, for Jini to be able to solve the same set of problems that SAR does, one would have to add Load Balancing systems on top of it, such as Macromedia's JRun [1] or the ones presented in [5] and [20]. This requirement imposes yet another layer on top of the many that are already present, and incurs a considerable amount of processing overhead in the nodes, whereas SAR achieves the opposite: it removes that kind of overhead from them.

1.4.3 CORBA

The Common Object Request Broker Architecture, widely known as CORBA, is another service-oriented distributed architecture, also stub- or proxy-based, developed by the Object Management Group (OMG). The basic entity is the Object Request Broker (ORB). ORBs are interconnected via the Internet, and they also connect to clients and servers. Conceptually, ORBs and network links together constitute the fabric that connects clients and servers. Object Services are used to find objects based on different criteria, like name or attributes. CORBA specifies that an Interface Description Language (IDL) be used to describe objects so that other applications and objects are able to make use of them. Descriptions expressed in an IDL can later be translated into a programming language and used as interfaces to communicate with objects. An example of an IDL is CORBA IDL.

The inherent nature of a CORBA system is to be an object repository. In this sense, what CORBA does is find a sought object, usually a stub for immediate transfer or an implementation for packet routing. In other words, CORBA will find an interface for client applications, and then effect communications between such an interface and the actual implementation. In essence, CORBA is also a proxy-based lookup system, with the difference that brokers are used to effect communications. Although there are Load Balancers for CORBA [27, 14, 33], they also incur more overhead for the nodes, and they impose an extra layer of algorithms and bookkeeping. These are the reasons why all of the advantages SAR has over Jini are also valid over CORBA.


1.4.4 CARMEN

CARMEN and DIET are two examples of hierarchical distributed resource locators. Their hierarchical architecture may look very similar to that of a SAR LCAN, and they solve the same basic problem, which is resource discovery. However, SAR has conceptual differences from these two systems as well. As mentioned before, it is the interconnection network, with its switching devices, that performs service discovery, hence the name "Network Embedded". Furthermore, SAR also solves the problem of managing such resources, whereas the related work is mostly devoted to resource discovery. A very important fact to keep in mind while comparing these systems is that SAR is a routing paradigm, which we are comparing side-by-side with systems that are implementations of a similar idea applied to resource discovery and management; hence, the application domain of SAR, as can clearly be seen, is much wider.

1.4.5 CONTENT-ADDRESSABLE NETWORKS

Content-Addressable Networks (CANs) provide an indexing architecture that allows location of resources based on their contents. The main problems faced by these architectures are defining the indexing keys, distributing resources, and routing. At a glance, they are distributed hash tables, with possibly several keys. The table is divided into zones following a binary tree-like structure, and each zone is stored in a node; hence, no node is a single point of failure. Requests are lookup operations over a key, and they are routed following the zone tree structure until they find the node corresponding to the zone that contains the data sought. Routing is effected by performing this lookup operation, and then a local lookup in the node holding the zone. When a node joins the CAN, a new zone is created by splitting an existing zone. When a node leaves, its contents are merged with a neighbouring zone to produce a larger zone.
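
As a rough illustration of this zone-tree idea, the sketch below routes a key lookup down a binary tree of zones and splits a leaf zone when a node joins. The hashing and split rule are simplified assumptions, not the exact scheme of the CAN literature.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: keys are routed down a binary tree of zones, each
    // leaf zone owned by one node; a joining node splits an existing zone.
    public class CanZoneLookup {

        static class Zone {
            Zone left, right;          // children after a split (internal zone)
            String ownerNode;          // owning node (leaf zone)
            final Map<String, Object> data = new HashMap<>();   // key/value pairs stored here
            Zone(String ownerNode) { this.ownerNode = ownerNode; }
        }

        // Route a lookup for `key` to the zone that stores it, descending one
        // tree level (one bit of the key's hash, illustratively) per hop.
        static Zone route(Zone root, String key) {
            int hash = key.hashCode();
            Zone z = root;
            int bit = 0;
            while (z.left != null && z.right != null) {
                z = ((hash >>> bit) & 1) == 0 ? z.left : z.right;
                bit++;
            }
            return z;                  // leaf zone owns the key
        }

        // A joining node splits an existing leaf zone into two halves.
        static void join(Zone leaf, String newNode) {
            leaf.left = new Zone(leaf.ownerNode);
            leaf.right = new Zone(newNode);
            leaf.ownerNode = null;
            // In a real CAN, the leaf's stored keys would now be re-distributed
            // between the two child zones according to the next hash bit.
        }
    }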

The basic difference from our project lies in the fact that SAR is a routing paradigm that uses a service identifier to locate a resource. This means that the interconnection fabric understands services and requests. In contrast, CANs are by definition node-driven: the nodes that hold the information are also the ones that realize the routing. Furthermore, CANs were conceived for distributing information, whereas SAR was meant for finding resources of any kind. However, there are many undeniable similarities; one could establish a correspondence between keys and service identifiers, and between resources and values. The advantages of using SAR, besides having separate entities for the discovery and routing processes instead of nodes, also include load distribution and balancing capabilities, and an unspecified network topology for the routing process that can implement fault-tolerant routing (with LCANs, for instance). Additionally, the routing information is considered to be safe in the network; it can be stored in a fault-tolerant fashion, whereas if a CAN node suddenly fails, the information therein would be considered lost.

1.4.6 SMILE

The name is an acronym for Shared-Memory in a LAN-like Environment. It is a software implementation of IPC mechanisms that sits on top of a commercially available hardware shared memory system called SCI. SCI stands for Scalable Coherent Interface; it is the IEEE/ANSI standard 1596 for shared memory. SMiLE uses a PCI-SCI bridge to interconnect computers physically, and on top of that single, shared address space it implements a series of IPC mechanisms, such as PVM, MPI, AM and BSD sockets. Their systems are tested against Ethernet-based communications, and are shown to be roughly 10% faster in general.

1.4.7 CLUSTER LIBRARIES AND DAEMONS

As mentioned before, middleware can be in the form of libraries. Such libraries link to the applications, and there is usually a daemon running on the nodes to perform the actual management. Therefore, in these systems, Distributed Resource Management is at the application layer, slightly below user applications.


One can find many examples of such systems. In this section we discuss the most important ones: PVM, MPI, Charm++ and Condor.

PVM AND MPI

PVM is an abstraction of the underlying hardware architecture into a "Parallel Virtual Machine". MPI is defined, in contrast, as a "Message Passing Interface" for parallel applications. Although from a simplistic user point of view they are both message passing APIs, there are design differences that become evident as we consider the design goals for each of them.

PVM was created to join different and diverse computing machines so that the user can use them as though they were a single, bigger machine. With this in mind, the creators built a system that allows communications among computers of different architectures, and even among different implementations of PVM. Also, as part of the design, the PVM system supports fault-tolerant applications by means of notifications about faulty tasks or nodes.

MPI, in contrast, was created to allow fast message passing to be effected within a fairly homogeneous computer cluster or an MPP. The motive was to standardize the interface provided by MPP vendors so that applications could run unmodified on different parallel computers. However, different implementations of MPI are not required to intercommunicate, and they usually do not support such a feature.

In short, PVM has given up speed for interoperability and fault tolerance, and has a simple interface. MPI, in contrast, has centered its efforts on creating a vast and very flexible interface for message passing, so that implementors can focus on performance.

The Message Passing Programming Paradigm is different from the Service-Oriented Programming Paradigm in many aspects. For starters, Message Passing relies on applications sending messages to explicit application instances, whereas the Service-Oriented paradigm removes the necessity of knowing server nodes or processes beforehand. It is a remarkable basic property of PVM/MPI that there is no preset mapping from tasks or processes to nodes, but the fact that the server task has to be known still remains. Service Addresses can be thought of as the next step, eliminating any static mapping from functionality to task.
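
The following sketch contrasts the two calling conventions. Neither interface is real: MessagePassing mimics an MPI/PVM-style send to an explicit task, while ServiceAddressed mimics the kind of call a SAR programming interface would expose; both names and signatures are assumptions made for illustration only.

    // Conceptual contrast between message passing and service addressing.
    public class ParadigmContrast {

        interface MessagePassing {
            void send(int destinationRank, byte[] message);   // caller must know WHO serves
        }

        interface ServiceAddressed {
            void request(String serviceName, byte[] message); // caller only names WHAT it needs
        }

        static void invert(MessagePassing mp, ServiceAddressed sar, byte[] matrix) {
            // Message passing: the mapping "matrix inversion lives in task 3"
            // is fixed in the application code or in its launch configuration.
            mp.send(3, matrix);

            // Service addressing: no task or node appears in the call; the SAR
            // network binds the request to any capable, lightly loaded server.
            sar.request("matrix-invert", matrix);
        }
    }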

CHARM++

Charm++ is defined as an object-oriented portable parallel programming language based on C++. However, it is implemented as a set of language extensions that are translated at compile time into C++ language constructs or library function calls. It was developed at the University of Illinois at Urbana-Champaign, in the Parallel Programming Laboratory. The project started in the late 1980's; however, the first implementation included only an interface to remote method invocations and the corresponding library to support it. Later implementations included language extensions, object migration, and finally C++ bindings.

The main advantages of Charm++ reside in the fact that each parallel object (chare) is an independent entity that can reside virtually anywhere. Chares are mapped to processors dynamically and can be migrated by the run-time supporting infrastructure (charmd, the Charm++ daemon). The overall interface was designed to somewhat resemble that of MPI, and one compiles and runs Charm++ applications in a similar manner (using charmc to compile and charmrun to execute an application).

The main differences between Charm++ and MPI/PVM are that the former is object-oriented and includes communication primitives other than message passing. Chares can also communicate using shared objects.

There is an MPI implementation underway called "Adaptive MPI" that runs on top of Charm++ and incorporates the process migration provided by the latter. At the time of writing, it implemented the full MPI 1.1 standard and parts of MPI 2.0.

Although similar in certain capabilities, Charm++ and SAR serve different purposes. Charm++ is basically a language that, similarly to the way MPI is being implemented on top of a Charm++ implementation, can be implemented on top of SAR using the network's remote procedure distribution and migration capabilities. This would make the Charm++ implementation simpler and make it easier to add new features and to maintain.

CONDOR

Condor is a specialized workload management system for compute-intensive jobs. Its main objective is to stalk workstations, waiting for them to become inactive, so that the system can reclaim their computing power for the job pool. Computers join and leave the Condor "usable" pool when they are inactive and when they are being used, respectively. This system works similarly to SETI@Home; however, its design and implementation were completely independent.

Condor was designed to run user applications remotely, but it simulates a local run by means of its Remote System Calls. When a process triggers a system call, the runtime environment will catch it and send it to the host that submitted the job. Thereby, Condor does not require users to have login privileges on every machine of the computing pool.
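
A rough sketch of this remote-system-call idea follows, under the assumption of a simple channel back to the submit host; this is not Condor's actual mechanism, and all names are illustrative.

    // Illustrative only: a trapped file-access "system call" is shipped back
    // to the submit host instead of being executed on the borrowed machine.
    public class RemoteSyscallSketch {

        interface SubmitHostChannel {
            byte[] readFile(String path);          // executed on the submit host
        }

        private final SubmitHostChannel submitHost;

        RemoteSyscallSketch(SubmitHostChannel submitHost) { this.submitHost = submitHost; }

        // Stand-in for an intercepted open()/read(): the execute host never
        // needs the user's files or login credentials locally.
        byte[] open(String path) {
            return submitHost.readFile(path);
        }
    }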

There is a Condor module (a "submit universe") that allows users to submit PVM jobs into the Condor job pool. The advantage of using Condor to start PVM jobs resides in the fact that only Condor-available nodes will run the PVM job. If a PVM job is started outside the Condor environment, it will run on any hosts regardless of whether they are currently being used, and may degrade interactivity for a user. When a node is added or removed (because of user absence/presence), the Condor system notifies the user applications using the PVM mechanisms for that purpose.

Similarly to Charm++, Condor, MPI and PVM can all be implemented using SAR as the underlying framework. These systems are meant to serve purposes different from each other and also from those of SAR. SAR is both a network architecture and a programming paradigm that allow native resource management within the interconnection fabric. It can be the basis onto which these and many other systems (as will be seen in the next section) can be implemented.


1.4.8 DISTRIBUTED OPERATING SYSTEMS

In the field of Distributed Operating Systems, much of the work started during the eighties and lasted until the mid-nineties. Alfred Lupper, from the University of Ulm, published a list of such projects, which was last updated in 1995. Few successful projects have started since (e.g. Kerrighed, 1999); the tendency is now towards service-oriented distributed resource locators over the Internet. The systems we are going to analyze are the most successful representatives of their design model.

LOCUS

In 1981, the team led by Gerald Popek at UCLA developed the first real Distributed Operating System. It was developed by Bruce Walker as his thesis project. By real we mean that it was developed from an architectural design, instead of adapting a network operating system to allow distribution of resource management. Of the three main resources an Operating System must manage, process and filesystem management were distributed transparently. There was no evidence of a Distributed Shared Memory implementation within the system.

The design model of this system included a distributed file system and distributed process management, while maintaining the UNIX user and application interfaces for upwards compatibility. Besides these main features, they included others which proved necessary, such as distributed synchronization and reliability mechanisms, copy coherence, and simplicity of implementation to increase performance. LOCUS ran on PDP-11s and VAXen connected through 1Mbps and 10Mbps Ethernet links.

AMOEBA

Developed by Andrew Tanenbaum and his team, Amoeba is a one-of-a-kind Distributed Operating System. Its architecture was very innovative and advanced for its time. Although the system itself has not been very successful regarding its user base, many of the basic concepts on which it was built have been borrowed by "modern", current operating systems.

The system has undergone major makeovers throuhgout its lifetime, especially during

the 1980’s decade[1981, May 1990]. One of the immutable features of Amoeba is the

client-server architecture; applications request attention by means of service requests using

datagram messages, as opposed to System Calls via software interrupts used by traditional

(UNIX, CP/M) systems. Originally, object orientation was not an important design subject;

it was disregarded by the valid argument that servers could be implemented by objects,

which meant the approach was similar. Currently, the system claims to be “based on ob-

jects.” Similarly, there was an original distinction between the “ports” used to access servers

and the capability model, whereas now processes communicate via capabilities (which are,

in essence, the old ports).

Essentially, the nodes are categorized into “Processors”, Workstations, Specialized Servers and an optional Gateway for external network access. Processors are grouped in a logical “Processor Pool”, from which processors can be grasped and released as needed by user processes. The operating system (the Monitor Layer in Amoeba's 6-layer design) relays datagrams between processes and keeps a record of where processes ought to be found. Amongst the most important design features present in Amoeba are the following:

• Service-Orientation. Whereas modern systems slowly move towards the client-

server architecture, Amoeba was conceived from the start in such a way that it could

be regarded as an in-node emulation of basic service addressing as the OS foundation

and native IPC mechanism.

• Multithreading. Heavyweight processes had dominated the landscape until recently, when lightweight processes (or the extreme thereof, threads) were allowed to coexist in a process address space.

• API Emulation. Adopted by several other operating systems (e.g. Mach, FreeBSD),


API emulation is perhaps one of the features that can give an operating system utmost

flexibility. It provides the operating system with the ability to have a core that

is independent of the applications that will be run on it. By plugging diverse API

interfaces, an operating system can run native applications, as well as those coded

for other operating systems. Amoeba included from the beginning a UNIX API

emulator to be able to run programs written for UNIX as well as native ones.

• Microkernel. At a time when monolithic kernels were the only choice, Amoeba started the trend of Operating Systems with a minimalistic, message-passing kernel,

where not only do all resource management functions run outside the kernel space,

they run in user space. There are many advocates for and against this architecture;

however, that discussion is completely out of the scope of this work.

In short, Amoeba can greatly benefit as a SAR application, by merely relaying all non-local messages to the network, since it was ready for this kind of facility from its inception. The fact that there were no SAR networks available at the time meant the mechanism had to

be implemented in the OS kernel.

SSI OPERATING SYSTEMS BASED ONL INUX

At the dawn of Distributed Operating Systems, most people were working with UNIX (e.g.

MOSIX), altering it and creating partial or total SSI support. With the growing popular-

ity, maturity and accessibility of BSD, and Linux afterwards, most projects transitioned

to one, then to the other OS. Currently, most people doing Operating System (especially

Distributed Operating System) research are using Linux as the testbed for their tools and

algorithms. In this section we discuss the three major such Distributed Operating Systems

that provide an SSI.

• MOSIX. It started as a project called “UNIX with Satellite Processors” in 1977. It was a modification to Bell Labs' UNIX for PDP-11s. The project underwent several


modifications regarding its name (later it was called MOS, then NSMOS, and at last

MOSIX), target operating system (UNIX, then BSD/OS, then Linux) and machine

architecture (from PDP-11, Motorola 68000, NS32x32, VAX, and finally Intel). They

are currently working on an amd64 port.

Bugs have been corrected, features have been added, algorithms have been enhanced

and replaced, and the user interface as well as the installation and usage procedures

have been improved; however, its main design remains unchanged: it is a kernel patch

for a UNIX(-like) Operating System that allows a set of networked independent

workstations (possibly with multiple processors each) to work as a single machine

where all their processors reside. MOSIX supports preemptive process migration as the vehicle for dynamic load distribution. In 2002, Moshe Bar forked MOSIX into a similar project called openMosix.

In 2006, the implementation of MOSIX2 was completed. It allows clusters to interconnect, thus creating a multi-level system where clusters can host guest processes originating in other clusters. Some care was taken to ensure guest processes were migrated away from a cluster being disconnected from the federation. With this move, MOSIX intends to support metacomputing and organizational grids.

• Kerrighed. Started at INRIA by Christine Morin in 1998, it was conceived as an object-oriented distributed operating system for clusters. It is similar to MOSIX in both its objective and the fact that it is an extension to Linux, but the internal architecture is completely different. Kerrighed features Distributed Shared Memory

with sequential consistency, so applications work as though they were running in

a single computer even if different threads are running on disparate nodes. It also

features an SMP SSI, and process migration between nodes.

One of the key features of Kerrighed is the concept of Containers, which are objects

used to store information. These objects can reside either in main memory or in the


filesystem. Kerrighed also provides support for checkpointing.

• OpenSSI. This is the newest of all Linux-based cluster operating systems. The project started in 2001, and it has gained considerable momentum since 2005. It is implemented as a set of tools and patches to the Linux kernel. Its features include a distributed filesystem, shared memory and a shared IPC address space, among others. It is based on the Cluster

Infrastructure for Linux project.

A comparison by Renaud Lottiaux et al. of these three systems shows that for the most part they perform very similarly, with the exception of some experiments used to

show some optimizations in the Kerrighed system that enhance migration of processes and

the IPC performance after migration.

These systems are different from SAR. Combined with Linux, they become Distributed Operating Systems for high-performance cluster computing. They support a Single-System Image by means of their ability to place tasks on a remote node and to migrate those tasks later on if it becomes necessary. A SAR network supports both functionalities in a natural way, through the service call interface, and operating systems like these can benefit from such a facility at different levels. However, they would need many modifications to acknowledge a SAR network and to be able to use the service call interface. Namely, a whole set of SAR protocols has to be implemented, and it has to be placed as a layer underneath the mechanisms mentioned.

CHORUS

At a glance, Chorus is based on three main concepts: Actors, Ports and Threads. An actor is a passive entity, a collection of resources which can be used by running threads. An actor is bound to a site, which means it is not migratable. A crash on a site means a total crash of

all its actors. A port is a communication entry/exit point for an actor. Ports can be arranged

in groups and can be migrated between sites, as can threads. Threads are sequential sets of

instructions that run on actors and communicate through ports.


The communication mechanism in ChorusOS is based on capabilities and uses the RPC mechanism with Message Passing. The system has a microkernel architecture with a Nucleus underneath it all, called Chorus/Mix. In its version 3, the authors added a UNIX emulation layer, so as to make it compatible with what had become the de facto standard. This project started in 1979 at IRISA, in France. In 1997, Sun Microsystems acquired the company Chorus Systems. In 2002, the founders started a new company called Jaluna (now VirtualLogix). Today, one can find the ChorusOS microkernel in portable devices such as Alcatel mobile phones.

CLOUDS

This system features a Shared Memory implementation. All communication must be done

through this mechanism, for there is no support for Message Passing. However, most

message passing libraries (e.g. MPI) contain implementations which are optimized for

shared-memory systems. The Clouds Operating System is based on Distributed Objects,

which are passive entities that can be either persistent or volatile.

The system architecture segregates nodes into three categories: Workstations, Data Servers and Compute Servers. Workstations are regular UNIX desktop systems through

which users connect to the greater Clouds system. Data Servers provide the volatile and

persistent storage entities used to store the Objects. Compute Servers are nodes that access

Data Servers for information and provide their processing units for running user threads.

Both Data Servers and Compute Servers run the Clouds native microkernel called Ra after

the Egyptian Sun god.

Clouds is based on the Object-Thread paradigm (much like Sun’s Java), featuring network-

transparent Method invocation. The languages supported are Distributed C++ and Dis-

tributed Eiffel.


2K

This is a Distributed Operating System implemented as a middleware layer based on CORBA

that runs on top of various widely available operating systems (such as Windows and

Linux), as well as a native microkernel called Off++. The system is composed of sev-

eral layers, the topmost of which is a Reflective ORB. The two most important features of

the 2K Operating System are automatic system reconfiguration and Reflective Middleware.

The latter refers to a layer that can control its level of translucency, giving the applications

control over the information they can receive from the underlying system. Off++ has been

discontinued; its sequel, which is both a redesign and rewrite, is called Off++v2 or Plan B.

The fourth edition of Plan B runs on top of Bell Labs’ Plan9 Operating System.

1.5 HYPOTHESES

What we want is the OS to be conceptually on top of an SSI provided by the network,

so that all of the related overhead is stripped from it. Handling malfunctioning nodes

would be left out of the OS as well, since it is no longer its responsibility to support an

SSI. An OS could be oblivious to the global structure of the system, just being interested in

the local node it is responsible for, and whether a request could be handled locally or not.

Lastly, load balancing and node selection for any purpose would be a responsibility of the

network.

1.5.1 REQUIREMENTS

We have identified the following requirements regarding the expected functionality from

the network:

• Load distribution.

• Resource selection (structural or functional) for a request.


• Discarding defective resources.

• Handling resource changes.

In turn, the OS has the following requirements:

• Access to local hardware.

• Access to the network.

• Create an application environment in which an application is unaware of the network.

• Optionally, provide services to the network.

• Decide which requests from the application can be handled locally, and which can

not.

• Forward requests to the network.

The first two requirements coincide with the first requirement previously listed in Sec-

tion 1.2. The third requirement corresponds to the second in that list. The fourth requirement mentions a capability which is present in any networked system. Lastly, the fifth

requirement is an addition, but if we consider the overhead created by this responsibility,

we can claim it is negligible.

The requirements for the model can be enumerated thus:

• A fully transparent system should be presented for the OS to use.

• Network topology and implementation details should be hidden from the OS.

• The means for the nodes to get access to any resource or functionality should have

nothing to do with any node’s network address.

• Node failures should be transparent to the rest of the nodes.

• Node addition/removal should be transparent to the rest of the nodes.

• There should be transparency regarding which node is selected to serve a request.


1.5.2 PROPOSED NOVEL APPROACH

Two solutions are proposed in this work, to be combined in the resolution of these problems. First, a new routing paradigm. Second, the delegation of all resource discovery and allocation, and of load balancing amongst them, away from the nodes and into the interconnection fabric. Thus we arrive at our two hypotheses, which are central to this work.

The first hypothesis is that resources, from now on referred to as services, should be addressed independently and individually by type, regardless of the nodes responsible for them. Using this scheme, nodes lose all identity, and the interconnection network now has two purposes: to physically interconnect nodes, and to logically interconnect services. The claimed advantage is a clean architecture where components can be inserted and removed seamlessly, and a platform where applications can use a clean interface to reach all resources, local and remote, without having to identify the nodes through which to access them.

The second hypothesis is that it is natural to push all distributed resource management intelligence into the network, with the help of said routing paradigm, by placing the relevant burden on the switching elements. The claim is that this helps nodes both gain efficiency and lose complexity, regardless of whether the switching elements are a pure hardware implementation or general-purpose processors running software, by reducing the nodes' responsibilities.

By following these two new lines of thought, a careful interconnection network design should yield a system that fulfills the requirements set out above and in Sections 1.1 and 1.2.


CHAPTER 2

SERVICE ADDRESS ROUTING

The main function of a network for tightly coupled distributed systems is to enable commu-

nications between nodes by moving packets between source nodes and destination nodes

[24]. Traditionally, source nodes specify the physical addresses of destination nodes ac-

cording to some protocol (e.g. IP, Ethernet). The extra functionality provided by SAR is

a means to select destination nodes without directly knowing their physical addresses, but

instead by requesting some functionality. All a sender is required to know is a Service Address, which differs from a Node Address in that it is neither unique to a node, nor does a node have to have a unique one.

2.1 DEFINITION

A Service is an abstraction used to represent capabilities that can be provided by individual nodes to the rest of the system. In this manner, when an application requires some functionality to be available, it has two choices: it can either provide it by itself, or be able to find the address that was assigned to that functionality in the system. For the sake of comparison, Figure 2.1(a) shows how traditional physical-address routing is done, while Figure 2.1(b) depicts service-addressed requests. A network that incorporates SAR is said to be intelligent, and is therefore called an Intelligent Interconnection Network (I2N).

There are two main kinds of services: those that are provided statically by a node, and those that are delegated to the network dynamically by applications. The first kind can be used by nodes that wish to provide services that for some reason are tightly linked to that

specific node. The second kind of services, however, are delegated by nodes for one of two

reasons: either they are shared libraries implemented using the SAR paradigm, or they are


Figure 2.1: Comparison of SAR and Routing based on Node Address. (a) Traditional Routing; (b) Service Address Routing.

meant to be used by the same node later during runtime. In any event, services of the second

type are allocated by the network, which can then balance both the allocation of services

and requests. Services that are distributed by the network for a particular application are

called Remote Procedures (RPs)1, and can be used to provide solutions that use concurrency

as a means to accelerate execution, when enough nodes are available.

2.1.1 SAR AND THE OSI MODEL

We have discussed the trend to push functionality away from the application into lower

layers, sometimes creating new layers for that purpose. As more functionality makes it

further down, far away from the applications, the latter just need to account for the interface

presented to them, which is the means to provide and request services. The Operating

System will present the applications with an interface that provides the functionality they need, while underneath performing presentation-layer activities. The reasoning to put this layer in the OS is to allow heterogeneity and extensibility. Although the network can perform these activities, it can only be configured so far as the protocols allow, whilst if the OS does it, any machine can take advantage of the network, as long as it has a suitable

presentation layer implemented.

The reason for having an application layer also at the DOS level is that the latter can as

1 Not to be confused with the UNIX™ RPC, which calls Remote Procedures in explicitly named nodes.


well provide and request services to and from the network. Examples of such services are

those intended for distributed storage, load balancing, and task migration. In sum, what the middleware layer will end up doing is just providing servers and translating among

different encoding systems. In turn, the network will be implementing protocols for inter-

active sessions, transmission, QoS guarantees, resource location, and routing, among other

possibilities.

2.1.2 RESOURCE MANAGEMENT

SAR is a powerful paradigm that facilitates the implementation of certain distributed re-

source management tasks. In this section, an outline is presented showing how some of

these functions can be realized at a very low cost. Recall that access to remote process-

ing elements is inherent to the SAR paradigm. In fact, the access is achieved by proce-

dure/service name as opposed to physical address. A fair load distributor is hence critical

to the proper initial services allocation. Because SAR subsumes such a load distribution

function in the network, the first reliable resource management function is already assumed

present. In the following, a brief discussion of other resource management functions is pre-

sented.

RESOURCE DISCOVERY

The ability to find available resources capable of rendering a required service is called Resource Discovery [13]. When using SAR, services can be registered at power-up or at any

other time during system operation. The whole system is aware of service availability as

soon as a node registers its service(s) on the network. Resource Discovery is incorporated

into the system through an application interface. It informs the applications of available

services described in the Service Table (ST) associated with the application's network port.


LOAD DISTRIBUTION AND BALANCING

Load distribution is automatically performed by the network at the time of pairing service requests with specific nodes. This concept can be extended to provide load balancing by giving the network the ability to migrate services available in the nodes. Load Balancing is accomplished by allowing the network to initiate service and/or RP migration depending on a load balancing criterion. The latter may be sender-initiated or demand-driven [10]. Note that the load state of the nodes can be made available to the network load balancer using local self-monitoring functions in each node's kernel.

POWER MANAGEMENT AND ONLINE REPAIR

Power management is a critical issue in a large class of embedded systems, especially

those in airborne vehicles. A SAR network can manage the power consumption, as it can be given the ability to turn individual nodes on or off according to the dynamic service and performance needs of the applications [39]. Nodes whose services are not being utilized can be turned off and awakened only when their services are required. Also, when speedup calls for the deployment of services among a large number of nodes, dormant nodes can be awakened to allocate more resources to the critical application, while allowing

them to sleep under normal conditions. This idea can be generalized to provide on-line

repair/maintenance of system nodes.

2.2 PROPOSED IMPLEMENTATION

Before designing an implementation of a SAR network, some assumptions have to be made.

First, and without loss of generality, assume that only nodes connect to ports of the switch

(network), and that the services provided by each of the nodes are associated to the corre-

sponding network port. Thus, given services “are available through” given specific ports.

Also without compromising generality, the whole network can be said to be implemented


as a single switch, so that any switch organization would be considered the internals of that

“global” switch. Finally, the network protocols used as a base for the system are assumed

to have a physical addressing scheme.

2.2.1 A SERVICE-ADDRESS ROUTING SWITCH

This section shows the process of creating a SAR network. Instead of building it from scratch, the idea is to start with a basic Ethernet switch, in order to show that a SAR network can be obtained with just a few modifications. The reason for selecting Ethernet was neither quality nor performance, but popularity. After all, any layer-2 switch would have

sufficed.

The first modification to the switch is a table local to each port, implemented in Content-Addressable Memory (CAM). This Service Table registers all services available through the associated port. When a request is issued, a matching is carried out simultaneously by all ports. When a port determines that its associated node can provide the service, a flag is raised. There is a mechanism to select one of the affirmative repliers, and the criteria for selection can vary widely. Note that the selection criteria can be based on node utilization, load distribution, etc. Hence the claim that resource management functions can be transferred to the network itself. To effect the routing, two strategies can be followed: either the network imprints the source and destination physical addresses in the packet, or a record is kept in a table associating the request packet with the original node's port or address. The choice of which strategy to implement depends on the desired compromise between speed and flexibility.
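
To make the lookup-and-select step concrete, the following is a minimal software sketch (in C) of what each port's CAM-backed Service Table does; all names, sizes, and the lowest-workload selection criterion are illustrative assumptions, not part of the switch design specified here.

/* Illustrative sketch (C) of the per-port Service Table match and the
 * selection among affirmative repliers. All names, sizes and the
 * lowest-workload selection criterion are assumptions for illustration. */
#include <stdint.h>
#include <stddef.h>

#define ST_ENTRIES_PER_PORT 64

struct st_entry {
    uint32_t service_id;   /* Service Address registered through this port */
    uint32_t workload;     /* last-known load of the node behind the port  */
    uint8_t  valid;
};

struct sar_port {
    struct st_entry st[ST_ENTRIES_PER_PORT];   /* models the per-port CAM */
};

/* Emulates the CAM query: can this port's node provide the service?
 * Returns 1 and reports the node's workload if so. */
static int port_matches(const struct sar_port *p, uint32_t service_id,
                        uint32_t *workload)
{
    for (size_t i = 0; i < ST_ENTRIES_PER_PORT; i++) {
        if (p->st[i].valid && p->st[i].service_id == service_id) {
            *workload = p->st[i].workload;
            return 1;
        }
    }
    return 0;
}

/* Select one affirmative replier; here the criterion is lowest workload,
 * one of the many criteria the text allows. Returns a port index, or -1
 * if no port can provide the requested service. */
int select_provider_port(const struct sar_port ports[], int nports,
                         uint32_t service_id)
{
    int best = -1;
    uint32_t best_load = 0;

    for (int i = 0; i < nports; i++) {
        uint32_t load;
        if (port_matches(&ports[i], service_id, &load) &&
            (best < 0 || load < best_load)) {
            best = i;
            best_load = load;
        }
    }
    return best;
}

In a hardware realization, the inner loop would correspond to the CAM's parallel match and the outer loop to the simultaneous query of all ports; the flag raised by a matching port corresponds to port_matches() returning 1.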

The services a node is capable of rendering can be assigned statically at system power up

or dynamically allocated, say by applications, at run time according to the computational

needs of running procedures. An application can request that a number of instances of

a particular service be replicated among the system nodes for the purpose of distributing

the load and gain speedup. It is clear that such a request, as well as the set of instances


Figure 2.2: A two-level 3x2 LCAN

created for it, is a single job. In this scenario, a global table is kept to record the load per

node, the instances that are assigned to each node, and what jobs those instances belong to.

This table is called the Remote Procedure Table (RPT). With this table, communications

between instances of a same service can be effected by specifying the destination (instance,

job) pair. Furthermore, if an instance is moved to a different physical node, the switch just

needs to update the table and the instances can keep communicating seamlessly.

Note that through this enhancement process, the traditional routing capabilities (assumed to be a link-layer protocol) remain untouched, so that both link-layer traffic and SAR can coexist. Moreover, SAR can be mounted on top of them; thus, instead of replacing a link-layer scheme, SAR provides a network-layer set of protocols that can coexist with others. The choice of where exactly to place the SAR protocols in the OSI layered model depends on how tightly coupled the system should be, and what level of flexibility and interoperability is required.

2.2.2 A HIERARCHICAL SAR SWITCH USING LEAST COMMON ANCESTOR NETWORKS

Several switches can be interconnected in a hierarchical structure to provide scalability

and reliability. The hierarchical structure chosen for this example is the Least Common

Ancestor Network (LCAN[32]).


THE LEAST COMMON ANCESTOR NETWORKS

Depicted in Figure 2.2, an LCAN provides a means to create a multistage network with distributed knowledge. At a glance, an LCAN is a network in which switch levels are interconnected by a well-defined inter-level wiring pattern. Each switch connects only to switches on the levels immediately higher and immediately lower. The ports in an LCAN switch are labeled “uppers” and “downers”. Upper ports connect to switches in a higher hierarchical level, while downers connect to switches in a lower level. Downers keep routing tables, while uppers do not. When a packet's destination node is not reachable through any downer, the packet is sent through any available upper. If there is no higher level in the hierarchy, the packet can be dropped. The main advantage of using LCANs is that each module needs little knowledge of the global topology to perform effective routing. Another advantage of using an LCAN resides in the number of hops needed to reach a destination: it grows proportionally to the logarithm of the number of nodes in the system. Note that it is these inherent LCAN characteristics that enable system scalability.

AN LCAN ETHERNET SWITCH

To produce an LCAN Ethernet switch, few modifications must take place, and they relate to making each switch behave as a module in a more complex structure. Each module has to know which ports to consider as uppers, and which to consider as downers. Uppers will be stripped of their routing tables, as they are no longer needed. When a packet's destination cannot be located by a particular module, the packet will have to be sent upwards through any available upper. If there are no nodes connected to uppers, instead of forwarding the packet through the uppers, it will need to be broadcast through all downers.

A SAR LCAN SWITCH

This section describes the three additions that need to be made to each LCAN Ethernet

Switch in order to perform SAR and become a SAR LCAN Switch (Figure 2.3). First,


Figure 2.3: Overview of a SAR Switch Design

a Service Table (ST) will be linked to each downer of every switch. Second, a Remote

Procedure Table (RPT) will be created for each switch in the hierarchy. Lastly, a Message

Table (MT) will be created to keep track of messages in transit at any given time. The

following sections describe these additions in detail.

The Service Table. Each port will have a content-addressable Service Table (ST) to store Service Addresses, and each entry will have an associated MAC Address. Each time a

request for a service is received, all STs will be simultaneously queried for that key. If no

affirmative response is obtained from any table, an upper is selected and assigned to the full

message. If there are several affirmative responses, some criterion will be applied to choose

one of them and the switch will proceed as though only that response was generated. When

there is an affirmative response, the message will be assigned to the selected port. If the

switch is in the lowest hierarchical level, the corresponding MAC address for the service

provider will be imprinted as the destination address for all packets in the message. Lastly,

if the switch is in the highest hierarchical level and there is no affirmative response, all

packets in the message will be dropped and a “service-not-available” packet will be sent to

the requestor.

The Remote Procedure Table. Each switch will have a content-addressable Remote

Procedure Table (RPT). It will store a function of Job IDs and Instance numbers, and each entry will have an associated MAC Address. Each time a request for an RP is received, the

distributed RPT will be queried for that key. If no positive response is obtained, the full


(a) Ethernet Header (b) SAR Header (right)

(c) The Control Field (d) The Destination Structure

Figure 2.4: SAR Protocol Packet Header Fields

message is dropped and a “service-not-available” packet will be sent to the requestor. If no

error occurs, the appropriate MAC address will be imprinted as the destination address for

all packets of the message, and the message will be associated to the port through which

the provider is reachable.

The Message Table. A Message Table (MT) is included in each switch. It is imple-

mented as a hash table, where keys are a function of message numbers and originating

MAC addresses, and values are ports and destination MAC addresses. Each incoming SAR

packet is checked against the MT. The destination MAC address, if non-zero, is imprinted

as the destination address in the packet, and the packet is sent through the recorded port.

As the only packets that cannot be found in the MT are the first packets of each message,

single-packet (trivial) messages can never be registered. Last packets of non-trivial mes-

sages cause the deletion of a record in the MT.
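
As a rough illustration of the Message Table's role, the following C sketch keeps (message number, originating MAC) to (destination MAC, output port) associations; the text calls for a hash table, but for brevity this sketch uses a plain array with linear search, and every name and size here is an assumption.

/* Illustrative sketch (C) of the Message Table described above. The text
 * specifies a hash table keyed by (message number, originating MAC); this
 * sketch uses a plain array with linear search for brevity. */
#include <stdint.h>
#include <string.h>

#define MT_ENTRIES 1024

struct mt_entry {
    uint32_t msg_number;
    uint8_t  src_mac[6];
    uint8_t  dst_mac[6];
    uint16_t out_port;
    uint8_t  in_use;
};

static struct mt_entry mt[MT_ENTRIES];

/* Record the routing decision taken for the first packet of a message. */
int mt_insert(uint32_t msg_number, const uint8_t src_mac[6],
              const uint8_t dst_mac[6], uint16_t out_port)
{
    for (int i = 0; i < MT_ENTRIES; i++) {
        if (!mt[i].in_use) {
            mt[i].msg_number = msg_number;
            memcpy(mt[i].src_mac, src_mac, 6);
            memcpy(mt[i].dst_mac, dst_mac, 6);
            mt[i].out_port = out_port;
            mt[i].in_use = 1;
            return 0;
        }
    }
    return -1;  /* table full */
}

/* Look up the decision for a follow-up packet; the entry is deleted when the
 * last packet of a non-trivial message goes through, as the text requires. */
int mt_lookup(uint32_t msg_number, const uint8_t src_mac[6],
              uint8_t dst_mac[6], uint16_t *out_port, int is_last_packet)
{
    for (int i = 0; i < MT_ENTRIES; i++) {
        if (mt[i].in_use && mt[i].msg_number == msg_number &&
            memcmp(mt[i].src_mac, src_mac, 6) == 0) {
            memcpy(dst_mac, mt[i].dst_mac, 6);
            *out_port = mt[i].out_port;
            if (is_last_packet)
                mt[i].in_use = 0;
            return 0;
        }
    }
    return -1;  /* not found: first packet of a message, resolved via ST/RPT */
}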

2.2.3 PROTOCOLS

A SAR protocol stack must be implemented for the nodes, and the switch modules must be designed to respond to this set of protocols. This section shows a layer-3 example that allows trivial messages to be routed. It will be referred to as SAR-3. In order


to distinguish SAR-3 packets from those of any other protocol, an identification number must be chosen

from an unused range. Layer-2 headers (e.g. Ethernet) usually provide a field to store this

number. Figure 2.4 shows both protocols’ headers. The following paragraphs explain each

of the fields present in the SAR-3 header.

Control Field. This 16-bit structure holds a 4-bit number and an unused bit field for

future use by control bits (Figure 2.4(c)). This number holds the type of packet (ToP). The

values for that field are listed in Table 2.1.

ToP Value   Type of Packet
0           Request
1           Reply
2           Register
3           Unregister
4           Set as Busy
5           Set as Available
6           Instantiate RP
7           Collect RP
8           RP Communication

Table 2.1: Type of Packet (ToP) meaningful values

The remaining values are reserved for possible extensions.

Number of Instances. This field specifies a number in the range 1…65536, which indicates how many instances need to be created. This field is only valid when the

packet is an “Instantiate RP” packet.

Destination/Request ID. The Destination address is stored within; it can be a Service

Address or a Destination Structure. The Service Address is assigned to a service at the time

it is registered with the network. The Request ID is the Number of the packet that carried

the request for which this reply is sent. A Destination Structure (Figure 2.4(d)) has two

fields: Job ID and Instance ID. The Job ID is created when a new Instantiation process is

carried out, and the Instance ID is provided by the requesting Instance.

Packet Number. Every node imprints this 32-bit number in all packets generated. This

number is intended to keep track of the number of SAR packets sent to the wire. Every


time a new packet is generated, the count is incremented by one. The purpose of this field

is to distinguish among different service requests that come from the same node and to

recognize duplicate packets.

Payload Size. This field holds a number in the range 0…65535, which is the number of

bytes in the packet after the last field of this header. Evidently, in the case of (Fast) Ethernet

the maximum number this field can hold is 1486. In Gigabit Ethernet Jumbo Frames, this

field can contain numbers of up to 8986.
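
A possible in-memory representation of the SAR-3 header, written in C, is sketched below; the field widths follow the descriptions above, but the exact packing, byte order, and the 16-bit encoding of the 1…65536 instance count are assumptions made only for illustration.

/* Illustrative sketch (C) of the SAR-3 header fields described above.
 * Packing, byte order and the width of the Destination Structure fields
 * are assumptions; the field list follows the text and Figure 2.4. */
#include <stdint.h>

enum sar_top {                 /* Type of Packet, Table 2.1 */
    SAR_REQUEST       = 0,
    SAR_REPLY         = 1,
    SAR_REGISTER      = 2,
    SAR_UNREGISTER    = 3,
    SAR_SET_BUSY      = 4,
    SAR_SET_AVAILABLE = 5,
    SAR_INSTANTIATE   = 6,
    SAR_COLLECT       = 7,
    SAR_RP_COMM       = 8
};

struct sar_destination {       /* Destination Structure, Figure 2.4(d) */
    uint16_t job_id;
    uint16_t instance_id;
};

struct sar3_header {
    uint16_t control;          /* bits 0-3: ToP; remaining bits reserved */
    uint16_t num_instances;    /* valid only for Instantiate RP packets  */
    union {
        uint32_t service_address;     /* for requests and registrations  */
        uint32_t request_id;          /* for replies                     */
        struct sar_destination dest;  /* for RP-to-RP communication      */
    } dst;
    uint32_t packet_number;    /* per-node counter; detects duplicates   */
    uint16_t payload_size;     /* bytes following this header            */
};

static inline enum sar_top sar_get_top(const struct sar3_header *h)
{
    return (enum sar_top)(h->control & 0x0F);
}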

2.2.4 OVERHEAD ANALYSIS

Overhead can be classified into two main classes: spatial and temporal. Spatial overhead is the one related to adding physical modules of any kind (e.g. extra circuitry and storage requirements), while temporal overhead is related to the increase in processing time required to perform a task. To implement SAR, two additions are made to the switch: a Service Table (ST) distributed amongst the ports, and a global Remote Procedure Table (RPT). Their sizes are both relative to the number of services provided in the system. The difference

between the tables is that the former is devoted to keeping track of static services, while

the latter’s domain comprises the RPs in the system. Both tables grow linearly with the

number of services of each kind.

All ST fragments in the switch are accessed once per issued request. Thus, there can be at most N simultaneous accesses to each ST, where N is the number of nodes in the system. In the worst case, each port's ST needs to handle a stream of requests N times as fast as the stream allowed by each port's bandwidth. A distributed switch, where ports are grouped in smaller switches in turn interconnected in a hierarchical structure (Figure 2.2), can reduce the previous figure to log N. Nonetheless, with such a hierarchical structure, in order for each port to preserve the property of knowing all services underneath it, its corresponding ST fragment will need to grow exponentially with the level it belongs to. In an environment

where the service address space is not of negligible size, its implementation could become


infeasible or unnecessarily expensive. There is, however, the option of having caches of fixed size in each port.

The temporal overhead incurred by the RPT is similar. The number of requests it will receive at a given time can be as high as N, for there are N ports in the switch. Thus, the requirements of access speed for this table are as high. The same hierarchical switch architecture can be used to reduce the demand to log N.

2.2.5 ROUTING ALGORITHMS

There are three types of node-initiated communications that need to be routed differently:

requests, replies and updates.

Upon reception of a service request, a switch will see if its subnetwork is capable of providing the service. If the service exists within it, the service request will be forwarded to a parent, announcing an estimate of the server's workload that includes the current request; otherwise, the request will be forwarded as received. Workload is simply defined as the number of processes in the work queue. The parent will in turn check if there exists another node that can provide the service faster. This process continues until the service request message reaches a root switch, or until a parent cannot do any better than its child; in either case, a message will be sent to the child telling it to choose its best node for the service. The child will then forward the request through the port that can route to the node with the best execution time.
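
The decision taken at each level can be summarized by a small function; the sketch below (in C) is only an illustration of the rule just described, and the request structure, the st_best() hook, and the action names are hypothetical.

/* Illustrative sketch (C) of the decision a switch makes when a service
 * request climbs the hierarchy. Only the decision rule follows the text. */
#include <stdint.h>

struct sar_request {
    uint32_t service_id;
    uint32_t best_workload;  /* estimate announced so far (includes this request) */
    int      has_estimate;   /* 0 if no subnetwork so far can provide the service */
};

/* Best provider of service_id known to this switch's own downers; returns 1
 * and its workload if one exists, 0 otherwise (would query the port CAMs). */
extern int st_best(uint32_t service_id, uint32_t *workload);

enum route_action { FORWARD_UP, TELL_CHILD_TO_COMMIT, NOT_AVAILABLE };

enum route_action on_request(struct sar_request *req, int is_root)
{
    uint32_t local_load;
    int local = st_best(req->service_id, &local_load);
    int improved = local &&
        (!req->has_estimate || local_load + 1 < req->best_workload);

    if (improved) {
        /* This subnetwork can serve the request faster: take over the
         * estimate, counting the current request as the text prescribes. */
        req->best_workload = local_load + 1;
        req->has_estimate  = 1;
    }

    if (req->has_estimate && (is_root || !improved))
        return TELL_CHILD_TO_COMMIT;  /* root reached, or no improvement here  */
    if (is_root)
        return NOT_AVAILABLE;         /* root reached with no provider at all  */
    return FORWARD_UP;                /* keep climbing, with or without estimate */
}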

A service reply is the confirmation of a service completion. It might be accompanied by return data needed by the requestor. Service replies are routed to the requestor using its physical address, which need not be known by the server.

Every time a node’s workload changes, either due to a new service request arrival or to

its completion, the node sends an Update message to its parent switch announcing its new

workload, its own Node ID and the ID of the Service that will beserved or just finished.

When a downer receives an Update message, three actions are taken: first, it updates the


Field        Description
Service ID   A unique Service identifier used for the search.
Node ID      An identifier of the best processor in the port's subnetwork that can provide the service specified in Service ID.
Workload     Last-known workload of the node specified in Node ID.

Table 2.2: Fields in a Service Table.

estimated workload values for all the entries associated with the same Node; second, the switch compares entries from all its STCs with that specific Service ID in order to estimate a new best node; and lastly, it announces the chosen node, along with its estimated workload and the Service ID, to the parent switches using Update messages forwarded through all upper ports. This procedure repeats until all root switches are reached and all Service Table Caches have been updated.
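
The three steps can be illustrated with a short C sketch operating on the per-downer Service Table Caches; the entry layout follows Table 2.2, while the array sizes, names, and the send_update_to_uppers() hook are assumptions.

/* Illustrative sketch (C) of the three steps taken on an Update message. */
#include <stdint.h>

#define NUM_DOWNERS 4
#define STC_ENTRIES 64

struct stc_entry {
    uint32_t service_id;   /* Service ID (Table 2.2)                          */
    uint32_t node_id;      /* best known provider in this downer's subnetwork */
    uint32_t workload;     /* last-known workload of that node                */
    uint8_t  valid;
};

static struct stc_entry stc[NUM_DOWNERS][STC_ENTRIES];

/* Hypothetical hook: forward an Update message through every upper port. */
extern void send_update_to_uppers(uint32_t service_id, uint32_t node_id,
                                  uint32_t workload);

void on_update(uint32_t service_id, uint32_t node_id, uint32_t workload)
{
    uint32_t best_node = 0, best_load = UINT32_MAX;
    int found = 0;

    for (int p = 0; p < NUM_DOWNERS; p++) {
        for (int e = 0; e < STC_ENTRIES; e++) {
            struct stc_entry *s = &stc[p][e];
            if (!s->valid)
                continue;
            /* Step 1: refresh the estimate of every entry for this node. */
            if (s->node_id == node_id)
                s->workload = workload;
            /* Step 2: re-evaluate the best provider of this Service ID.  */
            if (s->service_id == service_id && s->workload < best_load) {
                best_node = s->node_id;
                best_load = s->workload;
                found = 1;
            }
        }
    }
    /* Step 3: announce the chosen node upwards through all upper ports.  */
    if (found)
        send_update_to_uppers(service_id, best_node, best_load);
}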

2.2.6 KNOWLEDGE DISTRIBUTION TECHNIQUES

The main source of information for the in-network intelligence is the Service Table. In a

hierarchical switch (or any distributed switch), that table must be distributed. Each downer

within a switch contains a fragment. Each entry has specific data that is used to calculate

intelligent routing decisions for resource location and load distribution. Table 2.2 shows the

information found in each entry and a description of each field. Two ways of distributing

the Service Table are used, as proposed in section 2.2.4, leading to two service lookup

algorithms:

• Level-Global Knowledge. In the original knowledge distribution, each downer must store information about every specific service type that exists in its subnetwork. The number of Service Table entries grows exponentially for each higher level of the LCAN, and all switches have a complete view of the subnetwork beneath them. Thus, for a switch in level l, the table in each downer will have T = S·u^(l−1) entries, where S is the maximum number of different services a node can provide, and u is the number of uppers per switch (a brief worked example is given after this list). The advantage of this model is that, when a request arrives at


a switch, it can be unequivocally determined if a service provider can be located in

that subnetwork. This exponential growth of the local tables stops when the number

of entries equals the number of different services in the system.

• Level Caches. In an effort to alleviate the exponential table growth in a system with

many different services, a design is proposed where all ST fragments have the same

number of entries. One advantage of this model is that all switches are the same

size and have the same complexity regardless of the level. The drawback of this

model is that services are more likely to not be found in higher-level switches and

the replacement of entries is more likely to happen, especially for small cache sizes.

It is important to keep the most important entries in the caches; therefore, the least-recently-used method is used to decide which entries are to be replaced. The

algorithm behaves the same as Level-Global Knowledge for the most part; however,

when there is a cache miss, a message is sent through all available downers recur-

sively, asking the children to send updated information about that particular Service.

A request will not wait for the update to take effect; it will instead propagate upwards

until a suitable server is found or the request process is stopped and rescheduled.

Therefore, the update will only benefit later requests.
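
As a brief worked example of the growth under Level-Global Knowledge (with purely hypothetical numbers): if each node can provide at most S = 16 different services and each switch has u = 4 uppers, a downer in a level-3 switch needs T = S·u^(l−1) = 16 · 4^2 = 256 entries; under Level Caches, by contrast, every fragment keeps the same fixed size at every level, at the cost of possible misses.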

2.3 A PROGRAMMING PARADIGM FOR SAR APPLICATIONS

The purpose of this section is to introduce a set of service-oriented programming primitives that can help application development for SAR networks. Clearly, since a SAR system is a set of computers interconnected by an interconnection network, virtually any programming interface that is able to translate between application requests and the appropriate network protocols is a possible candidate; nonetheless, the primitives presented here are especially thought out for applications to take advantage of the simplicity of the Service-Oriented Architecture that SAR delivers.


2.3.1 A SERVICE-ORIENTED PROGRAMMING MODEL

In a Service-Oriented environment, there are two basic operations: Registering and Requesting. The first operation is the process whereby a service signs up with the lookup entity, announcing its availability; it is usually initiated by the servers, as is the case in SAR. The second operation is the process whereby a client application tells a service that it is needed.

In SAR, there are other kinds of activities in which nodes and the network can engage. Following is a list of primitives that correspond to these activities and a brief explanation of their meaning; an illustrative set of declarations for them is sketched after the list.

REPLY

When a server has finished processing, it needs a means to inform the client of this fact. There are two reasons: the client might be waiting in order to continue its activities, and there may be information the server wants to return in response. These are the purposes of this primitive.

GET_REPLY, AWAIT_REPLY

Clients that request services may want to be informed of the time when the service rendering is completed. Additionally, they may want to receive data in return, such as computing results or status information. These primitives provide nonblocking and blocking options for the clients to do this.

BUSY

If a node realizes that its load has reached its capacity, or that it cannot receive any more requests for one or any of its services, there should be a way to inform the network to stop clogging this particular node and to look for alternative servers to fulfill further service requests.

AVAILABLE


This is meant for nodes that realize they can keep receiving service requests, to inform the network of this fact. It is the opposite of the previous primitive.

STOP

This can be used by a node when it wants to stop providing one or all of its services. It

will immediately cause the network to start deregistering all appropriate instances assigned

to that node.

INSTANTIATE

It is used when an application wants to publish a service. The network will start placing copies of the service as requested, and any application will be able to make use of this service in the future. This is a convenient way to deploy shared libraries using SAR. Also, it can be used by applications to request the propagation of Remote Procedures (RPs) so that they can use multiprogramming and multiprocessing, when available, to satisfy their problem-solving needs.

COLLECT

This is used by applications that want to withdraw the services once propagated via INSTANTIATE.

CONNECT, DISCONNECT

When applications want to establish an interactive session with a service, they can use these primitives to set up and tear down such sessions.

SEND, RECEIVE

This is for applications to communicate during interactive sessions.
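
One possible C-style declaration of these primitives is sketched below; the types, signatures, and the inclusion of a Request call are assumptions made for illustration, since the text defines only the semantics of each primitive.

/* Illustrative sketch (C): one possible set of declarations for the SAR
 * primitives listed above. All types and signatures are assumptions. */
#include <stddef.h>
#include <stdint.h>

typedef uint32_t sar_service_t;   /* a Service Address                   */
typedef uint32_t sar_job_t;       /* a Job ID returned by Instantiate    */
typedef uint32_t sar_request_t;   /* handle for an outstanding request   */

/* Server side */
int  sar_reply(sar_request_t req, const void *data, size_t len);
int  sar_busy(sar_service_t svc);            /* stop receiving requests   */
int  sar_available(sar_service_t svc);       /* resume receiving requests */
int  sar_stop(sar_service_t svc);            /* deregister the service    */

/* Client side */
sar_request_t sar_request(sar_service_t svc, const void *args, size_t len);
int  sar_get_reply(sar_request_t req, void *buf, size_t len);    /* nonblocking */
int  sar_await_reply(sar_request_t req, void *buf, size_t len);  /* blocking    */

/* Remote Procedures */
sar_job_t sar_instantiate(sar_service_t rp, unsigned ninstances);
int  sar_collect(sar_job_t job);

/* Interactive sessions */
int  sar_connect(sar_service_t svc);
int  sar_disconnect(int session);
int  sar_send(int session, const void *data, size_t len);
int  sar_receive(int session, void *buf, size_t len);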


Figure 2.5: Process of adapting a solution to SAR

2.3.2 SAR IN CONCURRENT COMPUTING

Taking advantage of Service Address Routing in concurrent computing is, from the user’s

point of view, a straightforward procedure that resembles using other techniques or libraries

(e.g. MPI). The functionality of SAR is hidden behind the user interface, providing all

the benefits of SAR without the user’s knowledge or intervention. Both data parallel and

control parallel programs can take advantage of SAR. The following steps (Figure 2.5) can

be taken:

1. Choose an algorithm for the problem. To take advantage of concurrency, select the

most scalable parallel algorithm that solves the particular problem.

2. Divide the solution into independent, self-contained modules if possible. If control

parallelism (MIMD) is used, the algorithm is then divided into smaller and distinct

functional modules. If data parallelism is used (SPMD), similar functional modules

are created to operate in the distinct problem partitions. With SAR, it is possible to

use any one of the two kinds of parallelism, or a combination of both. In any event,

the communication patterns have to be defined and specified in this step. These modules will become the RPs. If there is no parallelism in the algorithm, the whole algorithm is wrapped in an RP, so that independent problem instances can be solved

concurrently using the SAR network.


Figure 2.6: Skeletons of a Remote Process (left) and a Stub (right)

Figure 2.7: Matrix Multiplication

3. Make a stub that creates the RP instances and binds them with their initial data based

on the problem instance. When the partial results are returned, they are merged into

the final solution.

Communication patterns between instances are indicated through the Send and Receive

primitives. Once the portion of the solution assigned to an instance is computed, it can be

returned to the requestor by a Reply primitive. When all instances of RPs are no longer

needed, they can be destroyed by using a Collect primitive. The skeleton of a stub and that

of an RP are presented in Figure 2.6.

2.3.3 EXAMPLE: MATRIX MULTIPLICATION

Figure 2.7 highlights matrix multiplication's dependency on the dot product operation. In

the data parallel algorithm chosen, there are as many independent dot product operations

as there are rows in A. Initially, each dot product receives both a row of A and a column

of B. One can fix the rows of A, while cyclically shifting the columns of B to align with


the rows in each dot product iteration. If each row of A generates an instance of the dot

product process, the column is passed to the instance that held the column with immediately

lower index, wrapping around at the edge. When a dot product has iterated as many times

as there are columns in B, all the elements of the result row will have been calculated. The elements can be reordered by each thread to form the rows, and the rows can be later

reordered to form the resulting matrix.

Each RP performs the algorithm just described. The stub is a procedure that breaks the

matrices (A into rows, B into columns), instantiates as many RPs as needed, requests the services, and collates the resulting rows to form the C matrix. The pseudocode for both the RP and the stub is as follows:

Matrix_Multiply_RP (P, Q, i, M, N)

Begin

j = i

Do

R[j] = P*Q

j = (j+1) % N

if j == i break loop

Send Q to Instance (i-1)%M of job MY_JOB

Receive Q from job MY_JOB

Loop

Reply R

End

Matrix_Multiply (A, B, M, R, N)

Begin

q = Instantiate Matrix_Multiply_RP times M

For i = 1 to M

req[i] = Request Instance i of Job q on \

Row(A,i), Col(B, i%N), i, M, N

Next i

For i = 1 to M

Row(C, i) = Await_Reply from req[i] of Job q

Next i

Collect q

Return C

End


2.3.4 EXAMPLE: 2D FAST FOURIER TRANSFORM

The 2D Fast Fourier Transform (FFT) can be calculated by applying the 1D FFT to all rows

of the input matrix, and then again to all the columns of the resulting matrix. Alternatively,

it can be achieved by applying the FFT to the rows, transposing, applying FFT to the rows

again, and finally transposing once more. Assuming that a 1D FFT procedure is already

available, the 2D FFT can be described by the following pseudocode:

For i = 1 to N

FFT Row(A, i)

Next i

Transpose A

For i = 1 to N

FFT Row(A, i)

Next i

Transpose A

The last transposition step can be skipped and performed in the stub, if it is more

convenient. The following pseudocode shows how that procedure can be implemented

using our SAR primitives:

FFT_2D_RP(P, N)

Begin

FFT P

For i = 1 to N

Send P[i] to Instance i of job MY_JOB

Receive P[i] from Instance i of job MY_JOB

Next i

FFT P

Reply P

End

FFT_2D(A, N)

Begin

q = Instantiate FFT_2D_RP times N

For i = 1 to N

req[i] = Request Instance i of Job q on Row(A, i)

Next i


Figure 2.8: LU-Decomposition

For i = 1 to N

Col(B, i) = Await_Reply from req[i] of Job q

Next i

Collect q

Return B

End

2.3.5 EXAMPLE: LU-DECOMPOSITION

Even though this is a purely sequential procedure, SAR can be used to provide concurrency for solving more complex problems (e.g. differential equations). As mentioned previously, there is no need for parallel algorithms to exploit concurrency. In this example, an LU-Decomposition RP is created, so that any number of instances can be solved with as much concurrency as the node count and system load allow.

LU-Decomposition (Figure 2.8) factors a matrix (A) into two triangular matrices (L and

U):

A = LU, \qquad a_{ij} = \sum_{k=1}^{j-1} l_{ik} u_{kj}

The typical algorithm works by first assigning ones to the main diagonal of L, and

then producing and solving the equations for the remaining elements of L and those of U

according to Crout’s method[30]:

l_{ii} = 1, \qquad l_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}}{u_{jj}}


u_{1j} = a_{1j}, \qquad u_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj}

The implementation used calculates L and U in a single resulting matrix, where the upper triangle corresponds to U and the lower triangle corresponds to L. Due to the fact that the main diagonal of L is filled with ones, the main diagonal in the result corresponds to that of U. A routine can be written to calculate both factors, and then be made into an RP as follows:

LU_Decompose_RP(A, N)

Begin

For k = 1 to N - 1

For j = k + 1 to N

A[k][j] = A[k][j] / A[k][k]

For i = k + 1 to N

A[i][j] = A[i][j] - A[i][k] * A[k][j]

Next i

Next j

Next k

Reply A

End

Lastly, the stub would just have to spawn a new RP each time a new decomposition is

needed, and retrieve the results to present them in a user-understandable format. Thus, the

stub code could be as follows:

LU_Decompose(A, N)

Begin

q = Instantiate LU_Decompose_RP

req = Request Instance 1 of Job q on A, N

LU = Await_Reply from req of Job q

Collect q

Return LU

End


2.3.6 SERVICE ADDRESS ROUTING PROGRAMMING INTERFACE

In previous sections we defined the programming model, explaining that the Service-Oriented paradigm was preferred because we are, in fact, using a Service-Oriented architecture overall. In this section, that model is brought down to a specific interface that can be implemented easily. We start by establishing comparison patterns with the programming interface for Inter-Process Communications (IPC) in UNIX and arrive at the conclusion that, with only a few extensions, it is as fit an interface as a specifically tailored one would be, with the added advantage that it is not only a widely known interface, but also simple and used by many applications. Moreover, it is, in fact, a Service-Oriented interface.

THE SOCKET INTERFACE BASICS

Invented and first implemented as part of the BSD Inter-Process Communications (IPC)

mechanisms, sockets implement the service-oriented programming paradigm in a very flex-

ible and yet extraordinarily simple programming interface. The other IPC mechanisms

include signals, arenas (shared memory), pipes and named pipes (fifos). However many,

all of these mechanisms by themselves only serve for processes communicating within a

single host. Sockets, nonetheless, allow for processes in different machines to commu-

nicate across networks via a bidirectional pipe that hides all necessary network protocol

implementation, data verification and packet reordering.

The socket interface provides functions, constants and protocols. The way it is implemented, it is an interface that connects on one end to the applications by means of the socket API, and on the other end to the network protocol implementations through the low-level network protocol interface internal to the OS kernel.

Following UNIX tradition, sockets work like files from the user's perspective. This means that they can be opened, read from, written to, and closed. In a nutshell, an application requests a service from a “server”. What happens afterwards depends on the type


Figure 2.9: Correspondence between the UNIX socket interface and the proposed SAR programming model

What happens afterwards depends on the type of communication in effect: it can be connectionless or connection-oriented. In the first case, which is the simplest, the server receives one bulk of data at a given time; there is no particular ordering in the reception of such bulks of data. After the server has processed the request, it may optionally send a response message back to the requestor following the same scheme. Although communications can be carried out in this manner indefinitely, they are considered unreliable, for no algorithms are in place to ensure reliable delivery of packets. In connection-oriented communications, however, reliable delivery is ensured, and connections are established so that arbitrary bidirectional messaging can take place between the requestor and the server. This scheme is appropriate for many situations, but its overhead makes it inappropriate for some others.
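As a reminder of what the two styles look like at the API level, the fragment below uses ordinary UDP and TCP sockets (not yet the SAR extensions); it is only an illustration of the calling patterns just described.

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

void request_once(struct sockaddr_in *srv)
{
    char req[64] = "request", rep[64];

    /* Connectionless: one datagram out, optionally one back; no delivery guarantee. */
    int u = socket(AF_INET, SOCK_DGRAM, 0);
    sendto(u, req, sizeof req, 0, (struct sockaddr *) srv, sizeof *srv);
    recvfrom(u, rep, sizeof rep, 0, NULL, NULL);
    close(u);

    /* Connection-oriented: establish a reliable bidirectional pipe, then exchange freely. */
    int t = socket(AF_INET, SOCK_STREAM, 0);
    connect(t, (struct sockaddr *) srv, sizeof *srv);
    send(t, req, sizeof req, 0);
    recv(t, rep, sizeof rep, 0);
    close(t);
}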

CORRESPONDENCE BETWEEN SOCKETS AND SAR

The basic operations in the SAR paradigm are (1) Registering with the network, (2) Requesting a service, (3) Replying to a requestor and (4) De-registering from the network. As


is shown in Figure 2.9, there is a direct correspondence between these activities and the functions provided by the socket interface. The following is a list of the semantics needed for a SAR interface, and an explanation of how to use the socket interface to achieve them.

Registering a service can be achieved by means of the bind() system call. The purpose of this system call is to assign a service or port to a particular application, so that messages addressed to that port are delivered to the calling application. In other words, it assigns a serving application to a service identifier. By allowing this system call to register services with the network, we are not altering its intended semantics. Together with listen() and accept(), a SAR server can be set up in a very small number of steps, just as a TCP/IP or UDP/IP server would be.

Requesting a service can be implemented as a send()/recv() pair of system calls, at the requestor and the server respectively. The server, once bound to a port or service, can block in a recv() call waiting for clients. Clients can then send messages (requests) via the send() call.

Replying to a requestor is also a send()/recv() type of activity. The requestor can either block waiting for a response or intermittently check for its arrival. The easiest way to differentiate between a service port (one bound by a server to receive service requests) and a request port (one established by the system for the requestor to receive the server's reply) is to divide the port address space into two independent ones.

Deregistering a service is a simple operation that deletes a service provision instance from all tables, so that requests can no longer be directed to that particular server. This is easily achieved by the close() system call. Its original purpose is to disconnect a socket from its associated entities; it is therefore natural to assign it deregistering semantics as well.


COMPLEMENTING THE SAR APPLICATION INTERFACE

The interface as described so far is enough to build a functional SAR system. However, there are other desired characteristics, listed in a previous section, that will help make the network more complete and useful for concurrent computing, as already exemplified. The following list explains how these primitives can also be implemented using the socket interface.

Setting a node or service as busy/available describes the state of one or all services. For this reason, it is natural to specify it as a socket option that can be set and unset via the setsockopt() system call.
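For illustration only, marking a registered service busy through such an option could look like the fragment below; SOL_SAR and SAR_SO_BUSY are hypothetical names, since this work does not fix the option constants.

#include <sys/socket.h>
#include <netsar/sar.h>

/* SOL_SAR and SAR_SO_BUSY are hypothetical names for the proposed option. */
void set_busy(int s, int busy)
{
    setsockopt(s, SOL_SAR, SAR_SO_BUSY, &busy, sizeof busy);
}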

Instantiating a series of Remote Processes is an operation that can involve creating multiple processes in as many distinct nodes, and creating sockets so that they can communicate to solve the task at hand. This is an involved procedure that cannot be attributed to any system call already present, so the addition of an instantiate() system call is proposed for this purpose. Similarly, a close() operation is too specific for us to expect it to destroy the Remote Processes. Adding it such semantics would have disadvantages, including losing the ability to close a particular socket in the set, and over-complicating a very lean and simple system call. For these reasons, a new system call, collect(), is proposed as the counterpart for instantiate().
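The exact signatures are left open by this proposal. Purely to fix ideas, one plausible shape for the pair is sketched below; every name and type in it (sar_addr_t, sar_job_t, the parameters) is hypothetical.

#include <netsar/sar.h>

/* Hypothetical prototypes for the proposed calls: instantiate() would create
   'count' Remote Processes offering the service 'rp_service' and return a
   handle to the set in '*job'; collect() would destroy that set and release
   its sockets and table entries. */
int instantiate(sar_addr_t rp_service, int count, sar_job_t *job);
int collect(sar_job_t job);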

Interactive communications are commonplace in the socket interface. Thus, functions for establishing them and carrying them out are present and well known. These system calls can be adapted to the SAR protocols easily. They include connect(), close(), read(), write(), send(), recv(), and some others.

2.4 INTELLIGENT NETWORKS ON CHIP

Service-Address Routing is a communications mechanism that can be implemented at different scales, depending on the application. Among other things, SAR can be used for hot-


reconfigurable hardware systems, standardized component interfacing, computer clusters, grid computing, and many other systems. In the previous section, we proposed an implementation for computer clusters; now, we will sketch an implementation at a different scale: Networks on Chip. Such a system is called an Intelligent Network on Chip (INoC). In this section we will develop a circuit-switched INoC LCAN switch.

The intelligence of the network is implemented in the switches. Conceptually, all

switches are identical, although for implementation purposes they can be customized for

capacity, simplicity and/or speed, according to the level.

2.4.1 OVERVIEW

The process of building the circuit has three phases: up, down and back. The up phase starts when the originating IP launches a REQUEST (QUE) signal, and it lasts as long as a server cannot be found. When a switch is in this phase, it just forwards the query through the next available upper. The down phase starts when a switch first finds a provider that is available. Then, the QUE command is forwarded through the downer that provides the path to that provider. When it reaches the provider, the provider starts the back phase by acknowledging the request and sending a CONNECT (CON) into the network, which traverses the same switches in reverse order. When the originating IP receives a CON primitive, the circuit is completed and ready to be used. If at any moment while building the circuit the search cannot continue, the switch that notices it generates a RELEASE (REL) command that propagates backwards through the incomplete circuit, cancelling the process and making network resources available.

Inside the switch are three kinds of modules (Figure 2.10): a central crossbar; a Service Table (ST) that maps services to downers; and several Link Interfaces (LI), one for each upper and downer. The crossbar can be pruned to disallow upper-to-upper connections, because they are not possible in the routing scheme of an LCAN; the same holds for self-connections. Further, the crossbar is enhanced to produce two signals (OK and NG) that


indicate to both ends that a requested connection could or could not be established.

When an LI, either upper or downer, receives a query (QUE) command from its external connection, it consults the Service Table (ST) to see whether the requested service can be accessed through any of this switch's downers. If so, the LI requests a connection through the crossbar to that downer; otherwise, it requests a connection to the next available upper. Note that if the requesting LI and the requested LI are both uppers, no such crosspoint exists in the crossbar, which will indicate NG back to the requesting LI. If the request is valid but the requested LI is in use, the crossbar will also indicate NG; otherwise, it will make the connection and indicate OK. If the requesting LI receives NG from the crossbar, it returns a release (REL) command on its external connection.

Except for downers at the leaf level and uppers at the root level, an LI is always connected through a point-to-point link to an LI in another switch. So, if a QUE command is received from within the switch (through the crossbar), the LI just forwards the QUE to its peer. Two LIs are peers if they share a point-to-point connection.

2.4.2 THE SERVICE TABLE (ST)

The Service Table (ST) is separated into modules, each corresponding to a downer of the

switch (Figure 2.11). Each ST module contains ROM, PLA, and/or CAM circuitry that

responds to the Service Address(es) associated with its downer.

In operation, only one LI at a time will process an external QUE command. At that time, the LI applies the requested Service Address to the ST's Query bus. Each ST module determines whether its downer serves that address, asserting or negating its input to the Address Generator (Figure 2.12). The latter is simply a prioritizing circuit that selects one of the asserted lines, asserts the output corresponding to that input, and negates the rest of the outputs. This produces a Linear Address in which only one of D (number of downers) bits may be asserted.

Similarly, the linear address of the next available upper is generated by applying the


Figure 2.10: Overview of an INoC Switch

Figure 2.11: The Service Table

Figure 2.12: The Address Generator


BUSY signals from each upper to another address generator.
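In software, that prioritizing behavior can be modelled by isolating a single asserted bit of the assertion vector. The function below is only an illustration of the one-hot selection (here the lowest asserted line wins, though the actual priority order is a design choice), not the circuit itself.

#include <stdio.h>

/* Model of the Address Generator: given a vector with one bit per downer
   (or per upper, for the BUSY-driven generator), asserted when that line
   qualifies, return a Linear Address with at most one bit set. */
unsigned address_generator(unsigned asserted)
{
    return asserted & (~asserted + 1u);   /* keep only the lowest asserted line */
}

int main(void)
{
    printf("%#x\n", address_generator(0x0));   /* nothing asserted -> 0          */
    printf("%#x\n", address_generator(0x6));   /* lines 1 and 2 asserted -> 0x2  */
    printf("%#x\n", address_generator(0x8));   /* only line 3 asserted -> 0x8    */
    return 0;
}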

2.4.3 THE LINK INTERFACE (LI)

In the case of an external QUE, the purpose of the LI is to search the ST, set up the crossbar and forward the QUE across the crossbar to another LI. In case of an error, it is also responsible for generating a REL command back to its peer. In the case of an internal QUE (from another LI across the crossbar), it simply forwards the QUE to its peer.

Figure 2.13 shows the Link Interface and its connections to its peer, to the ST via its

Service Table Interface (STI), and to the crossbar via its Switch Interface (SI).

THE SERVICE TABLE INTERFACE

The STI (Figure 2.14) is the piece of logic that connects the Port's main logic to the Service Table. When it gets a QUE from the Logic, it applies the requested Service Address to the ST's Query bus. The ST replies with either the linear address of an appropriate available downer, or with the address of the next available upper. If there is neither, the ST returns all zeros.

If the downer address is non-zero, it gets forwarded back to the Logic; otherwise, the next available upper's address gets forwarded. The bits of the address sent to the Logic are all ORed together; the result indicates to the Logic whether the service was found or not.

THE SWITCH INTERFACE

The SI (Figure 2.15) is the circuitry that connects the Port's logic to the crossbar. It receives the following commands from it: QUE (with service address), CONNECT (CON), or REL. The action then depends on the command received:

• QUE. The SI saves the address in a register and sets up the crossbar to connect to

the LI whose address was indicated by the Logic. If the crossbar returns NG, then


Figure 2.13: The Link Interface

Figure 2.14: The Service Table Interface

Figure 2.15: The Switch Interface


the SI sends a REL command back to the Logic. If the crossbar returns OK, the SI

sets a Connection Flag to keep that connection alive and sends a QUE (with service

address) command through it.

• CON. There is already an address saved in the Destination Register. The SI sets up

the crossbar to that address and sends a CON command through that connection. If

the crossbar returns NG, then the Connection Flag is cleared and a REL command is

sent back to the Logic.

• REL. The command gets forwarded through the switch.

Alternatively, the SI can receive connections initiated across the crossbar. This is the

procedure followed in this case:

1. The crossbar sends a signal (SIG) to the SI to indicate that another SI is trying to

communicate with it. The circuit wires are loaded with the address of the initiating

SI and the command to be performed: QUE or CON.

2. The SI saves the address in the Destination Register and forwards the command to

the logic.

3. The case of a REL command is a special one, because there already is a connection.

One of the wires activates that command, and causes the SI to forward the REL

command to the Logic and to drop the connection with the crossbar.

THE PORT'S LOGIC

Depicted in Figure 2.16, this is what controls the port's data and control flow; the other modules are simply interfaces. If the Logic receives a QUE from the peer, the QUE gets forwarded to the STI. If the service is found, the QUE then gets forwarded to the SI; otherwise, a REL command is sent to the peer. However, if the QUE is received from the SI instead, it gets


Figure 2.16: The Link Interface Logic

Figure 2.17: The Link Interface Put Together


Figure 2.18: The INoC LCAN Switch's Mutilated Crossbar

forwarded to the peer. If a CON or REL is received from either the peer or the SI, it gets

forwarded to the opposite end.

Figure 2.17 shows the three blocks of the LI put together. Note that only the uppers have

the BUSY line from their SI going to the ST.

2.4.4 THE CROSSBAR

In Figure 2.18, we describe how the crossbar for this INoC LCAN switch is implemented. It is essentially a regular crossbar switch modified to allow only those connections valid in an LCAN switch; thus the following restrictions and additions apply:

1. No self-connections are allowed.

2. No connections are allowed between two uppers.

3. It indicates back to the initiator whether the requested connection is available (OK) or already in use (NG).

4. It indicates to the receiving end that a connection is being held (SIG).


CHAPTER 3

EXPERIMENTS AND ANALYSES

Throughout this document, many things have been claimed and many others proposed. The purpose of this chapter is to identify those items that have been introduced in previous chapters without a better justification than a design choice, and to provide a quantitative or qualitative analysis to support them, along with the hypotheses presented in Chapter One. The following section presents the layout of this chapter by enumerating these items and showing how and where each issue is addressed.

3.1 RATIONALE

Service-Address Routing is a vast new field that opens a plethora of possibilities. There is much analysis to perform and many questions to answer. The main goal of this work is to introduce SAR, as well as to present a system architecture that takes advantage of its most basic and intrinsic capabilities to aid high-performance computing using tightly coupled distributed systems. This chapter analyzes each of the following aspects in the manners described, in order to provide a solid basis for the system architecture laid down throughout this work. The following subsections elaborate on how we expect to achieve this.

3.1.1 A FULLY-DISTRIBUTED SERVICE-ADDRESSED ARCHITECTURE

The first hypothesis, and the most important, is the one related to Service Address Routing: the choice of addressing services instead of nodes. Section 3.3 focuses on this issue by comparing Jini (which can be considered an application-level implementation of a different Service-Address architecture) against a hypothetical architecture where each


node performs the translation between service and node addresses internally (without recurring to a centralized lookup service). This architecture, nicknamed "SAR-Middleware", is further explained in Section 3.3.1.

3.1.2 SAR IMPLEMENTATION WITHIN THE NETWORK FABRIC

This is the second hypothesis: to implement the service-to-node translation in the network fabric itself (i.e., inside the switch's firmware), as the final step of the two-step process of converting a traditional network into a SAR network. To analyze its effect, Section 3.3 contains a set of experiments that compare the "SAR-Middleware" architecture against the SAR architecture presented in this work.

3.1.3 THE PROPOSED SAR PROGRAMMING PARADIGM

There is barely any quantitative way to measure the benefits of one programming paradigm over another. Hence, the analysis of this programming paradigm is performed by comparing similar code snippets that solve the same problem using alternative systems, such as UNIX RPC, Java RMI and MPI. This is done in Section 3.2.

3.1.4 THE PROPOSED SAR SWITCH IMPLEMENTATION

A switch implementation has to be justified using a cost/performance analysis. It is not

sufficient to make it perform faster; the cost of the enhancement has to be worth it. And if

the implementation is cheap, it still has to perform at a level acceptable for its cost bracket.

Section 3.4 presents such an analysis.

3.1.5 THE USE OF FIXED-LENGTH SERVICE TABLE CACHES

The Level Caches architecture was proposed as a cheap alternative to the exponentially-

growing Level-Global Knowledge. Their behavior is simulated and their performance is


analyzed in section 3.3.

3.1.6 THE USE OF LIMITED QUEUES

Allowing a processor to queue as many processes as it wants, without a hard limit, is a pitfall that can unbalance the system and degrade overall performance. However, if many processors reject service requests based only on their queue

sizes, that can produce a significant amount of excess network traffic. This could lead to

a good deal of avoidable CPU idle time. This is simulated in section 3.3, in the set of

experiments titled “Limiting the Queue Size”.

3.1.7 THE USE OF AN LCAN IMPLEMENTATION OF BIG SWITCHES

The use of a hierarchical structure was proposed as a means to create a scalable design for arbitrarily big networks. The LCAN architecture was selected for

that purpose. Section 3.3 presents a series of experiments under the label “Composite” that,

along with the cost/performance analysis in section 3.4, justify these choices.

3.2 ANALYSIS OF SAR’S PROGRAMMING MODEL

The purpose of this section is to show that, using the simple programming paradigm and interface introduced in section 2.3, the user can harness the potential of SAR without wasting time and effort learning an arcane programming interface. We will start by showing a very simple, well-known problem, and how the process of designing and implementing a solution works in SAR. We will then repeat this process using other client-server technologies, such as UNIX RPC, Java's RMI, Jini and MPI. Finally, a section follows where we discuss our impressions from this exercise.


3.2.1 TCP COMPLEXITY ANALYSIS

There is a rule of thumb that, for each gigabyte per second of bandwidth a node needs to support, the processor must have one more GHz of clock rate just to process the TCP headers in real time. Even though this metric is as loose as it can be, currently available off-the-shelf microprocessors perform similarly enough to make the use of such units feasible.

The Transmission Control Protocol (TCP) is a transport-layer protocol that sits on top of IP. It is used by the Internet to guarantee reliable message delivery. Inside its implementation, one can find that over 140 functions are called from within the protocol input function. If we compare that against the 30 used by UDP, we notice a difference of a factor of almost 5. The reason is that most of these functions are meant to solve problems that are not applicable to tightly-coupled cluster systems; instead, these complex problems are mainly encountered when a protocol is routed through virtually any number of different networks.

One would not like to use UDP for these systems either, because it provides unreliable communications without any delivery guarantees. However, it is so simple a protocol that it represents a good measure of the lower bound for any transport-layer protocol. The implementation proposed in the previous chapter, by addressing only those problems likely to be encountered in the systems under study, incurs an overhead close to that imposed by UDP.

Furthermore, by not requiring any node addressing scheme, one can do without the IP protocol altogether and implement SAR right on top of the Data Link layer. If we consider that the processing of IP packets is even more expensive than the processing of UDP packets, we can see that an important piece of overhead will no longer be imposed on the communications.


3.2.2 IMPLEMENTING A PARALLEL ALGORITHM USING SAR

The matrix multiplication algorithm was discussed in section 2.3.3. The implementation strategy shown there uses Remote Procedures, which is an advanced feature of SAR. In this section, we will implement the algorithm using simple non-interactive service requests. The assumptions are that a dot product service is running on at least one node and that we know the address of that service. The main philosophy will be the same: one matrix will be transposed, row vectors will be sent to the servers, and the resulting matrix cell values will be returned. Note the assumption that replies received through the same socket are received by the application in the order their respective requests were made. Being the native language for UNIX programming, C will be used for the SAR version.

CLIENT IMPLEMENTATION

#include <sys/socket.h>
#include <netsar/sar.h>

void Transpose_Matrix( int m, int n, double A[m][n], double B[n][m] )
{
    for( int i = 0; i < m; ++i )
        for( int j = 0; j < n; ++j )
            B[j][i] = A[i][j];
}

void Matrix_Multiply(
    int m, int r, int n,
    double A[m][r],   // First matrix to multiply
    double B[r][n],   // Second matrix to multiply
    double C[m][n]    // Matrix to store the result
)
{
    // Some variables we'll be needing
    int s;
    int i, j;
    double Bprime[n][r];
    struct sockaddr_sar Server_Address;

    // Prepare the service address structure
    Server_Address.ssar_len = sizeof Server_Address;
    Server_Address.ssar_family = AF_SAR;
    Server_Address.ssar_ssar_addr = ADDRESS_FOR_DOT_PRODUCT;

    // Prepare the socket
    s = socket( PF_SAR, SOCK_SEQPACKET, AF_SAR );
    connect( s, (struct sockaddr *) &Server_Address,
             Server_Address.ssar_len );

    // Transpose the second matrix
    Transpose_Matrix( r, n, B, Bprime );

    // Send the vectors
    for( i = 0; i < m; ++i )
        for( j = 0; j < n; ++j )
        {
            send( s, A[i], sizeof(A[i]), 0 );
            send( s, Bprime[j], sizeof(Bprime[j]), MSG_EOR );
        }

    // Receive the results
    for( i = 0; i < m; ++i )
        for( j = 0; j < n; ++j )
            recv( s, &C[i][j], sizeof (double), MSG_WAITALL );

    // Cleanup
    close( s );
}

SERVER IMPLEMENTATION

#include <sys/socket.h>
#include <netsar/sar.h>

void Dot_Product_Service()
{
    // Some variables we'll be needing
    int s, i, n, k;
    double A[100], *B, C;
    struct sockaddr_sar Server_Address, Reply_Address;
    socklen_t Reply_Len;

    // Prepare the service address structure
    Server_Address.ssar_len = sizeof Server_Address;
    Server_Address.ssar_family = AF_SAR;
    Server_Address.ssar_ssar_addr = ADDRESS_FOR_DOT_PRODUCT;

    // Prepare the reply address structure
    Reply_Address.ssar_len = sizeof Reply_Address;
    Reply_Address.ssar_family = AF_SAR;
    Reply_Address.ssar_ssar_addr = 0;

    // Create the socket and register the service
    s = socket( PF_SAR, SOCK_SEQPACKET, AF_SAR );
    bind( s, (struct sockaddr *) &Server_Address, Server_Address.ssar_len );
    listen( s, 5 );

    // Wait on, and process requests
    while( 1 )
    {
        Reply_Len = sizeof Reply_Address;
        k = recvfrom( s, &A[0], 100*sizeof (double), MSG_WAITALL,
                      (struct sockaddr *) &Reply_Address, &Reply_Len );

        n = k / (2*sizeof (double));
        B = &A[n];
        C = 0;

        for( i = 0; i < n; ++i )
            C += A[i] * B[i];

        sendto( s, &C, sizeof C, MSG_EOR,
                (struct sockaddr *) &Reply_Address, Reply_Len );
    }
}


3.2.3 UNIX RPC IMPLEMENTATION

Programmatically speaking, the main conceptual difference between SAR and UNIX RPC

is that UNIX RPC requires the client to know exactly which host it will connect to, and

to be able to get its physical address. The UNIX RPC implementation will also use the C

language, thus it will look somewhat similar:

CLIENT IMPLEMENTATION

#include <rpc/rpc.h>
#include <rpc/xdr.h>
#include <rpc/types.h>
#include <netconfig.h>

void Transpose_Matrix( int m, int n, double A[m][n], double B[n][m] )
{
    for( int i = 0; i < m; ++i )
        for( int j = 0; j < n; ++j )
            B[j][i] = A[i][j];
}

void Matrix_Multiply(
    int m, int r, int n,
    double A[m][r],   // First matrix to multiply
    double B[r][n],   // Second matrix to multiply
    double C[m][n]    // Matrix to store the result
)
{
    // Some variables we'll be needing
    int i, j, k;
    double arg[2*r+1];
    double Bprime[n][r];
    char* Servers[] = { "Server 1", "Server 2", /* ..., */ "Server k" };
    int nServers = sizeof Servers / sizeof Servers[0];

    // Transpose the second matrix
    Transpose_Matrix( r, n, B, Bprime );

    // Send the vectors
    for( i = 0; i < m; ++i )
        for( j = 0; j < n; ++j )
        {
            arg[0] = r;
            for( k = 1; k <= r; ++k )
            {
                arg[k]   = A[i][k-1];
                arg[r+k] = Bprime[j][k-1];
            }

            rpc_call( Servers[(i*2+j) % nServers], DOT_PRODUCT_PROGRAM,
                      DOT_PRODUCT_VERSION, DOT_PRODUCT_FUNCTION,
                      xdr_vector, (char *) arg, xdr_double, (char *) &C[i][j],
                      NULL );
        }
}

SERVER IMPLEMENTATION

#include <rpc/rpc.h>
#include <rpc/xdr.h>

XDR* Dot_Product_Callback( double* A )
{
    int i, r = (int) A[0];
    static double Result;
    double* B = &A[r];

    Result = 0;
    for( i = 1; i <= r; ++i )
        Result += A[i] * B[i];

    return (XDR*) &Result;
}

void Dot_Product_Service()
{
    registerrpc( DOT_PRODUCT_PROGRAM, DOT_PRODUCT_VERSION,
                 DOT_PRODUCT_FUNCTION, Dot_Product_Callback,
                 xdr_vector, xdr_double );

    svc_run();
}

3.2.4 JAVA RMI IMPLEMENTATION

Java RMI is a variant of UNIX RPC tailored specifically for the Java programming language. Conceptually, it implements the same idea, which is the invocation of a remote procedure, sending arguments and receiving values upon return. However, the interface to Java RMI is friendlier and more flexible than that of UNIX RPC. Besides the client and server implementations, this time we also need to define the interface through which those two entities will communicate. Evidently, this implementation uses the Java programming language; it is a little longer, but easier both to write and to understand.

INTERFACE DESCRIPTION

import java.rmi.*;

public interface Dot_Product extends Remote {
    public double Multiply( double A[], double B[] )
        throws RemoteException;
}

CLIENT IMPLEMENTATION

import java.rmi.*;
import java.rmi.server.*;

public class Matrix {
    double[][] M;
    int m, n;

    public int Rows() { return m; }
    public int Columns() { return n; }
    public double[] Row( int i ) { return M[i]; }
    public double Get( int i, int j ) { return M[i][j]; }
    public void Set( int i, int j, double v ) { M[i][j] = v; }

    public Matrix( int _m, int _n )
    {
        M = new double[_m][_n];
        m = _m;
        n = _n;
    }

    public Matrix Transpose()
    {
        Matrix Result = new Matrix( n, m );

        for( int i = 0; i < m; ++i )
            for( int j = 0; j < n; ++j )
                Result.Set( j, i, M[i][j] );

        return Result;
    }

    Matrix Multiply( Matrix B )
    {
        // Some variables we'll be needing
        Matrix Result = new Matrix( Rows(), B.Columns() );
        String[] Servers = { "Serv1", "Serv2", /* ..., */ "Servk" };
        int k = Servers.length;
        Matrix Bprime = B.Transpose(); // Transpose the second matrix
        System.setSecurityManager( new RMISecurityManager() );

        try {
            for( int i = 0; i < m; ++i )
                for( int j = 0; j < B.Columns(); ++j )
                {
                    Dot_Product ro = (Dot_Product)
                        Naming.lookup( Servers[(i*2+j) % k] );

                    // Invoke the Remote Method
                    double[] r1 = M[i];
                    double[] r2 = Bprime.Row( j );
                    double Value = ro.Multiply( r1, r2 );
                    Result.Set( i, j, Value );
                }
        } catch( Exception e ) {} // Some error occurred

        return Result;
    }
}

SERVER IMPLEMENTATION

import java.rmi.*;
import java.rmi.server.*;
import java.rmi.registry.*;

public class Server extends UnicastRemoteObject implements Dot_Product
{
    public Server( int i ) throws RemoteException
    {
        super();
        System.setSecurityManager( new RMISecurityManager() );

        try {
            Naming.rebind( "Serv" + i, this );
        } catch( Exception e ) {} // Some error occurred
    }

    public double Multiply( double A[], double B[] )
        throws RemoteException
    {
        double Result = 0;

        for( int i = 0; i < A.length; ++i )
            Result += A[i] * B[i];

        return Result;
    }
}

3.2.5 JINI IMPLEMENTATION

The Jini implementation of the distributed matrix multiplication operation is the most convoluted of the set. This is to be expected, as it relies on the Java RMI platform.

INTERFACE DESCRIPTION

import java.rmi.*;

public interface Dot_Product extends Remote {
    public double Multiply( double A[], double B[] )
        throws RemoteException;
}

CLIENT IMPLEMENTATION

import net.jini.core.entry.Entry;
import net.jini.core.lookup.ServiceTemplate;
import net.jini.core.lookup.ServiceRegistrar;
import net.jini.core.discovery.LookupLocator;
import net.jini.lookup.entry.Name;
import java.rmi.RMISecurityManager;

public class Matrix {
    double[][] M;
    int m, n;

    public int Rows() { return m; }
    public int Columns() { return n; }
    public double[] Row( int i ) { return M[i]; }
    public double Get( int i, int j ) { return M[i][j]; }
    public void Set( int i, int j, double v ) { M[i][j] = v; }

    public Matrix( int _m, int _n )
    {
        M = new double[_m][_n];
        m = _m;
        n = _n;
    }

    public Matrix Transpose()
    {
        Matrix Result = new Matrix( n, m );

        for( int i = 0; i < m; ++i )
            for( int j = 0; j < n; ++j )
                Result.Set( j, i, M[i][j] );

        return Result;
    }

    Matrix Multiply( Matrix B )
    {
        // Some variables we'll be needing
        Matrix Result = new Matrix( Rows(), B.Columns() );
        Matrix Bprime = B.Transpose(); // Transpose the second matrix
        System.setSecurityManager( new RMISecurityManager() );

        try {
            LookupLocator Lookup =
                new LookupLocator( Some_Well_Known_Address );
            ServiceRegistrar Registrar = Lookup.getRegistrar();
            Entry[] Server_Attributes = new Entry[1];
            Server_Attributes[0] = new Name( "Dot_Product" );
            ServiceTemplate Template =
                new ServiceTemplate( null, null, Server_Attributes );

            for( int i = 0; i < m; ++i )
                for( int j = 0; j < B.Columns(); ++j )
                {
                    Dot_Product ro = (Dot_Product)
                        Registrar.lookup( Template );

                    // Invoke the Remote Method
                    double[] r1 = M[i];
                    double[] r2 = Bprime.Row( j );
                    double Value = ro.Multiply( r1, r2 );
                    Result.Set( i, j, Value );
                }
        } catch( Exception e ) {} // Some error occurred

        return Result;
    }
}

SERVER IMPLEMENTATION

import net.jini.core.entry.Entry;
import net.jini.core.lookup.ServiceID;
import net.jini.lookup.entry.Name;
import com.sun.jini.lookup.ServiceIDListener;
import com.sun.jini.lookup.JoinManager;
import com.sun.jini.lease.LeaseRenewalManager;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.RMISecurityManager;
import java.rmi.server.UnicastRemoteObject;

public class Server extends UnicastRemoteObject
    implements Dot_Product, ServiceIDListener
{
    private ServiceID ID;

    public Server( int i ) throws RemoteException {}
    public void serviceIDNotify( ServiceID _id ) { ID = _id; }

    public double Multiply( double A[], double B[] )
        throws RemoteException
    {
        double Result = 0;

        for( int i = 0; i < A.length; ++i )
            Result += A[i] * B[i];

        return Result;
    }
}

3.2.6 MPI IMPLEMENTATION

It was discussed in section 1.4.7 how MPI is a message-passing facility with very specific goals, and how PVM is another such facility with different goals. However, the programmer's interfaces of the two platforms are similar; in fact, one can find much correspondence between their functions and types. In this section, we will only show an MPI implementation of the matrix multiplication process, for three main reasons: MPI is more widely adopted; it is an international standard; and finally, by showing an MPI implementation, a PVM counterpart becomes evident. It must be noted that neither MPI nor PVM uses the Client/Server model; hence, in this case, the implementation consists of a single program:

#include <mpi.h>

int nProcesses;
int My_ID;

// This program assumes:
// 1. MPI_Init() was already invoked.
// 2. That the number of processes is already determined.
// 3. That each process knows its process id.
// 4. That the slave processes are inside Dot_Product().
// 5. That the master process invokes Matrix_Multiply().
// 6. That MPI_Finalize() will be invoked when appropriate.

void Transpose_Matrix( int m, int n, double A[m][n], double B[n][m] )
{
    for( int i = 0; i < m; ++i )
        for( int j = 0; j < n; ++j )
            B[j][i] = A[i][j];
}

void Dot_Product()
{
    // Some variables we'll be needing
    int hdr[3]; // Number of elements, Row, Column
    MPI_Status status;

    while( 1 )
    {
        MPI_Recv( &hdr[0], 3, MPI_INT, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );

        double A[hdr[0]], B[hdr[0]], C = 0;

        MPI_Recv( &A[0], hdr[0], MPI_DOUBLE, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );
        MPI_Recv( &B[0], hdr[0], MPI_DOUBLE, 0, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );

        for( int i = 0; i < hdr[0]; ++i )
            C += A[i] * B[i];

        // This assumes the master is 0
        MPI_Send( &hdr[0], 3, MPI_INT, 0, 1, MPI_COMM_WORLD );
        MPI_Send( &C, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD );
    }
}

void Matrix_Multiply(
    int m, int r, int n,
    double A[m][r],   // First matrix to multiply
    double B[r][n],   // Second matrix to multiply
    double C[m][n]    // Matrix to store the result
)
{
    // Some variables we'll be needing
    int hdr[3];
    int i, p, Elements;
    double Bprime[n][r];
    MPI_Status status;

    // Transpose the second matrix
    Transpose_Matrix( r, n, B, Bprime );

    // Send the vectors
    hdr[0] = r;

    for( hdr[1] = 0; hdr[1] < m; ++hdr[1] )
        for( hdr[2] = 0; hdr[2] < n; ++hdr[2] )
        {
            p = (hdr[1]*2 + hdr[2]) % nProcesses;
            MPI_Send( &hdr[0], 3, MPI_INT, p, 1, MPI_COMM_WORLD );
            MPI_Send( A[hdr[1]], r, MPI_DOUBLE,
                      p, 1, MPI_COMM_WORLD );
            MPI_Send( Bprime[hdr[2]], r, MPI_DOUBLE,
                      p, 1, MPI_COMM_WORLD );
        }

    // Receive the results
    Elements = m * n;
    for( i = 0; i < Elements; ++i )
    {
        MPI_Recv( &hdr[0], 3, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );

        p = (hdr[1]*2 + hdr[2]) % nProcesses;

        MPI_Recv( &C[hdr[1]][hdr[2]], 1, MPI_DOUBLE, p,
                  MPI_ANY_TAG, MPI_COMM_WORLD, &status );
    }
}

3.2.7 DISCUSSION

We have implemented the matrix multiplication algorithm using five different facilities. The choice of that particular algorithm was not made for performance; any modern processor


can evaluate the dot product operation in less time than it takes to send the data. Rather, the choice reflects the fact that it is a well-known operation, very simple to implement, and one that allows for embarrassingly parallel implementations like the ones shown.

By looking at these small pieces of code, one can contemplate the strengths and weaknesses of each paradigm, as well as their philosophies. For instance, it is easily noticeable that the UNIX RPC implementation works sequentially; the client blocks at each function call and waits for the remote procedure to complete execution before continuing. In order to take advantage of any concurrency, multiple threads must be created in the client to block independently of each other. The Java RMI implementation is far more flexible and more easily readable; however, it is cumbersome and offers no concurrency either. Another disadvantage is that the programmer has to use the Java programming language, so the program must run in a Java Virtual Machine.

The Jini implementation has an advantage over the two previous systems: there is no need to know where the service is located. One just needs to query the Lookup service for a server with the given characteristics, and an appropriate server is found. Nonetheless, the lack of concurrency still lurks, and it is not fair to require the client to enforce concurrency. Further, the implementation is more convoluted than the one using Java RMI, and the libraries and runtime environments that support Jini are immense. Needless to say, this approach is still bound to the use of the Java programming language and a Java Virtual Machine to run both the clients and the servers.

MPI uses a completely different paradigm from the previous ones. A program runs, and the MPI runtime environment replicates it either locally or remotely (the program need not know which). Each replica, along with the original copy, is assigned a unique process identifier within that set. There are a number of advantages to this model; amongst others: i) the master process does not have to keep a record of which nodes hold the service, ii) the communication primitives are varied and powerful, and iii) concurrency is automatically


achieved, limited only by the number of processes created. However, there are disadvantages to this model; for instance, it is virtually impossible to deploy shared libraries as a process completely independent from the application development, and a set of libraries and a runtime platform are necessary to use this facility.

Finally, we come to criticize the implementation using SAR. It is indeed quite lengthy, but each line is perfectly readable and its purpose is easy to understand. Concurrency is automatically achieved when the network redirects the requests towards different nodes. Although there is no concurrency within a single server, this is easily corrected by dedicating a separate thread of execution to each request received and to sending the results back to the requestor. The burden of achieving concurrency would then fall not on the client but on the server, which is much more advisable. Also, the need UNIX RPC and Java RMI have to know which nodes are capable of providing the service does not exist in this implementation. There is no lookup service to which the client needs to connect. At no point does the client know which node processed a request and produced its result. Finally, the architecture is by definition using the Client/Server paradigm, without needing a runtime platform or a third-party set of libraries. Just by implementing SAR straight into the socket interface and the Operating System facilities that process network traffic, one can harness the full potential of SAR using a small, well-known programming interface.

3.3 EXPERIMENTAL ANALYSIS

To analyze the proposed system architecture's behavior, a series of tests was carried out using an event-driven simulator built for this particular purpose. These tests are grouped into a series of experiments meant to show diverse aspects of the system's behavior and how it compares to other families of systems. The following subsection briefly describes the overall architecture of the simulator and those of the systems that were built on top of it. The subsections that follow describe the experiments that were carried


Figure 3.1: The Simulator’s Core

Figure 3.2: The Simulator’s Modular Structure

out, their purpose and their outcome. Finally, we perform a discussion of the results, as a

preamble to the concluding chapter.

3.3.1 EVENT-DRIVEN SAR HIERARCHICAL NETWORK SIMULATOR

A minimalistic yet very flexible event-driven simulator was built for the purpose of evaluating SAR's behavior. It is comprised of an engine, a set of entities, an event queue and an action associated with each type of event (Figure 3.1). The event queue is a priority queue, ordered by time from soonest to latest, of the events that have been generated and are pending to happen. The entities represent the structural components present in the system to be simulated, i.e., the resources. The approach followed to implement this simulator is object-oriented; hence, both entities and events are represented by abstract classes from which the actual types of entities and events inherit, depending on the actual system being simulated.
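The core loop just described is small enough to sketch directly. The fragment below is a generic illustration of the engine/event-queue idea, with a sorted linked list standing in for the priority queue; it is not the simulator's actual source.

#include <stdio.h>
#include <stdlib.h>

/* Generic event-driven core: an event carries a timestamp and an action; the
   engine repeatedly pops the soonest event and runs its action, which may in
   turn schedule further events. */
typedef struct event {
    double time;
    void (*action)(struct event *self);
    struct event *next;
} event_t;

static event_t *queue = NULL;          /* kept sorted by time, soonest first */

static void schedule(double time, void (*action)(event_t *))
{
    event_t *e = malloc(sizeof *e);
    e->time = time;
    e->action = action;
    event_t **p = &queue;
    while (*p && (*p)->time <= time)   /* find the sorted insertion point */
        p = &(*p)->next;
    e->next = *p;
    *p = e;
}

static void run(void)
{
    while (queue) {
        event_t *e = queue;            /* pop the soonest pending event */
        queue = e->next;
        e->action(e);                  /* the action may schedule more events */
        free(e);
    }
}

static void tick(event_t *self)
{
    printf("event at t=%.1f\n", self->time);
    if (self->time < 3.0)
        schedule(self->time + 1.0, tick);
}

int main(void)
{
    schedule(0.0, tick);
    run();
    return 0;
}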

The simulator has a multi-layered design and a multi-module implementation, as shown in Figure 3.2. We will now provide an explanation of each layer, starting from the bottom


Figure 3.3: The Simulator Implementation’s Class Hierarchy

and working upwards.

A HIERARCHICAL NETWORK REPRESENTATION

The whole purpose of this simulator is to evaluate different hierarchical networks, so it is natural that its first conceptual layer should be a representation of such networks. The resources composing this layer are: Port, Switch Port, Switch and Node. They are contained in a single module called Resources. On top of this layer lie the three architectures studied in this work, and others could be plugged in in the future for further analysis. Please refer to Figure 3.3 for the class hierarchy.

JINI SIMULATOR

The Jini algorithms are very simple both to describe and to implement. In a nutshell, there is a Lookup server that keeps track of the services nodes are capable of providing, along with each node's status. There is the concept of a lease, which is a permission given to a node to use some other node's service; the Lookup service assigns leases fairly, so that load is distributed evenly and there is no starvation or overloading of servers. When a node produces a service request, it first looks at its own Lease Table to see if there is a suitable server assigned to it. If there is, and the lease hasn't timed out, the request is sent directly


to that server; if there isn’t, a lookup process will be started, and the request will back off

until the Lookup service replies.

In order to simulate the Jini architecture, a very simple set of additions is necessary. We need to include a module with the algorithms, implemented as events, called Jini-Events. There are two more modules, called Jini-Resources and Jini-Message. The resources used by Jini are the ones from the layer below, with the addition of a Lookup Node and its Lookup Table. Further, each regular Node is equipped with a Lease Table to keep track of the leases assigned by the Lookup Node and of their deadlines.

SAR-MIDDLEWARE ARCHITECTURE AND SIMULATOR

In order to test the two hypotheses presented in Chapter 1 separately, an intermediate architecture was developed. It was called SAR-Middleware, because it resembles the SAR philosophy implemented within the nodes (perhaps as a middleware layer) instead of within the switching elements of the interconnection network. It can be regarded both as a hybrid and as an intermediate step between the Jini architecture and the SAR architecture presented in this work. It can also be regarded as a fully-distributed version of Jini.

The lookup algorithm works similarly to that of Jini, but since there is no centralized Lookup service, each node has to broadcast the lookup request in order to obtain a global view of the system. However, a complete broadcast for each lookup request would be far too expensive; hence, we relied on a partial broadcast mechanism that delivers a message or a connection to a node if and only if all the resources needed (i.e., crosspoint switches, point-to-point links and table entries) are available at once. Thus, a lookup request will not necessarily be received by all nodes in the system, but only by those that have a free path to the requestor. All of the nodes that can provide the service will reply, granting the requestor a lease. However, they cannot know whether the requestor will take the lease from them or from another server, so they will not keep track of it. When a node produces a service request, just as with Jini, it will check for a


suitable lease that hasn't timed out; if one is found, the service request proceeds directly to the server; if not, the lookup process is triggered. When a node launches a lookup request, it has no idea when a reply will come, so the request is backed off a random amount of time that grows logarithmically with the number of request attempts.
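The exact backoff function is not pinned down here; one plausible reading of "grows logarithmically with the number of attempts" is sketched below, where BASE_DELAY and the use of drand48() are illustrative assumptions rather than part of the design.

#include <math.h>
#include <stdlib.h>

#define BASE_DELAY 1.0e-3   /* illustrative backoff unit, in simulated seconds */

/* Random backoff whose expected value grows logarithmically with the number
   of lookup attempts made so far. Illustrative only. */
double lookup_backoff(int attempts)
{
    return drand48() * BASE_DELAY * log(1.0 + attempts);
}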

To implement this architecture, the lookup table has to be broken up, and every node has to incorporate a portion of it. The result is similar to the Service Table portions that the nodes have in a SAR system. The simulated nodes are hence similar to the ones from the subjacent layer, but include this Lookup Table and the Lease Table. In order to keep the nomenclature homogeneous, the modules implementing this architecture are named MW-Resources, MW-Events and MW-Message.

SERVICE-ADDRESS ROUTING SIMULATOR

The SAR resources, algorithms and mechanisms were explained in detail in Chapter 2. There is a class to represent each of the resources, such as Port, Switch, Switch Port and Node; they all have additions over the ones in the layer below, as described in the previous chapter. The modules that implement the SAR simulator are called SAR-Resources and SAR-Events.

THE FINISHED SYSTEM

There is a set of other modules meant for setting up and monitoring the simulator:

• Chron. Takes real-time measurements in order to do performance monitoring and

optimization of the simulator itself.

• Initialize. This module prepares the simulator for execution, so its responsibilities are many: interpreting the command-line arguments and setting the simulator's optional settings accordingly; invoking the Layout module to set up the network; reading in the services configuration file and generating events accordingly; invoking the Chron and Signals modules to set up the real-time chronometers and the signal handlers, respectively; and reading in the workload configuration file and generating events accordingly. Finally, it sets some preliminary statistics using the Stats module.

• Layout. This module reads in the network configuration file and configures the network accordingly. It creates nodes and switches of the types specified by the user, and interconnects their ports, also according to the configuration file.

• Log. This module is used to create the Log file. Every major event that occurs during a simulation is recorded there; for example, that a new request was launched, or that a reply arrived at the requestor.

• Signals. This module changes a few signal handlers so that the user can better interact with simulations running in the background.

• Stats. This module stores a table with all the metrics taken during a simulation. At the end, the table is written to disk into a file called the Stats file.

3.3.2 METHODOLOGY

Before setting up the experiments themselves, we need to decide what it is that we need to analyze, measure and prove. As mentioned in Section 3.1, we have four main purposes for the experiments, among which we can distinguish the following three categories: i) analysis of the SAR LCAN's scalability, ii) how Jini compares to the SAR-Middleware architecture, and iii) how the SAR-Middleware architecture compares to SAR. The analysis of the SAR LCAN scalability is meant to justify two choices: the proposed hierarchical design using LCANs, and the use of limited process queues inside the nodes. The second category of experiments is an attempt to prove the first hypothesis, the use of a completely distributed architecture based on service-addressed communications implemented in a layer beneath the applications. The third category is meant to prove the second hypothesis, namely the implementation of that communications facility inside the network fabric components (switches).


In order to provide a quantitative method of comparing results and drawing conclusions, we will provide measurements of the following performance figures: round trip, response time, CPU utilization, time attempting request, number of request attempts, number of reply attempts, time attempting reply, and time in the server's queue. These numbers will be calculated both per request and per job. Per-request numbers give an idea of the raw performance of the system being tested, but per-job numbers tell us what the user actually sees, because it is only when a job is finished that the user can say he is happy. The only exception is CPU utilization, which is measured over the whole timeslot being analyzed, instead of per job or request. The numbers for a job are the maxima amongst all its composing requests, because by definition a job can only end when its last request has finished. The most important figures, of course, are the Job Response Time, the CPU Utilization and the Job Throughput; they measure, respectively, the average happiness level per user, the resource utilization as opposed to resource waste, and the rate of happy users per unit of time.
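Stated symbolically (a restatement of the rule above, not an additional metric): for a job J composed of requests 1, ..., k, each per-job figure except CPU utilization is taken as the maximum over its composing requests, e.g.

T_{\mathrm{resp}}(J) = \max_{1 \le i \le k} T_{\mathrm{resp}}(i)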

The individual experiments are specified by establishing constant and variable system

parameters, and the values they will take. This is achieved by the use of a descriptive table.

The parameters used to describe a test are:

System Configuration:

• Nodes. Number of nodes in the system.

• Shape. In systems using a network hierarchy, this parameter indicates how many switching stages (L) there are, how many downers per switch (D), and how many uppers per switch (U). The format is L-DxU; for example, the 5-3x2 shape used in the Caches experiment yields 3^5 = 243 leaf nodes.

• Cache Size. The number of entries in each port's table. It defaults to Services Per Node (see below).

Services Configuration:


• Services. Maximum number of services that will be available throughout the whole

system.

• Services Per Node. The maximum number of different services a node can provide.

Workload Configuration:

• Number of Requestors. Size of the node subset that will be producing service

requests at any point during runtime.

• Avg. Jobs Per Requestor. The expected number of request bursts a requestor will

produce during the system runtime.

• Max. Job Size. Each Job is composed of a number of requests of the same service

produced at the same time. This parameter limits said number.

• Max. Spacing. This parameter limits the time between two consecutive Jobs, re-

gardless of the requesting node’s identity.

• Max. Processing Time. This is the limit on the CPU time any request can take on

the server.

SAR LCAN SCALABILITY

The first batch of experiments is meant to answer the following questions:

• “How does the Service Table Cache size affect the system’s performance?”

• “Does limiting the process queue sizes in the nodes affect the system's performance positively or negatively?”

• “How does having an LCAN hierarchy affect the scalability of SAR?”

In order to answer these questions, a set of experiments was devised in which sets of clusters are simulated. The experiments are named according to the questions they are


Parameter               Test 1   2    3    4    5    6    7    8    9    10
Cache Size              10       20   30   40   50   60   70   80   90   100
Shape                   5-3x2 (all tests)
Nodes                   243
Services                100
Services Per Node       10
Avg. Jobs Per Node      8
Max. Job Size           16
Max. Spacing            10 seconds
Max. Processing Time    100 seconds

Table 3.1: Experiment Description: Caches

supposed to answer; these names are: Cache Size, Limiting the Queue Size and Composite Switch.

Cache Size. This experiment will attempt to prove that by relinquishing Level-Global Knowledge (and hence avoiding the possibly exponential table growth), the system performance is not negatively affected. Remember that Level-Global Knowledge is a special case of Level Caches (see Section 2.2). The main difference between the two comes down to the fact that the latter is scalable whereas the former is not. This arises from the fact that Level Caches has a per-port service table of constant size, while Level-Global Knowledge has a table whose size depends on the current level and the total number of different services in the system.

In this experiment, the simulated clusters will differ in the number of cache entries per

port. The increase will be linear, and it will range from the number of different services per

node (Level Caches) to the number of different services in the whole system (Level-Global

Knowledge). Table 3.1 indicates the values for the other system parameters, which remain

constant across the clusters.
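As a small illustrative sketch (Python; the linear spacing and the end points are taken from Table 3.1, everything else is an assumption for the example), the swept cache sizes could be generated as follows:

    def cache_size_sweep(services_per_node, total_services, steps=10):
        """Linearly spaced cache sizes, from Level Caches (services per node)
        up to Level-Global Knowledge (every service in the system)."""
        step = (total_services - services_per_node) / (steps - 1)
        return [round(services_per_node + i * step) for i in range(steps)]

    print(cache_size_sweep(10, 100))   # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]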

Limiting the Queue Sizes. Even though theoretically a process queue could grow forever, that does not necessarily mean it is a good idea. Without a proper load-balancing mechanism, the queues could grow ever larger in a node subset while other nodes are becoming idle. This experiment will repeat the previous one, with the slight difference that the queues are arbitrarily limited to 10 processes (Q), with a maximum of 4 processes serving the same type of service (k).
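One plausible reading of this admission rule is sketched below (Python, illustrative only; the queue is modeled simply as a list of the service types currently queued, and the service names are hypothetical):

    def accepts_request(queue, service, Q=10, k=4):
        """Reject a request if the node's process queue already holds Q processes,
        or if k queued processes already serve the same type of service."""
        if len(queue) >= Q:
            return False
        if sum(1 for s in queue if s == service) >= k:
            return False
        return True

    # Only two 'fft' processes are queued and the queue holds fewer than Q entries,
    # so the request is accepted.
    print(accepts_request(["fft", "fft", "sort"], "fft"))   # True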


    Parameter               Value for Test 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
    Shape                   2-16x16 / 2-16x8 / 2-16x1 / 4-4x4 / 4-4x2 / 4-4x1 / 2-16x4 / 1-256x0
    Cache Size              10
    Nodes                   256
    Services                100
    Services Per Node       10
    Avg. Jobs Per Node      8
    Number of Requestors    16
    Max. Job Size           32
    Max. Spacing            10 seconds
    Max. Processing Time    100 seconds

Table 3.2: Experiment Description: Composite


Composite. This set of experiments is meant to understand the role of the LCAN Shape (i.e., the ratio between the number of downers and uppers in the switch, and the number of levels) in the distribution of jobs, cache information and scheduling. Several LCAN networks connecting the same number of nodes will be compared against each other, and then against a single crossbar.

The workload also remains constant across experiments; however, its distribution is not. Four test groups were created, one for each LCAN shape tested. Each group consists of four tests, one for each distribution of work requests. Table 3.2 specifies the system and workload configuration.

As can be seen, the set of requests is the same across all tests; however, the requests are performed by a subset of the nodes. This is to analyze how the network shape helps or obstructs the distribution of load amongst servers.

JINI VS. SAR-MIDDLEWARE

These experiments will attempt to prove the advantage of having a fully distributed service-addressed intercommunication facility implemented in a layer underneath the applications.


    Parameter               Value for Test 1 / 2 / 3 / 4
    Nodes                   32 / 64 / 128 / 256
    Max Job Spacing (sec.)  100 / 50 / 25 / 12
    Cache Size              10
    Services                64
    Services Per Node       10
    Avg. Jobs Per Node      8
    Max. Job Size           8
    Max. Spacing            10 seconds
    Max. Processing Time    100 seconds

Table 3.3: Experiment Description: Cluster Size with Light Workload

For these experiments, we are going to consider a single crossbar switch as the interconnection network, for two reasons: first, the focus of this work is tightly coupled distributed systems; second, a hierarchical structure would give Jini a clear disadvantage, which would make the comparison unfair to it. We are going to test two different scenarios: i) how the systems scale as the number of nodes grows (Cluster Size), and ii) how the systems behave when the same exact workload is produced by a reduced node subset. A note has to be made about the Jini experiments: the number of requestors and the number of servers will always be one fewer than specified in the table. The reason is that Node Zero holds the Lookup Service, and it cannot request or provide any service other than that. This does not alter the experiment, because one fewer requestor is only a marginal difference that is cancelled out by the absence of one server.

Cluster Size. When the system scales in size, we want its processing capabilities to scale as closely as possible. For this reason, in this experiment we will double the number of nodes in each subsequent test, and we will also have the new nodes provide the exact same workload at the same rate, so the system will effectively receive twice the work per unit of time. What we expect from this experiment is to compare which system deals better with these changes. Table 3.3 shows the system parameters used for this experiment; a small sketch of the scaling schedule is given below.
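A minimal sketch of that schedule (Python; the starting values are taken from Table 3.3, and the dictionary keys are illustrative only):

    def cluster_size_tests(base_nodes=32, base_spacing=100, steps=4):
        """Each subsequent test doubles the node count and halves the maximum job
        spacing, so the offered work per unit of time doubles as well (cf. Table 3.3)."""
        tests, nodes, spacing = [], base_nodes, base_spacing
        for _ in range(steps):
            tests.append({"Nodes": nodes, "Max Job Spacing (sec.)": spacing})
            nodes, spacing = nodes * 2, spacing // 2
        return tests

    print(cluster_size_tests())
    # [{'Nodes': 32, 'Max Job Spacing (sec.)': 100}, {'Nodes': 64, ...: 50},
    #  {'Nodes': 128, ...: 25}, {'Nodes': 256, ...: 12}]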

Unbalanced. On a cluster, not all of the nodes have to interface with the users. In fact, it is not uncommon for a cluster to have a set of machines devoted only to workload processing, while others serve as terminals for the users to enter new tasks into the system. It is the purpose of this experiment to show how these two systems would behave under such conditions. To simulate this scenario, the number of requestors is halved in each subsequent test, as shown in Table 3.4. We then analyze and compare how these systems behave.

    Parameter               Value for Test 1 / 2 / 3 / 4
    Requestors              32 / 64 / 128 / 256
    Max Job Spacing (sec.)  100 / 50 / 25 / 12
    Nodes                   256
    Cache Size              10
    Services                64
    Services Per Node       10
    Avg. Jobs Per Node      8
    Max. Job Size           8
    Max. Spacing            10 seconds
    Max. Processing Time    100 seconds

Table 3.4: Experiment Description: Unbalanced with Light Workload

SAR-MIDDLEWARE VS. SAR

These experiments are the next step after the previous ones. We know beforehand (from preliminary experimental analysis) that the workload used to compare Jini versus SAR-Middleware is, in fact, not a challenge for the latter. We also know that it is not a challenge for SAR either. Thus, we repeated the previous experiments to compare these two systems; instead of producing a maximum job size of eight requests, the job size will be limited in these experiments to thirty-two requests.

3.3.3 RESULTS AND DISCUSSION

Five experiments were carried out. The first three were meant to analyze the scalability of SAR LCANs; these experiments were named Caches, Limited Queues and Composite. The purpose of the last two is to show how SAR compares against its closest cousin in terms of goals: Jini. However, there are two principles to SAR, of which Jini implements the first at the application level (Service Addressing) and the second not at all (routing of Service Addresses inside the network). For this reason, we developed the SAR-Middleware architecture, which is similar to Jini but with two differences: service addressing is performed by the operating system, and there is no centralized lookup server.

During the preliminary experimentation stage of this work, it became clear that Jini and SAR did not belong in the same bracket; a workload that would seem reasonable for use with Jini would run fine with SAR, but one that stressed a SAR cluster, or even a SAR-Middleware cluster, caused so many network conflicts with Jini that the system became deadlocked. For this reason, we used two different workload sets: light and heavy. The light workloads were used to compare Jini and SAR-Middleware, while the heavy ones were used for SAR-Middleware and SAR. The following sections show the most relevant graphs of the results obtained from those experiments, and discuss what can be deduced from them.

SCALABILITY OF SAR LCANS

The use of Service Table caches was proposed in Chapter Two as a low-cost alternative to the exponential cost increase of Level-Global Knowledge. This knowledge distribution method was called "Level Caches". In this set of experiments, we use ten different cache sizes, growing linearly, where the smallest is the proposed size of "the maximum number of different services to be provided by a single node", and the biggest is equivalent to Level-Global Knowledge.

In the graphs shown in Figure 3.4, we can see that requests spend almost 70% of the response time queued in the server. This is good, because it means the servers are found quickly. The actual amount of time processes spend queued is not so relevant, as it depends on the actual workload and how it relates to the amount of available resources. What we are looking for in these graphs is whether Level-Global Knowledge finds servers better than smaller caches, and whether the server selection is more adequate. The average response time varies by no more than 5% across all tests; further, Level-Global Knowledge's was not the lowest. When we turn to how many requests and jobs were completed during a small timeframe, we can notice that the number is exactly the same regardless of the cache size. This was run in a system with five switching levels where the LCAN was not squared, which shows that the cache size is not relevant to system performance even for large systems.

Figure 3.4: Measurements for the Caches experiment. To the left, statistics were computed per Request. The ones on the right were computed grouping them in their respective Jobs.

The second experiment, whose results are shown in Figure 3.5, is an attempt to minimize job response time by avoiding nodes whose process queues are larger than a parameter Q and contain k processes of the same service type. Though arbitrary, Q and k are workload-dependent parameters, because in a heavily loaded system a small choice of Q and k would do nothing but saturate the network with requests that cannot be satisfied. In this case, we chose Q = 5 and k = 2. By means of these parameters, the average response times improved 21% per request and 17% per job. The reason is that the request time was reduced roughly to half. These numbers are again very similar regardless of the cache size, but this time the largest cache had the second-to-worst average response time. Request and job throughput varied a little, but not significantly enough to be visible.

Figure 3.5: Measurements for the Limited-Queues experiment. To the left, statistics were computed per Request. The ones on the right were computed grouping them in their respective Jobs.

The third experiment's statistics are in Figure 3.6. Average response time per process groups the tests into three different groups. The cluster with a single crossbar produced an average response time of approximately 120 seconds, while the four-layered non-redundant tree's is 500 seconds. In between, the rest of the systems show very similar response times: an average of 351 seconds, which is not much higher than the general average of 342 seconds, with a standard deviation of 22 seconds. The LCAN shape does not negatively affect the response time, and the number of levels only produced an increase of approximately 7%. Job-wise, the response time appears to actually improve for networks with fewer uppers per switch. That is probably due to the pollution of Service Table caches caused by frequent updates from more nodes. The crossbar-based system has an average response time of approximately a fourth of the rest's, but this time the non-redundant four-level tree is much closer to the average.

Figure 3.6: Measurements for the Composite experiment. To the left, statistics were computed per Request. The ones on the right were computed grouping them in their respective Jobs.

CLUSTER SIZE

Figure 3.7 clearly shows that even with a light workload Jini is not scalable. In these experiments, both the job rate and the number of nodes were doubled at each step, representing two equal clusters with similar workloads that had been connected together by replacing their respective switches with a bigger crossbar. A system shows good scalability properties if the response time and request processing throughput are not affected; if they actually improve, the system shows great resource utilization. Jini's response time grows exponentially with the system, although at a smaller rate. SAR-Middleware, however, shows no problem scaling with the cluster size, at least using light workloads. With heavy workloads, SAR-Middleware becomes sloppy, producing response times an order of magnitude larger than those of SAR; even then, however, the response times do not grow exponentially like those of Jini.

Jini's request throughput is not negatively affected under light workloads, but since the offered load doubles at each step, the throughput relative to the workload falls exponentially. That is another sign of poor or null scalability. Although job throughput appears to scale appropriately, that is naught but a byproduct of chance: it is impossible to sustain a job completion rate above the request rate. When using SAR-Middleware, request and job throughputs scale according to the system scaling rate. With heavy loads, SAR-Middleware's request and job throughputs are similar to SAR's, though at the cost of very high latencies.

Figure 3.7: Measurements for the Cluster-Size experiments. To the left, tests were run with the light workloads to compare Jini and SAR-Middleware. The ones on the right were run using the heavy workloads to compare SAR-Middleware and SAR.

System runtimes show that Jini's ability to handle requests was surpassed after the second test, while SAR-Middleware had no problem finishing all the requests not long after the last one was launched, shortly before second 22600. For the heavy workloads, SAR appeared to have its hands full trying to process the requests at the same rate they were being produced in the first cluster, but as the system size grew it was able to do better request placement, so the overall throughput could grow to match the request rate. In this matter, SAR-Middleware did not do as well, but at least its total completion times did not grow like Jini's.

UNBALANCED

In these experiments, visible in Figure 3.8, only a subset of the nodes produces service requests, while the rest serve. These systems mimic clusters, not uncommon in practice, where some nodes are designated to serve as access terminals. In this case, the access terminals also provided services, so we have the same number of servers throughout all tests. The number of access terminals in each step was double that of the previous one. For Jini, this doubling of the number of requestors represents a halving of the response time: the same exact cluster produced twice the response times when the same exact workload was produced by half the requestors. SAR-Middleware's latencies were affected, but not very noticeably. With heavy workloads, nevertheless, it produced better response times with more requestors than otherwise. SAR appears to need even higher workloads before it saturates as SAR-Middleware did.

Jini shows throughput rates a little higher than SAR-Middleware's, and for heavy loads SAR-Middleware shows the same throughput as SAR. This means they all benefited from the extra bandwidth. System runtime shows the light workload was too much for Jini, while SAR-Middleware had no problem with it. The latter did have problems handling the heavy workloads, but increasing the number of requestors helped considerably. In all tests, SAR finished a short time after the last request was launched.

Figure 3.8: Measurements for the Unbalanced experiments. To the left, tests were run with the light workloads to compare Jini and SAR-Middleware. The ones on the right were run using the heavy workloads to compare SAR-Middleware and SAR.

3.4 COST/PERFORMANCE ANALYSIS OF SAR LCANS

A crossbar switch with n bidirectional ports requires n² crosspoint switches. In fact, this is a special case of the more general rule that a crossbar switch with n input ports and m output ports requires n × m crosspoint switches, as shown in Figure 3.9. It is evident from this fact that using a full crossbar does not scale for systems with large numbers of nodes. It is well known that a Clos network can reduce this requirement to O(k log(n/k)) switches with n/k ports by means of a multistage switch array. The number of crosspoint switches is thereby reduced to

    O((n² / k) log k).                                                  (3.1)

This represents a reduction by a factor of k / log k. It means that the more switches we use, the bigger the savings. However, bandwidth is not preserved. Using a squared LCAN, the complexity requirement is slightly lower, with the added advantage that it retains full crossbar bandwidth.

Figure 3.9: Implementation of a full Crossbar


Figure 3.10: The INOC LCAN Switch’s Mutilated Crossbar

An LCAN switch has fewer connections than a regular crossbar switch, because no connections are allowed between two uppers, or from a port to itself. Consequently, as Figure 3.10 shows, a squared LCAN switch only has (3/4)n² = 3d² crosspoint switches. Thus the full network will have

    3d · d^(log_d n) · log_d n = 3dn log_d n                            (3.2)

crosspoint switches, which is O(n log n).

A squared LCAN has a d/u ratio of 1. By increasing that ratio, we reduce the bandwidth in the higher layers until we reach the limit: a d-ary tree is an LCAN with a d/u ratio of d (u = 1). In this case, it is well known that a complete network with n nodes has

    sum_{i=0}^{log_d n − 1} d^i = (1 − d^(log_d n)) / (1 − d) = (n − 1) / (d − 1)        (3.3)

switches, which is O(n). Each of these LCAN switches has d² + 2d crosspoint switches, hence the total cost of such a d-ary tree's interconnection network is O(n). The drawback is that for each new level, the bandwidth is reduced by a factor of d.
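To make these counts concrete, the following small check (Python; the choice of n = 256 and d = 4 is only an example) evaluates the crosspoint formulas derived above:

    import math

    def crossbar_crosspoints(n):
        """Full n x n crossbar: n^2 crosspoint switches."""
        return n * n

    def squared_lcan_crosspoints(n, d):
        """Squared LCAN (u = d): 3*d^2 crosspoints per switch, n/d switches per
        level over log_d(n) levels, i.e. 3*d*n*log_d(n) in total (Equation 3.2)."""
        levels = round(math.log(n, d))
        return 3 * d * n * levels

    def d_ary_tree_crosspoints(n, d):
        """d-ary tree (u = 1): (n - 1)/(d - 1) switches (Equation 3.3),
        each with d^2 + 2*d crosspoints."""
        return ((n - 1) // (d - 1)) * (d * d + 2 * d)

    n, d = 256, 4
    print(crossbar_crosspoints(n))          # 65536
    print(squared_lcan_crosspoints(n, d))   # 12288 = 3 * 4 * 256 * 4
    print(d_ary_tree_crosspoints(n, d))     # 2040  = 85 switches * 24 crosspoints

As expected, the tree is the cheapest of the three, in line with its O(n) cost, while the squared LCAN sits between the tree and the full crossbar.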

The complexity added by the Service Table represents a constant increase of O(t) transistors, where t is the size of the per-port cache. If we have O((d + u) log n) ports in the network, the whole network cost increases to O(d^(log_d n) + d log n), which is still O(n log n). Recall that the largest value d can take is n, in which case we would have a single crossbar. Hence, we can safely assume a SAR LCAN will always cost between O(n) and O(n log n).

Given that the savings in the worst-case scenario are given by a factor of O(n), the performance/cost ratio of an LCAN-based system performing similarly to one using a full crossbar represents an improvement by that much. This means that system performance must decrease by a linear factor in order to make this switch composition unjustifiable. However, the bandwidth is kept regardless of the number of levels, and the latency increases by a factor of the number of levels, i.e. O(log n). This means that even though the performance is affected, the more levels the LCAN has, the more cost-efficient the system becomes.


CHAPTER 4

CONCLUSIONS

Service-Address Routing opens a new field of research, one so wide that it can provide research material for many journal papers and doctoral dissertations. The main focus of this work was to introduce SAR, justify its basic principles and scrutinize its benefits. It is virtually impossible to look into a new line of research and define what there is to do in an exhaustive manner. This chapter summarizes the contributions made by this work and notes several more tasks that need to be looked into in the near future.

4.1 CONTRIBUTIONS

Several achievements have been presented throughout this document. The most important is that we showed via experimentation the two hypotheses presented at the end of Chapter One. Service-Oriented Programming has been evolving and enhancing network transparency. Systems like Jini and CORBA implement nodeless Client-Server interfaces at the application layer; strictly speaking, they are SAR systems implemented wholly in the highest OSI layer. This work takes away the centralized Lookup services, sets the application layer free of resource discovery and management, and implements them straight into the interconnection network, away from the computing nodes. The resulting system was evaluated and compared against Jini, its closest relative.

Other contributions and achievements are enumerated below:

1. The use of Content-Addressable Memories (CAMs) was proposed as an implementation of the Service Table. This allows an efficient, constant-time service lookup operation.


2. The Least-Common Ancestor Networks were used to create scalable hierarchical systems using commonly available crossbars. This choice was carefully evaluated both experimentally and theoretically, using a simulated environment and a cost-performance analysis, respectively.

3. The use of fixed-size Level Caches, instead of the more costly Level-Global Knowledge, was proposed and analyzed. The results from these experiments show that there is no penalty associated with it, at least with the routing algorithms used herein.

4. A programming paradigm was proposed, with examples showing the process of implementing an algorithm using that paradigm.

5. A programming interface was proposed, using the UNIX socket interface. It was shown how with few modifications one can harness the power of SAR using this simple, well-known interface. This option was evaluated by writing an algorithm using the interface, and comparing it to other implementations using alternative interfaces.

6. The design and implementation of an event-driven hierarchical network simulator for

behavioral and performance evaluations.

7. Two implementations of SAR systems were proposed, each on a different integration level. Namely, they are SAR LCAN Clusters and INoCs.

8. Proposal of a minimal protocol set for the most basic SAR-related communications.

In the protocol definition itself, it can be noted how SAR requires low overhead and

little header processing.

4.2 FUTURE WORK

As mentioned earlier, it is impossible to detail everything that has been left to do by further research and development projects. However, the following few points are questions that need answering or possibilities that need deep analysis.

1. Performance analysis of packet-switched SAR networks.

2. Analysis of SAR systems where requests in a job need not be launched individually.

3. Implementation of a SAR prototype and analysis of its properties, especially performance.

4. Integration of a SAR protocol stack in a UNIX operating system, and evaluation of

the latency incurred.


CHAPTER 5

PUBLICATIONS RESULTING FROM THIS WORK

• Service discovery for GRID computing using LCAN-mapped hierarchical directories. Journal of Supercomputing, March 2007.

• Service Address Routing: A Network-Embedded Resource Management Layer for Cluster Computing. To be published in the Parallel Computing Journal, Elsevier.

• Federated GRID Clusters using Service Address Routed Optical Networks. To be published in Future Generation Computer Systems, Elsevier.

• Service Address Routing: A Network Architecture for Tightly-Coupled Distributed Computing Systems. ISPAN 2005, in Las Vegas, Nevada.

• Cost-Performance Analysis of Service-Address-Routed Least-Common-Ancestor Networks. Submitted to The Journal of Interconnection Networks.

• Intelligent Networks on-Chip Based on Service-Address Routing. Submitted to IEEE Computer Architecture Letters.

• Performance Analysis of Service-Address-Routed, Tightly-Coupled Computing Clusters. Submitted to Elsevier's Journal on Parallel and Distributed Computing.

• Programming Service-Address-Routed Cluster Systems. Submitted to IEEE Transactions on Parallel and Distributed Systems.

