Source: people.cs.pitt.edu/~moir/Papers/menkethesis.doc

SYNCHRONIZATION METHODS FOR SCRAMNET+ REPLICATEDSHARED-MEMORY SYSTEMS

by

Stephen Frank Menke

BEE, Georgia Institute of Technology, 1993

Submitted to the Graduate Faculty of

Arts and Sciences in partial fulfillment

of the requirements for the degree of

Master of Science

University of Pittsburgh

1999


This thesis was presented

by

Stephen Frank Menke

It was defended on

April 28, 1999

and approved by

Rami Melhem, Professor of Computer Science, Committee Member

Mark Moir, Assistant Professor of Computer Science, Thesis Advisor

Daniel Mossé, Associate Professor of Computer Science, Committee Member


Copyright by Stephen Frank Menke

1999


SYNCHRONIZATION METHODS FOR SCRAMNET+ REPLICATED

SHARED-MEMORY SYSTEMS

Stephen Frank Menke, MS

University of Pittsburgh, 1999

SCRAMNet+ (Shared Common Random Access Memory Network) is a communications

network that transparently provides replicated shared-memory via a high-speed fiber-

optic ring topology. Such systems combine the ease of programming of shared-memory

multiprocessor systems with the distance and heterogeneity of message-passing networks.

These features are ideal for a variety of distributed real-time applications.

This thesis explores both blocking and non-blocking synchronization methods in such

systems. We first develop a mutual exclusion algorithm, the most common blocking

synchronization method, by exploiting unique features of the SCRAMNet+ hardware.

Through theoretical and experimental analysis we compare our algorithm to a mutual

exclusion algorithm suggested by the manufacturer, Systran Corp. The analysis concludes

that our algorithm is both scalable and fair, whereas Systran's algorithm is not. Our algorithm also has faster execution times for a SCRAMNet+ network of any size.

Although mutual exclusion is the most common method for synchronization, non-

blocking methods overcome a number of problems caused by the use of mutual

exclusion, such as deadlock. It is well known that strong primitives such as compare and

swap (CAS) or load-linked/store-conditional (LL/SC) are required for general non-

blocking synchronization. We therefore present and evaluate a CAS algorithm for

SCRAMNet+ systems. We validate the algorithm by incrementing a shared-memory

counter with the CAS operation. More significantly, we use the CAS algorithm to

construct lock-free and wait-free implementations of large shared objects, which are designed to


overcome the problems associated with mutual exclusion. We experiment with both lock-

free and wait-free versions of a queue to validate the large object implementation on a

real system.

Although we used a real system to perform experiments on all the algorithms, it was

limited to only two nodes. Therefore, we also built a simulator, based on Augmint, which

can model a SCRAMNet+ network of any size. We used experiments to validate our

simulation against our real-world results, which then allowed us to extend our analysis to

systems with more than two nodes.


Acknowledgements

First and foremost, I would like to thank my future wife Carolyn for her love, patience

and support. Both work and school demanded many long hours. However, her

encouragement and smile always kept me going.

I would also like to thank my advisor, Mark Moir, for his flexibility and guidance. He has

balanced my knowledge by adding the theoretical. I truly believe this will contribute to my

career, wherever it may lead.

Thanks, too, to the rest of my committee: Rami Melhem and Daniel Mossé. I am grateful

for their flexibility in arranging their schedules for my defense. This also includes the

help from Daniel’s group in setting up RT-Mach.

Finally, I would like to thank Systran Corp. for supplying the hardware, software and

documentation necessary to complete this thesis. Most importantly, Chris Fought from

technical support, whose assistance was key to developing the driver for RT-Mach.


Table of Contents

1 Introduction
2 SCRAMNet+ Hardware
   2.1 General Purpose Counter / Global Timer
   2.2 Error Correction
   2.3 Interrupts
   2.4 Write-Me-Last Mode
3 Blocking Synchronization
   3.1 Systran's Mutual Exclusion Algorithm
      3.1.1 Acquire
      3.1.2 Release
   3.2 Our Mutual Exclusion Algorithm
      3.2.1 Acquire
      3.2.2 Release
      3.2.3 ISR0
   3.3 Theoretical Comparison
   3.4 System Experiments
      3.4.1 No Contention
      3.4.2 Contention
   3.5 Simulation Experiments
      3.5.1 No Contention
      3.5.2 Contention
      3.5.3 Polling
      3.5.4 Heavy Contention
   3.6 Conclusions and Future Work
4 Non-Blocking Synchronization
   4.1 Compare and Swap
      4.1.1 CAS
      4.1.2 Read
      4.1.3 Analysis
      4.1.4 Experiments
   4.2 Large Objects
      4.2.1 Experiments
      4.2.2 Conclusions and Future Work
5 Simulation
   5.1 Compile-Time
   5.2 Run-Time
      5.2.1 Events
      5.2.2 Data Movement
      5.2.3 Tasks
      5.2.4 Threads
      5.2.5 Backend
      5.2.6 Execution
   5.3 SCRAMNet+ Backends
      5.3.1 Memory Model
      5.3.2 User Events
      5.3.3 Write-Me-Last Backend
      5.3.4 Interrupt Backend
      5.3.5 Polling Backend
   5.4 Simulation Parameters
      5.4.1 Transit Time
      5.4.2 Access Times
      5.4.3 Context Switch Time
   5.5 Conclusions and Future Work
6 Summary and Conclusions
Appendix A
   A.1 SCRAMNet+ Driver
   A.2 SCRAMNet+ API
      A.2.1 scr_mem_mm
      A.2.2 get_base_mem
      A.2.3 scr_csr_read
      A.2.4 scr_csr_write
      A.2.5 scr_id_mm
      A.2.6 scr_acr_read
      A.2.7 scr_acr_write
Appendix B
   B.1 Syntax
      B.1.1 Augmint Parameters
      B.1.2 Backend Parameters
      B.1.3 Simulation Parameters
   B.2 Experiments
Bibliography


List of Tables

Table 1 Comparison of average execution times for a pair of acquire/release operations when the maximum number of nodes equals 256 (μs)

Table 2 Average execution time for a read operation (μs)

Table 3 Average execution time for a CAS operation (μs)

Table 4 Average execution time for a pair of enqueue/dequeue operations for the lock-free construction of large objects (μs)

Table 5 Average execution time for a pair of enqueue/dequeue operations for the wait-free construction of large objects (μs)

Table 6 Simulator executable directories

Table 7 Scripts to run simulation experiments


List of Figures

Figure 1 Systran's mutual exclusion algorithm

Figure 2 Our mutual exclusion algorithm

Figure 3 Comparison of ME algorithms without contention on a real system

Figure 4 Close-up comparison of ME algorithms without contention on a real system

Figure 5 Timing sequence of our algorithm's acquire procedure without contention

Figure 6 Comparison of ME algorithms with contention on a real system

Figure 7 Close-up comparison of ME algorithms with contention on a real system

Figure 8 Comparison of ME algorithms without contention on a simulated system

Figure 9 Close-up comparison of ME algorithms without contention on a simulated system

Figure 10 Comparison of ME algorithms with contention on a simulated system

Figure 11 Close-up comparison of ME algorithms with contention on a simulated system

Figure 12 Comparison of polling and interrupt versions without contention on a simulated system

Figure 13 Comparison of polling and interrupt versions with contention on a simulated system

Figure 14 Comparison of all ME algorithms under heavy contention on a simulated system

Figure 15 Close-up comparison of all ME algorithms under heavy contention on a simulated system

Figure 16 Comparison of average execution times for each node under heavy contention

Figure 17 Semantics of compare and swap

Figure 18 Compare and swap algorithm

Figure 19 Timing diagram of our algorithm's acquire procedure without contention


1 Introduction

This thesis presents and evaluates synchronization mechanisms for SCRAMNet+ (Shared

Common Random Access Memory Network) systems. SCRAMNet+ is a

communications network geared toward real-time applications, and based on a replicated

shared-memory concept [12]. By combining the advantages of shared-memory multi-

processors and message passing systems, SCRAMNet+ offers distributed shared-memory

with reliable, deterministic and low-latency updates. Thus, SCRAMNet+ has proven to

be ideal for many real-time applications [19].

SCRAMNet+ systems offer the benefits of a shared-memory multiprocessor, namely ease

of programming, low-latency communications, and little or no software overhead for

communications [4]. A SCRAMNet+ network consists of up to 256 computers (nodes)

each with a SCRAMNet+ network card. The network cards are interconnected through

fiber-optic cables in a serial-ring topology. Each network card has dual-ported RAM

(Random Access Memory) that can be mapped into the address space of any process on a

node. Any write to the dual-ported RAM is transparently replicated to each node, and

hence every process, in the network.

In addition to providing a shared-memory abstraction to applications, SCRAMNet+

systems also have the advantages of a message passing system. First, processors can be

connected at distances of hundreds or even thousands of meters [4]. In contrast, a typical

multiprocessor system is limited to only a few meters. SCRAMNet+ networks can also

connect machines with different architectures or operating systems. This might be an

advantage, for example, in an industrial control system where the data acquisition and


control are run on distributed embedded processors and the graphical interface runs on

standard PCs.

Typical distributed systems, such as industrial control systems, require concurrent access

to shared data. Usually some synchronization is required to protect the consistency of the

data. The most common methods used are mutual exclusion algorithms, which protect

shared data by controlling access to a critical section. The semantics of mutual exclusion

prevent more than one process from entering a critical section at a time, thereby limiting

access to the data. The manufacturer, Systran Corp., presents a mutual exclusion

algorithm for SCRAMNet+ memory systems in [15]. However, this algorithm has several

shortcomings.

- Its performance is drastically affected by the number of nodes in the system;

- The solution is not starvation free: it is theoretically possible for one process to repeatedly attempt to acquire the lock but never succeed; and

- The solution necessarily prioritizes the processes, but does not make any concrete guarantees. Furthermore, the prioritization mechanism is unavoidable and leads to starvation of lower-priority nodes.

In this thesis we present our own mutual exclusion algorithm for SCRAMNet+ systems.

This algorithm exploits special hardware features of the SCRAMNet+ network and is

both fair and starvation free. We compared the two algorithms using both real-system

experiments and simulations that compute the average execution time for a pair of

acquire/release operations. The results demonstrate that our algorithm has faster

execution times both with and without contention, regardless of the network’s size.

Although both algorithms are sufficient for synchronization, when one process enters the

critical section, any other process desiring access to the shared data must wait indefinitely

for that process to exit the critical section.
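The blocking behavior just described can be sketched with a minimal test-and-set spin lock. This is a generic shared-memory illustration only, written with C11 atomics: SCRAMNet+ memory provides no atomic read-modify-write across the ring, so a naive lock of this kind is not correct there, which is precisely why Section 3 develops specialized algorithms.

```c
#include <stdatomic.h>

/* Minimal test-and-set spin lock: a sketch of blocking (mutual
 * exclusion) semantics on a conventional shared-memory machine.
 * NOT valid for SCRAMNet+ memory, which lacks an atomic
 * read-modify-write across the ring. */
typedef struct { atomic_flag held; } spinlock_t;

static void spin_init(spinlock_t *l) { atomic_flag_clear(&l->held); }

static void spin_acquire(spinlock_t *l) {
    /* Spin until the flag was previously clear; at most one process
     * is inside the critical section at a time.  A process that never
     * releases the lock blocks everyone else indefinitely. */
    while (atomic_flag_test_and_set(&l->held)) { /* busy-wait */ }
}

static void spin_release(spinlock_t *l) { atomic_flag_clear(&l->held); }
```

The acquire/release pair bracketing a critical section is the pattern whose average execution time the experiments in Section 3 measure.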


Recently, significant progress has been made toward efficient lock-free and wait-free

implementation of shared objects (e.g. [2, 3, 6, 7, 8, 9]). A shared object is a shared data

structure and associated operations. A lock-free implementation of a shared object

guarantees that after a finite number of steps of a process p’s operation, some process

(not necessarily p) completes an operation on the object. A wait-free implementation

guarantees that each operation of a process p completes after a finite number of p’s steps.

The result is fault tolerance, meaning some process (lock-free) or the actual process

(wait-free) will continue to progress, regardless of the failure of any other process. A

mutual exclusion algorithm cannot be either lock-free or wait-free because if a process

never exits the critical section, no other process can continue.

In [6] Herlihy defines universal objects that can construct any wait-free object. He

assigns a consensus number to each object; an object with consensus number n can

implement any wait-free object shared by up to n processes. Herlihy also proved that CAS

(Compare and Swap) is universal and has a consensus number of infinity. Therefore,

CAS is an important primitive to implement in a shared-memory system that requires

wait-free objects. Given this, we have implemented and evaluated a CAS algorithm for

SCRAMNet+ systems. By conducting a simple experiment that used a CAS to increment

a shared counter concurrently, we validated the correctness of this algorithm. We also

compared experiments with and without contention, and found that our algorithm

performs well under contention. However, the contention experiments were only run with

two nodes and therefore further testing is needed. Now that we have created an effective

CAS primitive for SCRAMNet+ systems, we can construct wait-free objects for such

systems.
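The shared-counter validation follows the standard CAS retry pattern. In the sketch below, C11's atomic_compare_exchange_strong stands in for the SCRAMNet+ CAS algorithm of Section 4 (whose implementation is entirely different); only the usage pattern is illustrated.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Increment a shared counter with the classic CAS retry loop.  The
 * C11 built-in here is a stand-in for the SCRAMNet+ CAS algorithm;
 * the retry pattern is the same. */
static void cas_increment(_Atomic uint32_t *ctr) {
    uint32_t old, desired;
    do {
        old     = atomic_load(ctr);  /* read the current value */
        desired = old + 1;           /* compute the new value  */
        /* The CAS fails, and we retry, if another process changed
         * *ctr between the read and the CAS. */
    } while (!atomic_compare_exchange_strong(ctr, &old, desired));
}
```

With n processes each performing k increments, the counter must end at exactly n*k; a shared-counter check of this kind is how correctness of a CAS implementation is typically validated.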

Herlihy extended his work in [6] by suggesting lock-free and wait-free constructions for

large shared objects. However, the implementation is inefficient due to the large amount

of data being copied – especially when much of the copying may be unnecessary. In [2]


Anderson and Moir present a more efficient implementation of lock-free and wait-free

constructions for large shared objects. In [5], Filachek furthers their work by

implementing and testing their algorithms in simulations. We have furthered this study by

porting the algorithms to a SCRAMNet+ system and testing them there. Our main objective was to

validate the operation of the algorithms. We accomplished this task by testing concurrent

access to lock-free and wait-free implementations of a queue on an actual system and

verifying the consistency of the queue.

The original evaluation of all our algorithms was performed on an actual system

consisting of two 266 MHz Pentium II PCs running the RT-Mach operating system. Each

PC was equipped with a SCRAMNet+ network card with 2MB RAM interconnected with

single-mode fiber optic cables. However, due to the availability and cost of the hardware,

we were only able to construct a system with two nodes. This was sufficient for testing

the algorithms without contention, but provided little insight into situations with many

nodes and heavy contention. Therefore, we designed SCRAMNet+ simulators using

Augmint.

Augmint is a fast, execution-driven multiprocessor simulator for Intel x86 architectures

[16]. Augmint allows the modification of a library called the backend to implement

various memory models. We created three different backend libraries to model different

configurations of the SCRAMNet+ system, and then duplicated the original experiments

for mutual exclusion in order to compare the simulations to our real world results. The

comparison verified the accuracy of our simulators, allowing us to continue the

simulations with confidence in the results. We then used the simulators to evaluate the

mutual exclusion algorithms under heavy contention. The results of these experiments

show that Systran's algorithm fails to guarantee its prioritization scheme. They also show

that by modifying our algorithm to use polling instead of interrupts, the resulting


algorithm outperforms Systran's algorithm under heavy contention, regardless of the

number of nodes in the network.

The remainder of this thesis is organized as follows. We provide an overview of the

SCRAMNet+ hardware in Section 2. Section 3 covers blocking synchronization methods

for SCRAMNet+ systems. It contains a detailed description of Systran’s and our mutual

exclusion algorithms and an analysis of the experiments performed on the real system

and on simulations. Section 4 covers non-blocking synchronization methods for

SCRAMNet+ systems. It describes a CAS algorithm and analyzes the results of real

world experiments. It then presents results of experiments for lock-free and wait-free

objects implemented with the CAS algorithm. Section 5 contains an overview of

Augmint and a full description of the simulation implementation. In Section 6, we

summarize the overall results and conclusions.


2 SCRAMNet+ Hardware

SCRAMNet+ cards have many configurable features. This section describes the features

of interest to this thesis. For more detailed information or a complete listing and

explanation of all features, see [12]. To understand the algorithms in this thesis it is first

necessary to understand how the SCRAMNet+ network operates.

A SCRAMNet+ node updates the shared-memory on all other nodes by inserting a

message on the ring for every write to shared-memory. The message contains the

memory offset and value of the word written. When the message is received by another

node, the write is replicated by writing the same value to its memory. When the

originating node receives its own message, the message is removed from the ring.

Although SCRAMNet+ uses a ring topology, it is essentially a point-to-point network in

a ring orientation. That is, a message must be received and retransmitted by each

intermediate node to traverse the ring. This introduces a minimum delay of 247

nanoseconds at each node [12]. For our experiments we used the fixed size packet

configuration, which according to [12] has a maximum delay of 800 nanoseconds at each

node. Therefore, our two-node system should have a round-trip transit time between 494

and 1600 nanoseconds.
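Since every message must be forwarded by each node on the ring, the round-trip bounds scale linearly with the node count. A small helper makes the arithmetic from [12] explicit (247 ns minimum and 800 ns maximum forwarding delay per node in the fixed-size-packet configuration):

```c
/* Per-node forwarding delays from [12], fixed-size-packet mode. */
enum { NODE_DELAY_MIN_NS = 247, NODE_DELAY_MAX_NS = 800 };

/* Round-trip transit time bounds for a ring of `nodes` nodes: the
 * originating node's message is forwarded once per node before it
 * returns and is removed from the ring. */
static long round_trip_min_ns(int nodes) { return (long)nodes * NODE_DELAY_MIN_NS; }
static long round_trip_max_ns(int nodes) { return (long)nodes * NODE_DELAY_MAX_NS; }
```

For the two-node system this gives 494 to 1600 ns, the range quoted above.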


2.1 General Purpose Counter / Global Timer

SCRAMNet+ cards provide a General Purpose Counter / Global Timer that can measure

the round-trip transit time of a message with a resolution of 26.66 nanoseconds. Using

this timer the transit time on our two-node network was measured as 1270 nanoseconds,

which is within the expected range.

2.2 Error Correction

The SCRAMNet+ network has a bit error rate of 10^-15, meaning that an error might occur

once every 76 days of continuous, 24-hour, 100% bandwidth-saturated network utilization

[20]. Although rare, these errors must still be handled. We configured the SCRAMNet+

card in PLATINUM mode to detect and handle any errors. PLATINUM mode can detect

and correct two types of errors. First, bit errors are detected with a bit-by-bit comparison

of the message once it has returned back to the originating node. Second, a configurable

time-out can detect the loss of any originated message. If either type of error occurs, they

are corrected by automatically re-transmitting the original message until it is received

correctly. Also, once an error has been detected, any new messages from that node are

stored in a transmit FIFO and not sent until the message that was in error is received

correctly. Therefore, PLATINUM mode guarantees that every message is eventually

delivered correctly. SCRAMNet+ cards can also be configured to generate an interrupt on

the host whenever an error occurs. We used this interrupt in all of our experiments to

generate an error message; however, no error ever occurred.

2.3 Interrupts

In addition to interrupting on errors, the SCRAMNet+ network cards can be configured to

generate an interrupt whenever a given 32-bit memory word is written. Each 32-bit word

in SCRAMNet+ memory has an associated ACR (Auxiliary Control RAM) location that

is used to configure this feature. Each ACR can be configured to send interrupts, receive

interrupts or both. Although the memory of the cards is replicated on every node, the


ACRs are not. Therefore, the interrupt configuration for each word can be different on

every node.

Whenever a node writes a 32-bit memory word, the ACR for that word on that node is

checked. If it is configured to send interrupts, an interrupt message is generated

containing the memory offset of the word written. Whenever a node receives an interrupt

message, the ACR for the word written is also checked. If the ACR is configured to

receive interrupts, the memory offset for that word is stored in a FIFO (First-In, First-Out

data buffer) for the ISR (Interrupt Service Routine) to interrogate. The first entry into the

FIFO generates an interrupt on the host and disables the interrupt hardware until re-enabled

by the ISR. Any subsequent interrupt messages are inserted in the FIFO without

generating an interrupt. The ISR then continually processes the interrupt FIFO until it is

empty. This allows the ISR to process multiple interrupts with only one context switch.

Once the ISR detects the FIFO is empty, it re-enables the interrupt and exits.

Both the mutual exclusion and CAS algorithms presented in this thesis exploit this

interrupt feature by enabling node 0 to receive interrupts. Writing to specific shared-

memory words generates interrupt messages signaling the ISR on node 0 of a request.

The ISR essentially arbitrates between concurrent requests from other nodes. Processes

on the ISR node may also participate in the algorithm, because node 0’s ACRs for the

appropriate words are configured to both send and receive interrupts. The SCRAMNet+

card must also be configured to enable self-interrupts, which allows a node to receive its

own interrupt messages. Systran’s algorithm does not use the interrupt features of the

cards; instead, it relies on the Write-Me-Last mode, which is described next.

2.4 Write-Me-Last Mode

Normally when a node writes to a shared-memory word, the word is immediately

modified on the originating node and a message is propagated around the ring replicating


the write to all other nodes. In Write-Me-Last mode, the originating node of a write is the

last node to have its memory word written. This is achieved by only modifying the

originating node’s memory when it receives its own message. This can be used to

guarantee that data is available on all other nodes by writing a value to a shared-memory

word and then spinning on the word written until it changes to that value. Systran uses

this technique in their mutual exclusion algorithm (see Section 3.1).
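The write-then-spin technique can be illustrated with a toy two-node model; the WriteMeLastRing class below is a hypothetical stand-in for the network, delivering a write to the originator's memory only after the other node has seen it:

```python
class WriteMeLastRing:
    """Toy two-node model: a write updates the peer first, then the originator."""

    def __init__(self):
        self.mem = [{}, {}]   # replicated shared memory, one dict per node

    def write(self, origin, word, value):
        peer = 1 - origin
        self.mem[peer][word] = value     # the message replicates the write remotely...
        self.mem[origin][word] = value   # ...and only then updates the originator

ring = WriteMeLastRing()
ring.write(0, "FLAG", 7)
while ring.mem[0].get("FLAG") != 7:      # spin until our own write comes back
    pass
# Reaching here guarantees node 1 has already seen FLAG = 7.
assert ring.mem[1]["FLAG"] == 7
```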


3 Blocking Synchronization

Mutual exclusion algorithms are a form of blocking synchronization. The semantics of

mutual exclusion prevent more than one process from entering the critical section at a

time. A process enters the critical section via the Acquire() procedure. If a process B

attempts to enter the critical section while another process A is already inside it,

process B remains in the Acquire() procedure until process A performs a

Release(), which exits the critical section. Therefore, process B is blocked until process A

exits the critical section.

In this section we present a mutual exclusion algorithm suggested by Systran in [15] and

a new mutual exclusion algorithm based on interrupt features of the SCRAMNet+

hardware. We also present the results of both real world and simulation experiments

comparing the two.

3.1 Systran’s Mutual Exclusion Algorithm

Figure 1 contains Systran’s mutual exclusion algorithm, which is described in [15]. The

programming notation used is similar to that of most shared-memory algorithms and

should be self-explanatory. Their algorithm requires that a node request to enter the

critical section by setting a flag. It must then determine that no other nodes are in the

critical section by reading the flags of all the other nodes. If any other node’s flag is set,

there has been a collision (more than one node has simultaneously written to its flag) and

one of the nodes must continue while the others reset their flags and retry. Systran

suggests a prioritization scheme whereby the lower priority node retries and the higher

priority node may continue.


Systran’s algorithm also requires that the SCRAMNet+ system be configured for Write-

Me-Last mode, as described in Section 2.4. This is necessary to guarantee that all nodes

have seen a write before the originating node continues. This is achieved by writing a

value to a shared-memory word and spinning on that word until the value is read back.

The Acquire() and Release() procedures, described next, implement entering and exiting the critical section, respectively.

3.1.1 Acquire

Each node’s flag is represented by an element in the array FLAG[N] where N is the

number of nodes in the system. Nodes are prioritized with the highest priority node as the

first element in the array and the lowest priority node as the last. To enter the critical

section, a node n must continually read the entire FLAG array until every element is zero.

This indicates that no other node is currently in the critical section. Then the node writes

a non-zero value to its element in the array, FLAG[n]. It then spins on that array element

until that value is read back. Since the SCRAMNet+ network is in Write-Me-Last mode,

this guarantees that all other nodes have seen its request. Now the node must scan the

FLAG array from highest to lowest priority to see if there have been any collisions.

If a collision is detected, the lower priority node removes its request by writing a zero to

FLAG[n] and starts back at the beginning of the loop. The higher priority node spins on

the lower priority node’s array location until it changes to zero or a time-out expires. The

time-out is necessary because the lower priority node may not have even seen the higher

priority node’s request and will not have cleared its flag. Systran suggests a time-out of

one message transit time. This is easily achieved in Write-Me-Last mode by incrementing

FLAG[n] and waiting to see it change. If the higher priority node does time-out, it

revokes its request by writing a zero to FLAG[n] and starts back at the beginning of the

loop. Otherwise, it continues and enters the critical section.
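The priority-based collision rule can be checked with a tiny single-threaded model; the may_continue helper and the FLAG snapshot below are illustrative, not part of Systran's code:

```python
# Toy snapshot after a collision: nodes 0 and 1 wrote their flags concurrently.
# Node indices double as priorities (index 0 = highest, as in Section 3.4).
FLAG = [1, 1, 0, 0]

def may_continue(n):
    """A node may continue only if no higher-priority (lower-index) flag is set."""
    return all(FLAG[i] == 0 for i in range(n))

assert may_continue(0)        # highest-priority node keeps its request
assert not may_continue(1)    # lower-priority node must clear FLAG[1] and retry
```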


3.1.2 Release

To exit the critical section node n simply writes a zero to its designated array location,

FLAG[n].


Shared variable FLAG: array[0..N-1] of integer

Local variable i: 0..N-1; zero: boolean; grant: boolean; attempts: integer

procedure Acquire()
begin
    attempts := 0;
    do
        grant := true;
        FLAG[n] := 0;
        do
            zero := true;
            for i := 0 to N-1 do
                if FLAG[i] ≠ 0 then zero := false; break fi
            od
        while ¬zero;
        /* Write and wait for own request */
        attempts := attempts + 1;
        FLAG[n] := attempts;
        while FLAG[n] ≠ attempts do od;
        for i := 0 to N-1 do
            if FLAG[i] ≠ 0 then
                if i < n then
                    grant := false; break
                else if i > n then
                    /* Write and wait for one round trip or revoke */
                    attempts := attempts + 1;
                    FLAG[n] := attempts;
                    while (FLAG[n] ≠ attempts) ∧ (FLAG[i] ≠ 0) do od;
                    if FLAG[i] ≠ 0 then grant := false; break fi
                fi
            fi
        od
    while ¬grant
end

procedure Release()
begin
    FLAG[n] := 0
end


Figure 1 Systran’s mutual exclusion algorithm


3.2 Our Mutual Exclusion Algorithm

As explained in Section 2.3, the ISR on node 0 is configured to receive all interrupts in

the system. Our algorithm uses three shared variables to communicate between the ISR

and the nodes. The REQ array is configured to generate interrupts that signal the ISR that

a node requests access to the critical section. The GRANT array is used as a spinlock that

the ISR will write to notify a node that it has been granted access to the critical section.

Finally, RELEASE is configured to generate an interrupt, notifying the ISR that a node

has exited the critical section. The following three sections explain the code for Acquire(),

Release() and the ISR, which are shown in Figure 2. We have added one definition, snaddr,

to our programming notation: an address within SCRAMNet+ memory.

3.2.1 Acquire

To enter the critical section, a process p must first perform a local acquire. The local

acquire synchronizes processes on the same node by only allowing one process per node

to attempt to enter the critical section. This bounds the size of the arrays to the number of

nodes in the system, rather than the number of processes in the system. It also eliminates

unnecessary network and ISR activity by eliminating multiple requests from the same

node. Any local mutual exclusion algorithm can be used and the same method could be

applied to Systran’s algorithm. However, the local acquire was excluded from our

experiments so that timing particular to the algorithms could be studied.

Once process p has returned from the local acquire, it writes false to GRANT[n], where n

is the node that process p resides on, to initialize the spinlock. Then the process writes

true to REQ[n]. This generates an interrupt message to node 0. The ISR on node 0 then

determines which node is requesting to enter the critical section from the offset in the

interrupt FIFO. Meanwhile process p is spinning on GRANT[n]. Once the ISR has

determined to let the node enter the critical section, it writes true to GRANT[n] and

process p on node n exits the spinlock and continues. A process may now access any

shared data and exit the critical section via the release() procedure.


3.2.2 Release

To exit the critical section, process p writes true to RELEASE. This generates an interrupt

message to node 0, thereby notifying the ISR that a node is exiting the critical section.

RELEASE does not need to be an array due to the semantics of mutual exclusion. That is,

only one node can be in the critical section at a time. In contrast, because multiple nodes

may make concurrent requests, GRANT and REQ must be arrays.

3.2.3 ISR0

The ISR on node 0 maintains two local variables: owner is the node currently in the

critical section and wait is a FIFO queue of nodes waiting for the critical section. If an

interrupt occurs and the offset is within the REQ array, the ISR determines whether to

grant the critical section to the requesting node (req) based on the state of owner. If there

is no current owner, owner equals –1, the request is granted by writing true to

GRANT[req]. Otherwise req is inserted in the wait FIFO.

If the interrupt is caused by a write to RELEASE, then the next node in the wait FIFO

queue is granted the critical section by writing true to GRANT[owner]. If wait is empty,

then owner is set to –1, indicating that the critical section is available. Whenever the ISR

grants the critical section to a node, by writing true to either GRANT[req] or

GRANT[owner], the corresponding process p on node req or owner is released from its

spinlock.
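As a sanity check on this bookkeeping, the owner/wait logic can be modeled single-threaded in Python; interrupts are replaced by direct calls, and the function names are illustrative, not part of the actual ISR:

```python
from collections import deque

# Toy single-threaded model of the ISR's arbitration state for N nodes.
N = 3
GRANT = [False] * N
wait = deque()
owner = -1                       # -1 means the critical section is free

def on_request(req):
    """Models the interrupt raised by a write to REQ[req]."""
    global owner
    if owner == -1:
        owner = req
        GRANT[req] = True        # release the requester from its spinlock
    else:
        wait.append(req)         # FIFO ordering is what makes this starvation-free

def on_release():
    """Models the interrupt raised by a write to RELEASE."""
    global owner
    if wait:
        owner = wait.popleft()
        GRANT[owner] = True
    else:
        owner = -1

on_request(1)                    # node 1 acquires immediately
on_request(2)                    # node 2 must wait
on_request(0)                    # node 0 queues behind node 2
on_release()                     # node 1 exits; node 2 is granted next
```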


Figure 2 Our mutual exclusion algorithm

Shared variable REQ, GRANT: array[0..N-1] of boolean initially false; RELEASE: integer

interrupts: writes to REQ and RELEASE interrupt node 0

Private variable for node 0: wait: queue of 0..N-1; req: integer; owner: -1..N-1 initially -1

isr ISR0(addr: snaddr)
begin
    if addr = &RELEASE then
        if empty(wait) then
            owner := -1
        else
            owner := dequeue(wait);
            GRANT[owner] := true
        fi
    else
        req := (addr - &REQ[0])/4; /* Determine which node made the request. There are 4 bytes per word */
        if 0 ≤ req ∧ req < N then
            if owner ≠ -1 then
                enqueue(wait, req)
            else
                owner := req;
                GRANT[req] := true
            fi
        fi
    fi
end

procedure Acquire()
begin
    LocalAcquire();
    GRANT[n] := false;
    REQ[n] := true;
    while ¬GRANT[n] do od
end

procedure Release()
begin
    RELEASE := true;
    LocalRelease()
end


3.3 Theoretical Comparison

Both algorithms have arrays of length N and our algorithm also has a queue of size N.

Therefore, the space complexity of both algorithms is O(N), where N is the number of

nodes on the network. However, the time complexity of Systran’s algorithm in the

absence of contention is O(N) versus O(1) for our algorithm. This is because Systran’s

algorithm scans every element of its FLAG array, whereas our algorithm only uses the

REQ array to generate interrupts and the GRANT array as a spinlock. The execution time

of the ISR is also constant and is not affected by the number of nodes when there is no

contention.

However, increasing the ring size does increase the execution time of both algorithms by

also increasing the round-trip transit time of the network. This is inherent to the design of

the network and can not be avoided by any algorithm. Thus we do not consider the transit

time when computing the time complexity of either algorithm.

Furthermore, Systran’s algorithm is not starvation-free because of the chance of

collisions. It is possible that collisions never end and no progress is made by some node

(starvation) or by any node (live-lock). The likelihood of such a situation increases as the

number of nodes increases. Although Systran suggests a priority scheme, there is no

guarantee of the prioritization because a higher priority node may never get into the

critical section if it has a much slower processor speed than some other nodes. This is

because when a node releases the critical section and the array becomes all zeros, a faster

processor might detect this earlier and enter the critical section before the higher priority

but slower node.

In contrast, our algorithm is starvation-free due to the use of two FIFOs to order the

requests. First, the interrupt FIFO on the SCRAMNet+ cards ensures that the ISR on

node 0 receives the interrupt messages in First-In First-Out order. The error correction

and retransmission feature of PLATINUM mode also ensures that all interrupt messages

are eventually received correctly by node 0. Second, the wait FIFO is used by the ISR to

order the nodes waiting for the critical section. All of the analytical comparisons above


were verified through experiments on an actual SCRAMNet+ system and through

simulations.

3.4 System Experiments

Experiments were performed to test each algorithm both with and without contention. In

general, each experiment performed 10,000,000 iterations of an Acquire/Release pair

incrementing a single global variable between them. The average execution time of

the pair of operations was determined by dividing the total time to execute the experiment

by the number of iterations. Due to our limited hardware, all experiments were performed

on a system with only two nodes. However, the total possible number of nodes in the

experiment was varied, which increases the array sizes for both algorithms and the

queue size for our algorithm. This does not capture the increase

in the round-trip time, but it does isolate the effects of the algorithms themselves.
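The measurement loop can be sketched as follows; acquire and release are placeholders for either algorithm's operations, and time.perf_counter stands in for whatever clock the experiments used:

```python
import time

def measure(acquire, release, iterations=10_000):
    """Average the cost of one Acquire/Release pair over many iterations."""
    counter = 0
    start = time.perf_counter()
    for _ in range(iterations):
        acquire()
        counter += 1              # the critical-section workload: one increment
        release()
    elapsed = time.perf_counter() - start
    return elapsed / iterations, counter

# With no-op lock operations the timing is trivial, but the counter check
# mirrors the experiments' correctness test (final count == iterations).
avg_pair_cost, counter = measure(lambda: None, lambda: None)
assert counter == 10_000
```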

For all of the experiments on Systran’s algorithm, node 0 was configured as the highest

priority node and was therefore assigned the first element of the FLAG array. Likewise,

node 1 was configured as the lowest priority node and assigned the second element in the

FLAG array. For all the experiments on our algorithms, node 0 was configured to execute

the ISR and participate in the algorithm and node 1 was configured to only participate in

the algorithm. The first experiments performed were in the absence of contention.

3.4.1 No Contention

The experiments without contention were executed on each node and for each algorithm

but without the other node participating. Figure 3 graphs the average execution time for

each algorithm executed on each node as the maximum number of nodes is increased.

The graphs show that our algorithm scales far better than Systran’s algorithm. This is due

to our O(1) time complexity in absence of contention compared to Systran’s O(N), where

N is the number of nodes on the ring. This is because Systran’s algorithm scans its FLAG

array of length N at least twice per acquire. In contrast, our algorithm only uses its REQ

array to generate interrupts and its GRANT array as a spinlock. Therefore, their graphs

rise as the number of nodes increases, whereas our graphs are flat.


Figure 4 shows a close-up view of the same graphs as Figure 3. It shows that our

algorithm performs better than Systran’s algorithm when a system contains 9 nodes or

more. The graphs also indicate a difference between node 0’s and node 1’s average

execution times for our algorithm. This is due to the different sequence of events for the

acquire procedure on each node, as shown in Figure 5. Node 0 does not have to wait for a

transit time for the ISR to see its writes and vice versa, since they both are on the same

node. However, the process on node 0 does suffer from a context switch delay between

the time the ISR finishes and when the process can continue. In contrast, node 1 must

always wait a transit time for the ISR to see its writes and vice versa, but it does not have

to wait for a context switch before continuing. This is because a message is sent to node 1

as soon as the ISR writes to GRANT[1] and then the context switch occurs as the ISR

exits. This context switch does not affect node 1 since it occurs on node 0.

[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-550) versus maximum number of nodes (0-256), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR) and Our Alg. (Node w/o ISR)]

Figure 3 Comparison of ME algorithms without contention on a real system


[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-45) versus maximum number of nodes (2-16), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR) and Our Alg. (Node w/o ISR)]

Figure 4 Close-up comparison of ME algorithms without contention on a real system

Figure 5 Timing sequence of our algorithm’s acquire procedure without contention

Node 0:
1. Process writes to REQ[0]
2. Interrupt occurs and context switch changes to ISR
3. ISR writes true to GRANT[0]
4. ISR exits and context switch changes to the process
5. Process sees GRANT[0] = true

Node 1:
1. Process writes to REQ[1]
2. Transit time delay for write to reach node 0
3. Interrupt occurs and context switch changes to ISR
4. ISR writes true to GRANT[1]
5. Transit time delay for write to reach node 1
6. Process sees GRANT[1] = true


3.4.2 Contention

The experiments with contention were executed simultaneously on both nodes with a

barrier to synchronize the start of the experiment. The first test was to verify that the

semantics of mutual exclusion were maintained. This was achieved by simply verifying

that the global counter, which is incremented between the Acquire/Release pair, was

twice the number of iterations, which was always true for both algorithms. The

experiments also calculated the average execution times as before.

Figure 6 and Figure 7 graph the average execution time for each algorithm as the

maximum possible number of nodes is increased. The results indicate that the average

execution time increases with contention for Systran’s algorithm whereas it decreases

for our algorithm. This is demonstrated by computing the combined average execution

time of both nodes and comparing the result to the average without contention. The

average time with contention was calculated by taking the maximum value of the two

nodes and dividing it by two. The maximum value is used because it is the time that both

nodes finished all the iterations. The average time without contention was computed by

adding the results for the two nodes without contention and dividing that value by two.

Table 1 contains the computations from the results with a maximum number of nodes

equal to 256.

AVERAGE              SYSTRAN'S ALGORITHM            OUR ALGORITHM
With contention      (1020.4 / 2) = 510.2           (40.3 / 2) = 20.2
Without contention   (509.4 + 507.3) / 2 = 508.4    (22.8 + 19.0) / 2 = 20.9

Table 1 Comparison of average execution times for a pair of acquire/release operations when the maximum number of nodes equals 256 (µs)
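The averaging rules behind Table 1 can be replayed directly from the per-node measurements quoted in the text; the helper names below are illustrative:

```python
# Recomputing the Table 1 averages (maximum number of nodes = 256, times in µs).
# With contention, the slower node's finish time bounds the experiment, so the
# per-pair average for two nodes is max(node times) / 2.
def avg_with_contention(max_node_time_us):
    return max_node_time_us / 2

def avg_without_contention(time_a_us, time_b_us):
    return (time_a_us + time_b_us) / 2

systran_with = avg_with_contention(1020.4)               # about 510.2
systran_without = avg_without_contention(509.4, 507.3)   # about 508.4
ours_with = avg_with_contention(40.3)                    # about 20.2
ours_without = avg_without_contention(22.8, 19.0)        # 20.9

# Contention makes Systran's algorithm slower but ours slightly faster.
assert systran_with > systran_without
assert ours_with < ours_without
```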

Both graphs also demonstrate that our algorithm is fair to both nodes, while Systran’s

algorithm is not. That is, with our algorithm both nodes have identical execution times

of 40.3 microseconds, while there is a significant difference with Systran’s algorithm.

Their lower priority node takes twice as long as the higher priority node, because it is

completely starved by the priority mechanism. The highest priority node essentially runs

to completion and then the lowest priority node runs to completion. Although this is a


consequence of their design, it is not guaranteed as explained in Section 3.3. However,

our algorithm could guarantee this through sorting its wait queue by priority.

[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-1200) versus maximum number of nodes (0-256), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR) and Our Alg. (Node 1 w/o ISR)]

Figure 6 Comparison of ME algorithms with contention on a real system


[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-100) versus maximum number of nodes (2-16), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node w/ ISR) and Our Alg. (Node w/o ISR)]

Figure 7 Close-up comparison of ME algorithms with contention on a real system

3.5 Simulation Experiments

Our real system only consisted of two nodes, which was sufficient for testing the

algorithms without contention but neglected the effects of the network’s round-trip time.

It also limited the number of nodes that could participate in the experiments. Therefore,

we designed a SCRAMNet+ simulator for any number of nodes, which is described in

Section 5. The results of the simulation experiments for the mutual exclusion algorithms

are provided below.

The first experiments were identical to those on the real system and were performed both

with and without contention. As before, only two nodes were used and the maximum

possible number of nodes was increased. Only the trends of the simulation and real-system

experiments should be compared since the simulation does not take all factors into

account. One factor is the activity of the RT-Mach operating system, such as clock


interrupts, swapping, daemons, etc. Not considering these factors should make the results

of the simulation faster than the real-system results, which is the case.

3.5.1 No Contention

The simulation results of the experiments without contention are shown in Figure 8 and

Figure 9, which correspond to the real-system results in Figure 3 and Figure 4. These

graphs are similar in both the slope and magnitude of the graphs. They also show a

similar difference between the nodes with and without the ISR for our algorithm. That is,

the node without the ISR has faster execution times than the node with the ISR when

executed without contention.

3.5.2 Contention

The simulation results of experiments with contention are shown in Figure 10 and Figure

11, which correspond to the real-system results in Figure 6 and Figure 7. These graphs

also show similar trends in slope and magnitude. They also show that both nodes in our

algorithm have identical execution times when under contention.

The similarity of the simulation and real-world results validates the implementation and

parameters used for our simulations (see Section 5.4). This allowed us to perform more

simulation experiments with confidence in the results.


[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-600) versus maximum number of nodes (0-256), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR) and Our Alg. (Node 1 w/o ISR)]

Figure 8 Comparison of ME algorithms without contention on a simulated system

[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-45) versus maximum number of nodes (2-16), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR) and Our Alg. (Node 1 w/o ISR)]

Figure 9 Close-up comparison of ME algorithms without contention on a simulated system


[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-1200) versus maximum number of nodes (0-256), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR) and Our Alg. (Node 1 w/o ISR)]

Figure 10 Comparison of ME algorithms with contention on a simulated system

[Graph omitted: average execution time for an Acquire/Release pair (µs, 0-100) versus maximum number of nodes (2-16), for Systran's Alg. (Highest Priority), Systran's Alg. (Lowest Priority), Our Alg. (Node 0 w/ ISR) and Our Alg. (Node 1 w/o ISR)]

Figure 11 Close-up comparison of ME algorithms with contention on a simulated system


3.5.3 Polling

We also used the simulator to implement a polling version of our mutual exclusion

algorithm. It uses a dedicated node to continually poll the interrupt FIFO and execute the

same code as the ISR. This eliminates any context switch times, but adds one extra transit

time to the round trip time. We used the same experiments from the previous mutual

exclusion simulations to compare against the new polling version.

The results comparing the ISR and polling versions without contention are shown in

Figure 12. With the polling version, both nodes have the same average execution time

without contention, because neither thread runs on the ISR node. There is also a

4-microsecond (27 percent) improvement, since the polling version has no context switch

delay. The improvement is less than the full 5-microsecond context switch time

because the polling version uses a dedicated node, so three nodes are required for two to

participate; this increases the ring size and the round-trip transit time. The results also

further validate the context switch calculation from the timing diagrams in Figure 19,

which show one context switch for nodes without the ISR when using the interrupt

version of our algorithm.

The results comparing the interrupt and polling versions under contention are shown in

Figure 13. The trends are similar for both algorithms. That is, both nodes have identical

execution times. However, the removal of the context switch reduces the total execution

time of the polling version by 18 microseconds or 50 percent.


[Chart: average execution time for an Acquire/Release pair (µs, 0–25) versus maximum number of nodes (0–256). Series: Polling Ver. (Node 1), Polling Ver. (Node 2), Interrupt Ver. (Node 0 w/ ISR), Interrupt Ver. (Node 1 w/o ISR).]

Figure 12 Comparison of polling and interrupt versions without contention on a simulated system

[Chart: average execution time for an Acquire/Release pair (µs, 0–40) versus maximum number of nodes (0–256). Series: Polling Ver. (Node 1), Polling Ver. (Node 2), Interrupt Ver. (Node 0 w/ ISR), Interrupt Ver. (Node 1 w/o ISR).]

Figure 13 Comparison of polling and interrupt versions with contention on a simulated system


3.5.4 Heavy Contention

The main goal of developing the simulators was to perform experiments with a large

number of nodes without the actual hardware. Because our real system only had two

nodes, we could only increase the total possible number of nodes by increasing the sizes

of the structures used by each algorithm. However, this did not affect the ring size and

round-trip time of the messages. With our new simulations, we created experiments in which each added node both increases the ring size and participates in acquiring the critical section.

The results comparing the mutual exclusion algorithms are shown in Figure 14 and

Figure 15. These figures graph the average time for a node to execute an Acquire/Release

pair of operations. The average was computed by dividing the maximum execution time by the number of nodes in the experiment. This computation is valid because, under contention, all nodes execute simultaneously and therefore all finish by that maximum time. The results show that both the interrupt and

polling versions of our algorithm clearly outperform Systran’s algorithm when under

heavy contention. In fact, the polling version outperforms their algorithm with any number of nodes in the ring. Our execution time does increase as the number of nodes increases, but this is attributed to the increase in ring size. As the ring grows, it takes longer for a message to traverse it, ultimately increasing the execution times.

Figure 16 shows the average execution time for each node on a 256-node system under

heavy contention. Our algorithm is clearly fair since each node has an identical average

for both the interrupt and polling versions. In contrast, Systran's algorithm starves lower-priority nodes. The graph indicates this: as the node number increases, the node's priority decreases and its execution time increases. However, the graph is not strictly monotonically increasing, which indicates that this prioritization is not always guaranteed.


[Chart: maximum execution time for an Acquire/Release pair (µs, 0–1000) versus number of nodes (0–256). Series: Systran's Alg., Interrupt Ver., Polling Ver.]

Figure 14 Comparison of all ME algorithms under heavy contention on a simulated system

[Chart: maximum execution time for an Acquire/Release pair (µs, 0–65) versus number of nodes (2–16). Series: Systran's Alg., Interrupt Ver., Polling Ver.]

Figure 15 Close-up comparison of all ME algorithms under heavy contention on a simulated system


[Chart: average execution time for an Acquire/Release pair (µs, 0–1000) versus node number (0–256). Series: Systran's Alg., Interrupt Ver., Polling Ver.]

Figure 16 Comparison of average execution times for each node under heavy contention

3.6 Conclusions and Future Work

Our algorithm scales far better than Systran’s algorithm both with and without

contention. In fact, the polling version of our algorithm outperforms Systran’s algorithm

regardless of the number of nodes. The results of our experiments also show that our algorithm is starvation-free, whereas Systran's algorithm is not.

The removal of context switches in the polling version of our algorithm also leads to an even more interesting extension. The mutual exclusion functionality could be

embedded into the SCRAMNet+ card. Each card could use a microprocessor and

firmware to execute the algorithm, thus avoiding interrupts and context switches on the

nodes themselves. The load could also be distributed by configuring cards on each node

to process different critical sections. For example, node 0 could process the first 5 critical

sections, node 1 the next 5 and so on.


Another modification to the hardware could reduce the amount of memory used by our

mutual exclusion algorithm. Currently, the interrupt FIFO only contains the offset of the

word written. However, if the value of the write was also included, the REQ and GRANT

arrays could be reduced to just two words: REQ2 and GRANT2. A process could then

write its node number to REQ2 to signal the ISR and read GRANT2 to determine who

currently has the critical section, including itself. The ISR would use the value from the

FIFO, instead of the offset, to determine which node is requesting the critical section and

grant the critical section by writing the node number to GRANT2. RELEASE would be

used as before. Currently, the variables must be arrays to avoid a race condition in which the value of a write may change before the ISR can read it.

Finally, priority-based schemes could be implemented and tested to compare against

Systran’s algorithm. However, the nondeterministic nature of their prioritization scheme

will depend on the system configuration, such as the heterogeneity of processor speeds

among nodes, the correlation between the priority of a node and its position on the ring,

and the ring size.


4 Non-blocking Synchronization

Non-blocking algorithms avoid the pitfalls of blocking algorithms, such as deadlock.

Compare-and-swap (CAS) is a universal object, from which any wait-free or lock-free object can be constructed [6]. This section presents a CAS algorithm for SCRAMNet+ systems and uses it to

construct both lock-free and wait-free large objects.

4.1 Compare and Swap

To understand our implementation, it is important to first understand the semantics of

CAS, which are equivalent to the atomic code fragment in Figure 17. We have also

provided a Read operation, because most non-blocking algorithms that use CAS require

it. The semantics of the Read operation are to return the new value from the last

successful CAS operation.

Figure 17 Semantics of compare and swap

Our compare and swap algorithm uses similar ISR and spinlock techniques to our mutual

exclusion algorithm. However, instead of each element of the arrays representing a node, each element represents a process. Node 0 is again configured to execute the ISR, which maintains

the current value (cur) of the implemented register and arbitrates the CAS and Read

requests for all nodes.

The algorithm uses four shared arrays of length P, where P is the number of processes in

the system. The first two arrays, OLD and NEW, are used by the CAS() procedure to pass

CAS(X, old, new)
    if X = old then
        X := new;
        return true
    else
        return false


the old and new parameters to the ISR. The NEW array is also used to return the value to

the Read() procedure. The second two arrays, STAT and DONE, are used to indicate the

type of operation, read or cas, and the result of the operation, succ or fail. The STAT

array is also configured to generate an interrupt to inform the ISR of a request. The

CAS(), Read() and ISR procedures are shown in Figure 18 and are explained in the

following two sections.

4.1.1 CAS

To perform a compare and swap, a process p writes its old value to OLD[p] and its new

value to NEW[p]. This is used to pass the information to the ISR. Then cas is written to

both DONE[p] and STAT[p]. Writing cas to DONE[p] initializes the spinlock and writing

cas to STAT[p] generates an interrupt message to node 0, indicating a cas operation. Then

process p spins on DONE[p], waiting for the ISR to indicate that the operation is

complete.

When the ISR receives a cas operation it compares OLD[req] to cur, where req is the

node requesting the operation. If they are the same, the operation is successful. In this

case, cur is updated to the value of NEW[req], and succ is written to DONE[req]. If they

are different, then fail is written to DONE[req]. Either way process p is released from its

spinlock and returns the result of the operation, DONE[p].

4.1.2 Read

To read the current value of the variable, a process p must write read to both DONE[p]

and STAT[p]. Writing read to DONE[p] initializes the spinlock and writing read to

STAT[p] generates an interrupt message to node 0, indicating a read operation. Then

process p spins on DONE[p], waiting for the ISR to indicate that the operation is

complete.



When the ISR receives a read operation it simply writes the current value (cur) to

NEW[req] and writes succ to DONE[req], where req is the node requesting the operation.

The read procedure is thereby released from its spinlock and returns the value of NEW[p].
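The two handshakes above can be modeled in ordinary C. In this sketch the ISR is invoked as a direct call rather than by a SCRAMNet+ interrupt, so the spin loops terminate immediately; names mirror the algorithm's arrays, but the code is an illustrative model, not the real node-0 handler:

```c
#include <stdbool.h>

#define P 4
enum status { ST_READ, ST_CAS, ST_SUCC, ST_FAIL };

typedef int valtype;
static valtype OLD[P], NEW[P];        /* operation parameters per process */
static enum status STAT[P], DONE[P];  /* request type / result and spinlock */
static valtype cur;                   /* register value held by node 0 */

/* Model of the node-0 ISR; 'req' is the requesting process. */
static void isr(int req) {
    switch (STAT[req]) {
    case ST_CAS:
        if (OLD[req] != cur) { DONE[req] = ST_FAIL; }
        else { cur = NEW[req]; DONE[req] = ST_SUCC; }
        break;
    case ST_READ:
        NEW[req] = cur;
        DONE[req] = ST_SUCC;
        break;
    default:
        break;
    }
}

valtype Read(int p) {
    DONE[p] = ST_READ;
    STAT[p] = ST_READ;     /* in hardware, this write interrupts node 0 */
    isr(p);                /* real code spins: while (DONE[p] == ST_READ); */
    return NEW[p];
}

bool CAS(int p, valtype old_v, valtype new_v) {
    OLD[p] = old_v; NEW[p] = new_v;
    DONE[p] = ST_CAS;
    STAT[p] = ST_CAS;      /* interrupt-generating write */
    isr(p);                /* real code spins: while (DONE[p] == ST_CAS); */
    return DONE[p] == ST_SUCC;
}
```

Here a failed CAS leaves cur untouched, and a Read returns the new value of the last successful CAS, matching the semantics in Figure 17.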



Figure 18 Compare and swap algorithm

shared variable
    OLD, NEW: array[0..P-1] of valtype;
    STAT, DONE: array[0..P-1] of {read, cas, succ, fail}

interrupts
    writes to STAT interrupt node 0

private variable for node 0
    cur: valtype;
    req: 0..P-1

isr ISR0(addr: snaddr)
begin
    req := (addr - &STAT[0]) / 4;   /* Determine which process wrote STAT. */
    if 0 ≤ req ∧ req < P then
        case STAT[req] of
            cas:  if OLD[req] ≠ cur then DONE[req] := fail
                  else cur := NEW[req]; DONE[req] := succ
                  fi
            read: NEW[req] := cur; DONE[req] := succ
        esac
    fi
end

procedure Read() returns valtype
begin
    DONE[p] := read;
    STAT[p] := read;
    while DONE[p] = read do od;
    return NEW[p]
end

procedure CAS(old, new: valtype) returns boolean
begin
    OLD[p] := old; NEW[p] := new;
    DONE[p] := cas; STAT[p] := cas;
    while DONE[p] = cas do od;
    return DONE[p] = succ
end


4.1.3 Analysis

The space complexity of our algorithm is O(P), where P is the number of processes, because the arrays are all of size P. The time complexity of our algorithm is O(1) in

the absence of contention, since every access is directly to one element in each array. It is

also easy to see that our algorithm provides the correct semantics for both operations by

using the ISR to serialize requests for both operations.

It is tempting to implement the current value (cur) as a shared-memory variable, allowing

any process to simply read the location instead of using the ISR. However, this would

allow a node “downstream” to read an old value of the register after the ISR updates the

new value. This would be an improper serialization of the operations and violate the

semantics of Read and CAS.

One might also question why both the STAT and DONE arrays were not combined into

one array, say STAT2. The arrays were not combined because the ISR would generate an

unnecessary interrupt to itself whenever it writes succ or fail to STAT2[n]. Therefore, the

arrays were kept separate, avoiding unnecessary context switches and delays.

4.1.4 Experiments

The first experiment used a shared-memory counter incremented with the Read and CAS

operations to validate our algorithm. The counter was incremented by reading the current

value with Read(), incrementing that value, then updating it with CAS(). If CAS() failed, the sequence was repeated until it succeeded; the whole sequence counted as one iteration of the loop. Both nodes simultaneously ran 100,000,000 iterations, and the final value was verified to be twice that.
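The counter's increment follows the standard read/CAS retry pattern. The sketch below uses C11 atomics in place of our SCRAMNet+ Read and CAS operations to show the loop structure:

```c
#include <stdatomic.h>

static _Atomic long counter;  /* shared counter (one word of SCRAMNet+
                                 memory in the real experiment) */

/* One iteration in the experiment's sense: retry until CAS succeeds. */
void increment(void) {
    long old;
    do {
        old = atomic_load(&counter);                  /* Read() */
    } while (!atomic_compare_exchange_weak(&counter,  /* CAS() */
                                           &old, old + 1));
}
```

Under contention, a failed compare-exchange simply re-reads and retries, which is exactly the behavior counted as a single iteration above.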



Two more experiments were performed for each procedure. One experiment tested the

algorithm without contention and the other with contention. Each experiment simply

performed 100,000,000 iterations of each procedure tested and an average execution time

was calculated from the total execution time of the loop. The results were as follows.

Table 2 contains the results of the Read procedure both with and without contention.

Likewise, Table 3 contains the results for the CAS procedure. As with our mutual

exclusion algorithm, the maximum possible number of processes does not affect the

performance. Therefore both tables include the results for maximum number of processes

equal to 256 or one per node. However, increasing the actual number of nodes in the

network would affect the performance due to the increase in the round-trip transit time.

READ            NODE 0 W/ ISR    NODE 1 W/O ISR
NO CONTENTION   34.10            31.67
CONTENTION      39.25            38.11

Table 2 Average execution time for a Read operation (µs)

CAS             NODE 0 W/ ISR    NODE 1 W/O ISR
NO CONTENTION   37.38            34.00
CONTENTION      39.67            39.08

Table 3 Average execution time for a CAS operation (µs)

The results for the contention experiments were measured simultaneously by using a

barrier to synchronize the start of nodes 0 and 1. Therefore the average time for both

nodes to complete one operation is the higher of the two nodes, 39.25 microseconds for

Read() and 39.67 microseconds for CAS(). In contrast, the experiments without

contention calculate the average time for only one node to perform one operation.

Therefore, the averages with contention cover twice as many operations as the averages without contention, yet are nearly the same. This happens because



the ISR can read multiple requests (FIFO entries) within one context switch. This is

highly likely since the nodes are operating in parallel. This concurrency was not as

evident in the mutual exclusion experiments, due to the semantics of mutual exclusion.

That is, although the ISR may simultaneously process two Acquire requests, only one

will receive an immediate response.




4.2 Large Objects

The lock-free and wait-free constructions for large objects of Anderson and Moir from

[2] were simulated and evaluated by Filachek in [5]. This thesis furthers their work by

implementing the large object constructs on a SCRAMNet+ system using the CAS from

above.

The implementation of the large shared objects required several modifications to the code

from [5]. One of the modifications was due to the different models used. [5] uses a

thread-based model running on a multiprocessor machine. However, a SCRAMNet+ system uses a process-based model running on separate machines. This difference required

the addition of barriers to synchronize the initialization of the data structures between the

two nodes. A thread would just inherit such information from its parent.

All of the large object constructions in [2, 5] use the load-linked (LL), store-conditional (SC), and validate (VL) primitives, which were implemented using Read and CAS as described in [10]. Therefore, the LL, SC and VL primitives were modified to use our CAS algorithm from Section 4. However, these primitives use 64-bit values, which required our CAS and Read operations to be modified accordingly. This was achieved by doubling

the size of the OLD and NEW arrays, so they could be indexed as an array of long

integers.
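For orientation, the following sketch shows the general shape of layering LL, VL, and SC over Read and CAS. C11 atomics stand in for our primitives here, and this simplified value-comparison version is ABA-prone; the construction in [10] that we actually used avoids that problem:

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long X;            /* the variable the primitives implement */
static _Thread_local long ll_val; /* per-process snapshot from the last LL */

long LL(void) {                   /* load-linked: read and remember */
    ll_val = atomic_load(&X);
    return ll_val;
}

bool VL(void) {                   /* validate: has X changed since our LL? */
    return atomic_load(&X) == ll_val;
}

bool SC(long new_val) {           /* store-conditional via CAS; value
                                     comparison only, hence ABA-prone */
    long expect = ll_val;
    return atomic_compare_exchange_strong(&X, &expect, new_val);
}
```

An SC succeeds only if X still holds the value seen by the preceding LL; a successful SC by another process in between makes a stale SC fail.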

4.2.1 Experiments

Two experiments were performed for both the lock-free and wait-free implementations of

a FIFO queue. Each experiment performed 100,000 iterations of the Enqueue/Dequeue

pair of operations on a queue. The number of iterations for these last experiments is lower

than the previous sections due to time constraints. The SCRAMNet+ cards were

borrowed for a limited time and the higher execution times caused 100,000 iterations to

take at least a day. As before, one experiment tested the algorithms without contention; the other tested them with contention.



The most important result was from the experiments with contention, which validated the

correctness of the large object constructions. Checking the state of the queue throughout

the experiments validated the correctness. First, each node inserted its node number with

each enqueue operation and verified the number returned by each dequeue operation. A

valid number was either 0 or 1, since both nodes were operating in parallel. Second, a

barrier was used to detect when both nodes were finished, at which point the queue was checked to verify it was empty. Finally, the total number of dequeues for each node was verified

to be the same as the number of enqueues or iterations.

The lock-free contention results from Table 4 show that the algorithm may not be fair.

Node 0 has significantly shorter average execution times than node 1. Therefore, it may

be possible for one node to be starved. However, this is acceptable because of the

definition of lock-free, which is that some process will make progress in a finite number

of steps. The definition does not indicate which process should make progress and

therefore allows starvation.

LOCK-FREE       NODE 0 W/ ISR    NODE 1 W/O ISR
No Contention   1187.5           1375.3
Contention      1255.0           2583.1

Table 4 Average execution time for a pair of enqueue/dequeue operations for the lock-free construction of large objects (µs)

Just as before, the results for contention for node 0 and node 1 were measured

simultaneously. Therefore the total time for both to complete is the higher of the two

results, 2583.1 microseconds. This value is near the sum of the two nodes without

contention (1187.5 + 1375.3 = 2562.8 ≈ 2583.1). This indicates that the operations are not concurrent, as noted in both [2] and [5].

WAIT-FREE       NODE 0 W/ ISR    NODE 1 W/O ISR
No Contention   2438.6           2643.4
Contention      4393.3           3186.0

Table 5 Average execution time for a pair of enqueue/dequeue operations for the wait-free construction of large objects (µs)



The wait-free results are far more difficult to interpret. One would expect that, under contention, the execution times for both nodes would be nearly the same. However, since the interrupts occur on node 0, it may not be requesting operations as quickly as node 1. Therefore, it will finish later than node 1 as it completes the rest of its operations. We believe further experiments and analysis of systems with many nodes are necessary to explain this.

4.2.2 Conclusions and Future Work

There are two important conclusions from our CAS experiments. First, our CAS algorithm works, thus allowing us to construct both lock-free and wait-free large objects.

Second, the algorithm handles contention very well. However, more nodes are necessary

to solidify this conclusion. Comparing the results with and without contention also

indicates that the ISR operates more efficiently when heavily utilized. This is because it

can process more than one request within one context switch.

The most significant result was the implementation of both lock-free and wait-free large objects on a memory model as unique as SCRAMNet+. Although the execution

times are higher than hoped, the validation of the algorithm is significant. Further work in

improving the compare and swap algorithm, such as a dedicated polling node, may

increase performance. The LL, VL and SC primitives could also be implemented directly

in the ISR or other SCRAMNet+ memory to improve the performance further. The

benefits of lock-free and wait-free operations demand continued research in this area.


5 Simulation

Augmint is a software package on top of which multiprocessor memory hierarchy

simulators can be constructed for Intel Architecture specific platforms [1]. A simulator is

constructed by creating a test application with C/C++ and m4 macros supplied by

Augmint. The m4 macros are used to implement constructs such as locks, barriers,

semaphores, etc. The test application is augmented to generate events during memory

accesses. Events are also generated directly by the m4 macros. Each event has an

associated procedure in a library called the backend. By developing different backends,

different memory models can be simulated. As each thread runs in the simulation, its

execution time is calculated by the time spent to process each event in the backend and

by the time to execute each machine instruction.

The following sections provide an overview of Augmint. See [1, 16, 17, 21] for more

details. Augmint is composed of a compile-time and run-time component. The compile-

time component performs the code augmentation and the run-time component schedules

and executes the generated events.

5.1 Compile-Time

Application code is first written in C/C++ and m4 macros supplied by Augmint. A GNU

C compiler compiles the C file to generate 80x86 assembly instructions. Then a program,

called the Doctor, parses the assembly code for memory references and inserts code just

before each memory reference. This code calculates the address, size and value of the

memory reference and generates an event corresponding to the memory access. The code

also updates the thread’s time in processor cycles, to account for the execution of the



instructions leading up to the event. The Doctor determines the number of cycles from a

table of mnemonics and corresponding processor cycles found in the file

mnemonics.unix.x86. Finally the augmented code is linked to the Augmint and backend

libraries and is ready to run.

5.2 Run-Time

The run-time component consists of three parts: the application, Augmint, and the backend. The application is the user's C code, written with main() replaced by

appl_main(). Augmint is the main thread of execution for the simulation and manages

events, tasks, and threads. The backend is a library that executes the actual events.

When a simulation executable is run, Augmint executes first since it contains the actual

main(). Then Augmint schedules a task to switch to the main application thread by

calling appl_main(). The application code then executes as usual until an augmented code

region is executed. The augmented code causes an event and context switch back to the

Augmint thread. The Augmint thread creates and schedules a new task to process the

event.

5.2.1 Events

Events are generated directly from m4 macros or indirectly through memory references in

the application code. When a thread generates an event, the thread is blocked and a task is

scheduled to process the event. Each event has an associated procedure in the backend,

which is called when the task is scheduled to execute. The return value from this

procedure controls the execution of the thread that generated the event. A return value of

T_ADVANCE or T_CONTINUE allows the thread to continue. A return value of

T_FREE, T_YIELD or T_NO_ADVANCE leaves the thread blocked. The other

difference in the return values is how they affect the memory of a task, as described in

Section 5.2.3.



Each event is represented by a structure containing the process identifier (pid) of the

thread that generated the event, the time the event occurred and the type of event. When

an event returns T_ADVANCE, that event’s time is used to update the time of the thread

that generated the event. This allows the backend to arbitrarily delay a thread. This is

fundamental to any memory simulation, such as a cache, and is key for our

implementation of an ISR context switch (see Section 5.3.4.2). The event structure also

contains the address for use in Data Movement mode, which is described next.

5.2.2 Data Movement

Normally, Augmint performs a read and returns that value to the thread after a read event

returns from the backend. Likewise, Augmint writes the actual value of a write after a

write event returns from the backend. However, with Data Movement, the backend is

responsible for performing the actual reads and writes instead of Augmint. This is

achieved by passing the backend a pointer to the data accessed by the read or write

operation. On a read event, the backend writes the read value via the pointer and Augmint

returns that value to the thread. On a write event, the pointer references the value written

by the thread and the backend copies that value to the proper address, thus performing the

write.

The Data Movement option was key to implementing the Write-Me-Last mode (see

Section 5.3.3) and CSR registers (see Section 5.3.4.1). In general, our backends require

each thread to allocate its own copy of SCRAMNet+ memory. With Data Movement, our

backends control the values read and written from each copy of memory. In Write-Me-

Last mode, when a thread writes to SCRAMNet+ memory, its copy should change some

time after the write returns due to the transit time of the write message. Without Data

Movement the write would occur immediately after the backend returns. When the

Doctor is passed the –V option, it generates code for data movement; when this is used in conjunction with the –V command-line option to Augmint, Data Movement is enabled

(see Section B.1.1).
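A toy model of the idea: each node keeps its own copy of memory, and with Data Movement the backend, not the runtime, decides when a write becomes visible in each copy. Structure and names here are illustrative, not Augmint's actual interface; for simplicity every copy, including the writer's own (as in Write-Me-Last mode), sees the write one full ring-transit time later:

```c
#include <stddef.h>

#define NODES 2
#define WORDS 8
#define MAXQ  64

static int mem[NODES][WORDS];     /* per-node copies of shared memory */

struct pending { int node, addr, val; long apply_at; };
static struct pending q[MAXQ];
static size_t qn;

/* Backend write handler: schedule the value to appear in every copy
 * after the ring-transit delay instead of storing it immediately. */
void backend_write(int addr, int val, long now, long ring_time) {
    for (int n = 0; n < NODES; n++)
        q[qn++] = (struct pending){ n, addr, val, now + ring_time };
}

/* Apply all writes whose arrival time has passed. */
void backend_advance(long now) {
    for (size_t i = 0; i < qn; i++)
        if (q[i].apply_at >= 0 && q[i].apply_at <= now) {
            mem[q[i].node][q[i].addr] = q[i].val;
            q[i].apply_at = -1;   /* mark delivered */
        }
}
```

Between backend_write and the arrival time, every copy, including the writer's, still holds the old value, which is the behavior Data Movement makes possible.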

5.2.3 Tasks

To accommodate concurrent read and write operations, Augmint provides for scheduling

of arbitrary independent tasks. Each task has an associated structure containing a time,

priority, function pointer and pid. Tasks are added to a structure called the time wheel,

which orders the tasks by time and then by priority. Augmint executes the tasks in that

order by calling a task’s function pointer. In the case of an event, the function pointer

contains the appropriate backend procedure. As mentioned in Section 5.2.1, the return

value of a backend function controls the execution of the application thread, but also affects the memory of the task associated with the event. Return values of

T_ADVANCE, T_CONTINUE and T_FREE all free the memory associated with the

task’s structure. However, returning T_YIELD does not. This allows a task to be saved

and rescheduled later. We used this feature to implement our ISR context switch, as

discussed in Section 5.3.4.2.

5.2.4 Threads

Since Augmint uses a single thread of execution, an Augmint thread is just a passive

structure simulating an actual thread. The thread structure contains state and context

switch information used by an associated task that actually executes the code. Each time

the application code calls CREATE(), a new application thread is created, simulating a

fork(), and a task is created and scheduled to execute the thread. The thread structure also

contains the current time, which is updated at every context switch and return of each

event.



5.2.5 Backend

The backend is a customizable event execution library. Each event is implemented by a

procedure in the backend. For example, a shared-memory write is implemented by

sim_write() and a read by sim_read(). Augmint passes a pid value to each event in the

backend to indicate the thread that generated the event. Therefore, a thread and pid are

interchangeable when discussing the backend.

5.2.6 Execution

The execution of a simulation is based on Threads, Events, and Tasks. However, Tasks

perform all the work by executing the events associated with them. For example, when a

thread generates a read event, a task is created and the thread is blocked. When the task is

created it is assigned the thread’s pid and its function pointer is assigned to sim_read().

When that task reaches the front of the time wheel, Augmint calls sim_read() via the

task’s function pointer. If the event returns T_ADVANCE, Augmint reads the pid from

the current task and unblocks that thread. When a thread unblocks, its time is updated to

the time of the task and a context switch is made to begin executing the application code

until another event occurs. If sim_read() were to return T_FREE, the thread would

remain blocked until some task with the same pid returns T_ADVANCE. Therefore, a

task’s pid and associated event’s return values are used to control a thread.
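To make this control flow concrete, the following C sketch (our own illustration, not Augmint's actual source; the structure and function names are invented) shows how a task's return value is applied to the thread with the matching pid:

```c
#include <assert.h>

/* Event return codes as described above (values chosen for illustration). */
enum event_result { T_ADVANCE, T_FREE, T_NO_ADVANCE, T_YIELD };

struct sim_thread {
    int blocked;   /* 1 while the thread waits on a task */
    long time;     /* thread's simulated time */
};

/* Apply one task's result to its thread. */
void apply_result(struct sim_thread *t, enum event_result r, long task_time)
{
    switch (r) {
    case T_ADVANCE:        /* unblock and advance the thread's clock */
        t->blocked = 0;
        t->time = task_time;
        break;
    case T_FREE:           /* task is destroyed; thread stays blocked */
    case T_YIELD:          /* task is saved; thread stays blocked */
        t->blocked = 1;
        break;
    case T_NO_ADVANCE:     /* task ends without touching the thread */
        break;
    }
}
```

A thread therefore runs only between a T_ADVANCE and the next event it generates.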

5.3 SCRAMNet+ Backends

A backend can be written to simulate any given memory model. We wrote three

backends for the following SCRAMNet+ memory models: Write-Me-Last mode,

SCRAMNet+ interrupts, and SCRAMNet+ polling. The Write-Me-Last backend was used to

simulate Systran’s mutual exclusion algorithm, which must be in Write-Me-Last mode.

The interrupt backend was used to simulate our mutual exclusion algorithm, which

includes context switches between the application process and the ISR on node 0. Finally,

the polling backend was used to simulate a dedicated node polling the interrupt FIFO as


suggested in Section 3.5. Each of these backends uses the same techniques to implement

the basic SCRAMNet+ memory model.

5.3.1 Memory Model

There are three parameters common to all SCRAMNet+ memory models: the read access

time, the write access time and the transit time. Since the SCRAMNet+ card’s memory is

dual-ported RAM and is mapped into each process, it cannot be cached. Therefore, every

read and write must directly access the bus. The read and write access times represent the

time to access the bus and the time for the card to respond. The transit time represents the

time it takes a write message to propagate from one node to the next.

In our models each thread represents one processor. Therefore, the pid of the thread is

equivalent to its node number. To simulate SCRAMNet+, each thread uses the m4 macro

G_MALLOC() to allocate memory in the backend. The memory address returned is used

as the address of the SCRAMNet+ memory. This way each node reads and writes out of

its own SCRAMNet+ memory, just like on the real system. Execution of G_MALLOC()

generates a sim_shalloc() event in the backend. When sim_shalloc() executes, it stores

the newly allocated memory’s size and address in a memory map table indexed by the pid

of the thread that generated the event. This information is then used by sim_read() and

sim_write() as described next.

Whenever a thread reads a memory location, a sim_read() event is generated. The

memory size, memory address and the thread’s pid are passed to sim_read(). Sim_read()

first checks the memory map table to see if the address is in the SCRAMNet+ memory of

the pid. If it is not, then the value is immediately read and T_ADVANCE is returned in

order to unblock the thread. If it is, a new task, node_read(), is scheduled one read access

time after the current time and the thread is blocked by returning T_FREE. When

node_read() is scheduled to execute, it performs the read and returns T_ADVANCE,


thereby unblocking the thread. This simulates the delay of accessing the SCRAMNet+

card for a read.
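The sim_read() decision described above can be sketched as follows. This is an illustrative simplification, not the backend's actual code: the memory map table is reduced to one base/size pair per pid, and scheduling node_read() is reduced to a flag.

```c
#include <assert.h>

enum event_result { T_ADVANCE, T_FREE };

/* One entry of the memory map table, indexed by pid (simplified layout). */
struct mem_map { unsigned long base; unsigned long size; };

static int scheduled_node_read;   /* stands in for scheduling node_read() */

/* Ordinary memory: read at once and return T_ADVANCE.
 * SCRAMNet+ memory: schedule node_read() one read access time later
 * and block the thread by returning T_FREE. */
enum event_result sim_read(const struct mem_map *map, int pid,
                           unsigned long addr)
{
    const struct mem_map *m = &map[pid];
    if (addr < m->base || addr >= m->base + m->size)
        return T_ADVANCE;          /* not in this pid's SCRAMNet+ memory */
    scheduled_node_read = 1;       /* node_read() will perform the read */
    return T_FREE;                 /* thread blocks until node_read() runs */
}
```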

Whenever a thread writes a memory location, a sim_write() event is generated. The

memory size, memory address, memory value and the thread’s pid are passed to sim_write().

Sim_write() first checks the memory map to see if the address is in the SCRAMNet+

memory of the pid. If it is not, then the value is immediately written and T_ADVANCE

is returned to unblock the thread. Otherwise, a new task, issue_ring_write(), is

scheduled one write access time after the current time and the thread is blocked by

returning T_FREE. When issue_ring_write() is scheduled to execute, it unblocks the

thread by returning T_ADVANCE. This simulates the delay for writing to the

SCRAMNet+ card. Issue_ring_write() also starts propagating a write around the ring.

This is achieved by creating and scheduling a new task, node_write(). Since

issue_ring_write() unblocks the thread by returning T_ADVANCE, the thread may

proceed normally while the node_write() propagates the write around the ring.

Node_write() is passed the originating node, memory offset, destination node and value

of a write. When it executes, the SCRAMNet+ memory address is found in the memory

map table by indexing by the destination node. Then the value is written to the same

offset in the destination node’s memory. If the destination node is not the originating

node, then the destination node is incremented and another node task is scheduled one

transit time later. If the destination and originating node are the same, node_write()

simply ends by returning T_NO_ADVANCE.
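The ring propagation performed by node_write() can be sketched as below. NUM_NODES, node_write_step() and hops_around_ring() are hypothetical names chosen for illustration, and the actual write into the destination node's memory is elided.

```c
#include <assert.h>

#define NUM_NODES 4   /* assumed ring size for this sketch */

/* One node_write() execution: the write to the destination is performed
 * (elided here), then either the next hop is returned so it can be
 * scheduled one transit time later, or -1 is returned when the write has
 * come full circle (the T_NO_ADVANCE case). */
int node_write_step(int origin, int dest)
{
    if (dest == origin)
        return -1;                    /* back at the originating node: end */
    return (dest + 1) % NUM_NODES;    /* schedule next node_write() */
}

/* Walk a write all the way around the ring, counting node_write() runs. */
int hops_around_ring(int origin, int first_dest)
{
    int hops = 0, dest = first_dest;
    while (dest != -1) {
        dest = node_write_step(origin, dest);
        hops++;
    }
    return hops;
}
```

Starting one hop past the originating node, every node is written exactly once.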

One might suggest that normal memory reads and writes should be modeled with a cache.

However, Augmint only supplies an infinite cache model. Since our threads all run on

different processors, the cache would never be invalidated and would only waste


computation time. Also, most memory accesses in our algorithms are to SCRAMNet+

memory, therefore the added accuracy of a realistic cache was deemed unnecessary.

The Write-Me-Last, interrupt and polling backends are variations of the basic memory

model described above. Each backend is different in how the originating node and initial

delay are used by issue_ring_write() and node_write(). The interrupt and polling

backends also model the interrupt features of the SCRAMNet+ card.

5.3.2 User Events

All three models use the GEN_USER_EVENT macro to generate the sim_user() event in

the backend. We defined the first parameter of GEN_USER_EVENT to specify the type

of user_event() and the second parameter to pass in data such as a return pointer. The

GET_PID type of user_event() returns the pid used by the simulation and backend. The

GET_TIME type of user_event() returns the current simulation time in cycles. Both are

used for the timing and analysis of the simulations.

5.3.3 Write-Me-Last Backend

In the Write-Me-Last backend, issue_ring_write() assigns the destination node as the

originating node plus one and schedules the first node_write() one transit time after the

current time. This causes the originating node to be written last. The Data Movement

option is essential for the Write-Me-Last mode to work. Without this option, Augmint

would automatically perform the write to the originating node’s memory after the thread

is unblocked, making Write-Me-Last mode unachievable. However, with Data Movement

it is the responsibility of the backend to perform a write. Therefore, when

issue_ring_write() returns T_ADVANCE, the thread continues. However, subsequent

reads will return the old value until the node_write() for the originating node executes,

which is scheduled last.
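A minimal sketch of the Write-Me-Last scheduling rule follows, assuming a four-node ring and the 169-cycle transit time of Section 5.4.1; wml_schedule() is our illustrative name, not backend code.

```c
#include <assert.h>

#define NUM_NODES 4    /* assumed ring size for this sketch */
#define TRANSIT   169  /* transit time in cycles (Section 5.4.1) */

/* Write-Me-Last: issue_ring_write() starts the ring at origin + 1, one
 * transit time in the future, so the originating node is written last. */
void wml_schedule(int origin, int order[NUM_NODES], long times[NUM_NODES])
{
    int i, dest = (origin + 1) % NUM_NODES;
    for (i = 0; i < NUM_NODES; i++) {
        order[i] = dest;                      /* i-th node to be written */
        times[i] = (long)(i + 1) * TRANSIT;   /* hops one transit apart */
        dest = (dest + 1) % NUM_NODES;
    }
}
```

Until the final entry executes, reads on the originating node still return the old value.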


5.3.4 Interrupt Backend

In the interrupt backend, issue_ring_write() uses the originating node as the destination

node and schedules the first node_write() at the current time. This causes the write on the

originating node to occur immediately and all others one transit time apart. When the

backend propagates a write, it uses a thread’s pid as the node number. However, the

interrupt backend simulates the application thread for node 0 and the ISR thread as being on the

same node. This is achieved by assigning the ISR thread a pid of 0 and the application

thread for node 0 a pid of 1. Then the backend checks for writes propagating from pid 0

to pid 1. If this occurs, the node_write() for pid 1 is scheduled at the current time instead

of one transit time later. This causes the writes on the application thread and the ISR

thread to occur simultaneously. Both the interrupt and polling backends also simulate the

interrupt FIFO information on each SCRAMNet+ card.

5.3.4.1 Interrupt FIFO

As described in Section 2.2, each SCRAMNet+ card contains a FIFO of interrupt offsets.

The backend maintains a queue of memory offsets to simulate this FIFO. Access to the

interrupt FIFO is provided through CSRs (Control/Status Registers). The CSRs

are mapped into SCRAMNet+ memory above 0x80000. We modeled CSR access

exactly, such that a port of the ISR code would not require any major modifications.

Therefore, a thread must allocate 0x100000 bytes through G_MALLOC() to access these

registers.

CSR4 contains the 16 least significant bits of the interrupt offset at the top of the FIFO.

CSR5 contains the 8 most significant bits and a FIFO “not empty” status bit. To simulate

this, the node_read() event was modified in both the interrupt and polling backends to

check if the read address equals CSR4 or CSR5. If a read is from CSR5, the status of the

interrupt queue is checked. If it is empty, node_read() simply returns with the FIFO “not


empty” status bit cleared as the value of CSR5. If it is not empty, it dequeues an offset

and returns with the FIFO “not empty” status bit set and the 8 most significant bits of the

offset as the value of CSR5. The remaining 16 least significant bits of the interrupt offset

are stored in a static variable in the backend, which is returned by a subsequent read of

CSR4. The Data Movement option was also essential in implementing the CSR registers

by allowing the backend to control the return values of the CSR reads. Otherwise

Augmint would automatically calculate the value of a read.
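The CSR4/CSR5 encoding can be illustrated with the following bit manipulation; the position of the "not empty" status bit within CSR5 is an assumption made for illustration, as is the function naming.

```c
#include <assert.h>

#define FIFO_NOT_EMPTY 0x8000u   /* assumed position of the status bit */

/* CSR4: the 16 least significant bits of a 24-bit interrupt offset. */
unsigned short make_csr4(unsigned long offset)
{
    return (unsigned short)(offset & 0xFFFFu);
}

/* CSR5: the 8 most significant bits plus the FIFO "not empty" bit. */
unsigned short make_csr5(unsigned long offset, int fifo_not_empty)
{
    unsigned short v = (unsigned short)((offset >> 16) & 0xFFu);
    if (fifo_not_empty)
        v |= FIFO_NOT_EMPTY;
    return v;
}
```

A read of CSR5 with the status bit clear signals an empty FIFO; otherwise the low 16 bits are held for the subsequent CSR4 read.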

To simulate the interrupt FIFO, the interrupt and polling backends also modified

node_write() to check if the address written is configured to generate interrupts. If it is,

the offset of the write is put in the interrupt-offset queue. Node_write() must then

determine if it should generate an interrupt and create a context switch on node 0 from

the application thread to the ISR thread.

5.3.4.2 Context Switches

The interrupt backend is designed to simulate the execution of both the ISR and the

application thread on node 0. To implement this, the backend assumes that the ISR’s pid

is 0 and the node 0 application thread’s pid is 1. The backend then controls the execution

of each thread through its return values to the appropriate pid.

5.3.4.2.1 ISR Context Switch

First, the ISR thread must appear to be in an idle or blocked state and then it can be

awakened whenever an interrupt occurs. This is achieved by using the WAIT_FOR_ISR

type of user_event(). The ISR thread continually calls WAIT_FOR_ISR and checks its

return value. If the return is 0, the thread executes its ISR code. If the return is 1, the ISR

thread exits the loop and terminates. The first time WAIT_FOR_ISR is called, the

generated user_event() returns T_YIELD to block the ISR thread. However, returning


T_YIELD does not destroy the current task, which is stored in the backend and is

scheduled later to unblock the ISR thread.

The interrupt backend has one additional parameter, context switch time, which simulates

the time for node 0 to switch between the execution of the ISR thread and application

thread. When node_write() determines that there is an interrupt, it reschedules the saved

task at the current time plus one context switch time. The function pointer of the task is

also changed to execute_isr(). When the task is scheduled, it calls execute_isr() which

sets the return value of WAIT_FOR_ISR to 0 and returns T_ADVANCE. T_ADVANCE

unblocks the thread and the return value of 0 causes the ISR thread to execute.

The backend maintains an isr_flag to determine whether to generate an interrupt or not.

The isr_flag is set by execute_isr() when the ISR is started and is cleared by the

WAIT_FOR_ISR user event when the ISR finishes. If the isr_flag is not set when

node_write() executes, node_write() will enqueue the interrupt offset and schedule the

ISR to run. Otherwise, node_write() will only enqueue the interrupt offset. This simulates

the enabling and disabling of hardware interrupts that allows one ISR thread to process

more than one interrupt message at a time.
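The isr_flag logic can be sketched as follows; the interrupt-offset FIFO is reduced to a counter for brevity, and all names here are ours rather than the backend's.

```c
#include <assert.h>

struct isr_state {
    int isr_flag;    /* set while the ISR thread is executing */
    int queued;      /* interrupt offsets waiting in the FIFO */
    int isr_starts;  /* how many times the ISR was scheduled */
};

/* node_write() on an interrupt-configured address: always enqueue the
 * offset, but schedule the ISR only when no ISR is already running. */
void on_interrupt_write(struct isr_state *s)
{
    s->queued++;                /* offset always enters the FIFO */
    if (!s->isr_flag) {         /* no ISR running: schedule one */
        s->isr_flag = 1;
        s->isr_starts++;
    }                           /* else the running ISR will drain it */
}

/* WAIT_FOR_ISR clears the flag once the ISR has drained the FIFO. */
void isr_finish(struct isr_state *s)
{
    s->queued = 0;
    s->isr_flag = 0;
}
```

One ISR activation can thus service several queued interrupt offsets, mirroring the hardware's disabling of interrupts while the ISR runs.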

The END_ISR type of user_event() is used by all other threads to signal that they are

done. Once all the threads finish and execute END_ISR, the generated user_event()

schedules the task saved for the blocked ISR thread to execute end_isr(). When end_isr()

executes, it sets the WAIT_FOR_ISR return value to 1 and returns T_ADVANCE.

The T_ADVANCE unblocks the ISR thread and the return value of 1 terminates the ISR.

This was necessary to signal the ISR that the simulation was over, otherwise it would

loop forever.

5.3.4.2.2 Application Thread Context Switch


The application thread’s execution is controlled by checking the isr_flag at each

sim_read() and sim_write() event created by pid 1. The isr_flag indicates whether the ISR

is currently executing. If the ISR is executing, the application thread on node 0 should be

blocked so that both threads do not execute simultaneously. Therefore, if isr_flag is not

set, the event executes as usual. If isr_flag is set, the current task is saved and the event

returns T_YIELD to block the thread.
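The pid-1 check can be sketched as below; the names are illustrative and the saved-task bookkeeping is reduced to a flag.

```c
#include <assert.h>

enum event_result { T_ADVANCE, T_YIELD };

/* Events from node 0's application thread (pid 1) are deferred with
 * T_YIELD while the ISR (pid 0) is executing on the same node. */
enum event_result app_event(int pid, int isr_flag, int *task_saved)
{
    if (pid == 1 && isr_flag) {   /* ISR running on the same node */
        *task_saved = 1;          /* save the task to replay it later */
        return T_YIELD;           /* block the application thread */
    }
    return T_ADVANCE;             /* execute the event as usual */
}
```

Threads on other nodes are unaffected, since only pid 1 shares a processor with the ISR.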

Once the application thread has been blocked, the task saved for the thread is used to

unblock the thread once the ISR finishes. Since the ISR thread is in a loop, it calls the

WAIT_FOR_ISR event when it finishes servicing an interrupt. The generated user_event()

then reschedules the task saved for the application thread at the current time plus one

context switch. The user_event() does not modify the function pointer as the same event

is still desired. Because the application thread is blocked while the ISR is executing, it is

essentially delayed for the time of the ISR to execute plus one context switch time.

One might suggest that waiting for a read or write to block the application thread is not

accurate enough and that it should be blocked immediately. However, the threads only

perform local non-memory operations between each read and write and the operations are

transparent to the other threads. As long as the total delay is accounted for, the final

simulation will be accurate.

5.3.5 Polling Backend

The polling backend was designed to test a dedicated node that polled the interrupt FIFO,

from Section 3.5.3, instead of using interrupts. The polling backend is similar to the

interrupt backend except that it does not execute the ISR thread and the polling thread on

the same node. Because of this, the context switching capabilities were unnecessary and

removed. However, the WAIT_FOR_ISR technique was still used instead of actually

polling. It was necessary because otherwise it would be impossible for the ISR thread to


determine when to end. Although it is not a perfect simulation, the timing is accurate to

within one read access time and is sufficient for our purposes.

5.4 Simulation Parameters

The first goal of the simulations was to duplicate the results of our real-system

experiments from Section 3.4. To achieve this, four backend parameters were used to

match the results (see Section B.1.2). Since Augmint uses processor cycles as its unit of

time, all times were converted into cycles by multiplying them by our real system’s clock

rate of 266 Megahertz.

5.4.1 Transit Time

The first backend parameter is the transit time of messages between nodes. We measured

the total round-trip time of our two-node system as 1270 nanoseconds in Section 2.1.

The transit time between nodes is half of that because there were only two nodes.

Therefore 169 cycles or 635 nanoseconds was used as the transit time for our simulations.

5.4.2 Access Times

The second and third parameters are the read and write access times. Since we did not

have any experimental values for the access times of the SCRAMNet+ cards, these two

parameters were determined by matching against our experiments on the real systems.

We used experiments identical to those used in the real system and varied each

parameter. This analysis showed that the read and write access times affect the slopes of

the resulting graphs and the transit time only affects the offset of the graphs. Therefore,

we adjusted both the read and write access times so that the slopes of the real and

simulated experiments were similar. The value used was 266 cycles or 1 microsecond,

which we argue is reasonable for two reasons. First, [11] specifies the typical read and

write access times for PCI based SCRAMNet+ cards as 133 and 240 nanoseconds


respectively. However, such marketing materials tend to use best-case numbers. Second,

according to [18], the typical access time for a PCI device is approximately 2-4

microseconds. However, it was published in 1995 and there have been considerable

advances in PCI chipsets since then. Therefore, our choice of 1 microsecond is within a

reasonable range of these two numbers.

5.4.3 Context Switch Time

The fourth parameter, the context switch time, was derived mathematically from the

results of our real-system experiments without contention in Section 3.4.1. In these

results the timing difference between the nodes with and without the ISR was 3.8

microseconds. As described in Section 3.4.1, this corresponds to the fact that the node

without the ISR does not have to wait for the context switch when the ISR finishes, since

it is not on the same processor. Furthermore, the ISR node does not have to wait for any

transit times since it is on the same node as the ISR and we are not using Write-Me-Last

mode. The timing diagrams in Figure 19 correspond to the timing sequences of an

Acquire shown in Figure 5. Time flows from left to right and is of no particular scale.

Figure 19 Timing diagram of our algorithm’s acquire procedure without contention

[Figure 19: the ISR node’s timeline is context switch, ISR, context switch; the normal node’s timeline is transit time, context switch, ISR, transit time.]


The context switch time was derived from the timing diagrams and the following

calculation:

[ISR Node Time] – [Normal Node Time] = 22.8 – 19.0 = 3.8 μs

[(2 * Context Switch) + ISR] – [(2 * Transit Time) + Context Switch + ISR] = 3.8 μs

Context Switch = (2 * Transit Time) + 3.8 μs

Context Switch = (2 * 0.635 μs) + 3.8 μs

Context Switch ≈ 5.0 μs or 1330 cycles

5.5 Conclusions and Future Work

Implementing and comparing identical experiments to the real system in Section 3.5

allowed us to verify our models and continue testing with confidence. However, the main

advantage of the simulations is that any algorithm for SCRAMNet+ memory systems can

be implemented and tested. Therefore, future work should include porting and simulating

the compare-and-swap algorithm so that it can be studied under heavy contention. Finally, the

renaming algorithm from [9] should be implemented and tested.

Another product of the simulation was a closer understanding of the SCRAMNet+ card’s

operation. Although the SCRAMNet+ memory is 32-bit aligned, the CSRs are 16-

bit registers, and CSR4 and CSR5 are not contiguous. Combining these two registers

would allow one 32-bit read to get the interrupt FIFO information, eliminating a 1

microsecond access time delay. As mentioned in Section 2.2, SCRAMNet+ interrupts are

automatically disabled after the first interrupt until the ISR finishes. The ISR re-enables

them by writing to CSR0. This also adds a 1 microsecond access time delay. Designing

the card to automatically re-enable interrupts when the ISR reads the combined CSR4

and CSR5 registers and sees the FIFO empty would eliminate this delay. One might think


that redesigning the card is unreasonable, however Systran is currently developing a new

version of the SCRAMNet+ card.


6 Summary and Conclusions

We presented both blocking and non-blocking synchronization algorithms for

SCRAMNet+ systems. These algorithms were tested with both real-system experiments

and simulations.

First, we reviewed a mutual exclusion algorithm suggested by the manufacturer, Systran

Corp. After discussing its shortcomings, namely poor scalability and starvation, we

presented our own mutual exclusion algorithm, which exploits unique features of the

SCRAMNet+ hardware. Our results comparing the two algorithms indicate that our

algorithm has faster execution times, both with and without contention, regardless of the

size of the network. More importantly, our results demonstrate that our algorithm is more

scalable than Systran’s, and is fair unlike Systran’s algorithm. Our algorithm also has the

advantage that its design does not require the nodes to be prioritized, although one could

be provided if necessary by simply sorting the queue in the ISR. This would guarantee

that the critical section would be granted in order of priority. In contrast, our experiments

show that Systran’s algorithm cannot guarantee any prioritization.

Next, we presented non-blocking algorithms for SCRAMNet+ systems. First, we

designed and implemented a Compare and Swap algorithm. We used experiments on a

real system to test the algorithm. We then used this algorithm to implement lock-free and

wait-free constructions for large objects developed by Anderson and Moir. Experiments

were performed on both lock-free and wait-free implementations of a shared queue.


These experiments tested the algorithms and demonstrated that they could be

implemented on a memory architecture as unique as SCRAMNet+.

Unfortunately, the lack of hardware prevented extensive experiments with large networks

or with heavy contention. Therefore, we developed a simulator based on Augmint, which

allows modification of a library called the backend to implement different memory

models. We implemented three backends to simulate the Write-Me-Last, interrupt and

polling configurations of the mutual exclusion algorithms. We verified the simulation

against our real-system experiments and continued with experiments for large networks

and heavy contention. The simulations gave us insight into the timing of the SCRAMNet+

hardware, allowing us to suggest simple hardware changes to improve the performance of

the next hardware design that is currently under way. Most importantly, we have a solid

model on which to build other simulations.

Future work should include simulation of the non-blocking algorithms presented in this

paper. We also believe that both our mutual exclusion and CAS algorithms could be

implemented directly on SCRAMNet+ hardware. This would both eliminate costly

context switches and make the implementation transparent to the programmer. Currently,

the programmer must incorporate the algorithms into the ISR, which might already be

used by the programmer.


Appendix A

SCRAMNet+ Software

Both a driver and an API (Application Programmer’s Interface) library were developed to

implement the tests on real hardware.

A.1 SCRAMNet+ Driver

The SCRAMNet+ driver was developed for the RT-Mach operating system on 80x86

Intel architectures. The driver was built on the rk97a version of RT-Mach. The following

files were modified to configure the kernel to build the new driver:

./rtmach/src/mk/kernel/conf/i386/files

./rtmach/src/mk/kernel/conf/i386/MASTER

./rtmach/src/mk/kernel/conf/i386/MASTER.local

./rtmach/src/mk/kernel/i386at/autoconf.c

./rtmach/src/mk/kernel/i386at/conf.c

The following files were modified to change pcibus_read() and pcibus_write() from static

to global calls. They were needed by the SCRAMNet+ driver to configure the PCI FIFO

and interrupt registers.

./rtmach/src/mk/kernel/i386at/pcibus.c

./rtmach/src/mk/kernel/i386at/pcibus.h


The following files were modified to implement the SCRAMNet+ driver itself:

./rtmach/src/mk/kernel/i386at/scramnet.c

./rtmach/src/mk/kernel/i386at/scramnet.h

./rtmach/src/mk/kernel/i386at/scramnet_defs.h

./rtmach/src/mk/kernel/i386at/scramnet_ioctl.h

A.2 SCRAMNET+ API

Systran supplies an API to interface to SCRAMNet+ cards. We developed a library called

scrplus to interface to our driver with function prototypes identical to those of the

SCRAMNet+ library. A full description of the SCRAMNet+ library can be found in [13].

This way our test code could be easily ported to any existing operating system and

platform supported by Systran. We only implemented the functions necessary for our

testing as follows:

A.2.1 scr_mem_mm

Prototype: unsigned int scr_mem_mm(int arg)

This function maps or unmaps the SCRAMNet+ card’s memory to the API library. The

action is based on the values (MAP or UNMAP) passed for arg. A zero is returned on

success. After success, calling get_base_mem() will return the address of the

SCRAMNet+ card’s memory.

A.2.2 get_base_mem

Prototype: unsigned long int get_base_mem()


This function returns the address of the SCRAMNet+ card’s memory.

Scr_mem_mm(MAP) must be called before this function will return a valid value.

A.2.3 scr_csr_read

Prototype: unsigned short scr_csr_read(unsigned int csr_number)

The SCRAMNet+ cards use 16 Control/Status Registers (CSR) to configure and monitor

the status of the card. This function returns the value of the CSR register indicated by

csr_number.

A.2.4 scr_csr_write

Prototype: void scr_csr_write(unsigned int csr_number,

unsigned short value)

This function writes the value of value to the CSR number indicated by csr_number.

A.2.5 scr_id_mm

Prototype: void scr_id_mm(char *id, char *cnt)

This function assigns the node number to id and the total number of nodes in the network

to cnt. Valid values for both id and cnt are in the range 0-255.

A.2.6 scr_acr_read

Prototype: unsigned char scr_acr_read(unsigned long mem_loc)

As mentioned in Section 2.2, each 32-bit address has an associated memory location to

configure its interrupts. These locations are called Auxiliary Control RAM (ACR). This

function will return the ACR value associated with the address mem_loc.


A.2.7 scr_acr_write

Prototype: void scr_acr_write(unsigned long mem_loc,

unsigned char acr_val)

This function will write acr_val to the ACR register associated with the address

mem_loc.


Appendix B

Using the Simulators

This section contains instructions on how to use both of our simulators and how to

duplicate the results in this paper.

B.1 Syntax

Any program linked with the Augmint library will accept three sets of parameters: for

Augmint, the backend and the simulator. The syntax is as follows:

run [Augmint Parameters] -- [Backend Parameters] -- [Application Parameters]

Note that the sets of parameters are separated by double dashes, which are required even

if no parameters are used. The executables for our simulators, all named run, are

contained in the Augmint directory tree as described in Table 6.

Simulator          Directory
SCRAMNet+ Mutex    ./applications/scramnet
Interrupt Mutex    ./applications/interrupt
Polling Mutex      ./applications/polling

Table 6 Simulator executable directories


B.1.1 Augmint Parameters

B.1.1.1 -V

The -V parameter indicates that Augmint should use Data Movement as described in

Section 5.2.2. All of our simulations require the Data Movement option.

B.1.2 Backend Parameters

B.1.2.1 -n Xn

The -n parameter indicates the number of nodes in the simulation. The default value is

256, which is the physical limit of a SCRAMNet+ network.

B.1.2.2 -t Xt

The -t parameter indicates the transit time, in cycles, for a message to propagate from

one node to another. The default value is 169 cycles.

B.1.2.3 -r Xr

The -r parameter indicates the read delay, in cycles, used to simulate the host read access

time of the SCRAMNet+ card. The default value is 266 cycles (1 microsecond).

B.1.2.4 -w Xw

The -w parameter indicates the write delay, in cycles, used to simulate the host write

access time of the SCRAMNet+ card. The default value is 266 cycles (1 microsecond).

B.1.2.5 -c Xc

The -c parameter indicates the context switch time in cycles. The default value is 1330

(5 microseconds). The Write-Me-Last backend does not use this command line option.

B.1.3 Simulation Parameters

B.1.3.1 -PXp

The -P parameter indicates which node(s) are to participate in the test. The default value

is -1, indicating all nodes. Other valid values are nodes 1 through 256.


B.1.3.2 -NXn

The -N parameter indicates how many nodes there are. The default is 256. This number

should always be less than or equal to the number used with the -n backend option.

B.1.3.3 -MXn

The -M parameter indicates the total possible number of nodes in the system. The default

value is 256.

B.1.3.4 -CXi

The -C parameter indicates how many iterations the simulation should perform. The default value is 100.
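Putting these together, a run might look like the sketch below. This is hypothetical: the executable name (fcfs_sim) is a placeholder, and the convention of passing the Augmint, backend, and simulation parameters on a single command line is an assumption, not taken from the simulator's actual usage text.

```shell
# Hypothetical invocation: 64-node run with the default timing and 500 iterations.
# "fcfs_sim" is a placeholder; substitute the actual simulator executable.
#   -V                     Augmint Data Movement (required for all our simulations)
#   -n/-t/-r/-w/-c         backend: nodes, transit, read, write, context-switch cycles
#   -P/-N/-M/-C            simulation: participants, node count, max nodes, iterations
./fcfs_sim -V -n 64 -t 169 -r 266 -w 266 -c 1330 -P-1 -N64 -M256 -C500
```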

B.2 Experiments

Scripts were used to generate the results of our experiments. Table 7 lists the scripts used to generate the results in this paper. Directory paths are relative to the root of the Augmint directory tree. The figure column indicates which figures use the results of each script.

Directory                          Script           Figure
./applications/systran/results/    compare_nc_pid1  Figure 8 & Figure 9
./applications/systran/results/    compare_nc_pid2  Figure 8 & Figure 9
./applications/systran/results/    compare_c_all    Figure 10 & Figure 11
./applications/systran/results/    heavy_c_all      Figure 14, Figure 15 & Figure 16
./applications/interrupt/results/  compare_nc_pid1  Figure 8, Figure 9 & Figure 12
./applications/interrupt/results/  compare_nc_pid2  Figure 8, Figure 9 & Figure 12
./applications/interrupt/results/  compare_c_all    Figure 10, Figure 11 & Figure 13
./applications/interrupt/results/  heavy_c_all      Figure 14, Figure 15 & Figure 16
./applications/polling/results/    compare_nc_pid1  Figure 12
./applications/polling/results/    compare_nc_pid2  Figure 12
./applications/polling/results/    compare_c_all    Figure 13
./applications/polling/results/    heavy_c_all      Figure 14, Figure 15 & Figure 16

Table 7 Scripts to run simulation experiments


Bibliography

1. Augmint User’s Manual. Unpublished manuscript. http://iacoma.cs.uiuc.edu/iacoma/augmint/users-guide.ps

2. J. Anderson and M. Moir, “Universal Constructions for Large Objects”, Submitted to IEEE Transactions on Parallel and Distributed Systems, 1997.

3. G. Barnes. “A Method for Implementing Lock-Free Shared Data Structures”, Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures, 1993, pp. 261-270.

4. T. Bowman, “Shared-Memory Computing Architectures for Real-Time Simulation – Simplicity and Elegance”, Systran technical paper available from http://www.systran.com/scramnet.htm, January 1997.

5. C. Filachek, “Evaluation and Optimizations of Lock-Free and Wait-Free Universal Constructions for Large Objects”, Master’s Thesis, University of Pittsburgh, 1997.

6. M. Herlihy, “A Methodology for Implementing Highly Concurrent Data Objects”, ACM Transactions on Programming Languages and Systems, Vol. 15, No. 5, 1993, pp. 745-770.

7. M. Herlihy, “Transactional Memory: Architectural Support for Lock-Free Data Structures”, Proceedings of the 20th International Symposium on Computer Architecture, 1993, pp. 289-300.

8. M. Herlihy, “Wait-Free Synchronization”, ACM Transactions on Programming Languages and Systems, Vol. 13, No. 1, 1991, pp. 124-149.

9. S. Menke, M. Moir, and S. Ramamurthy, “Synchronization Primitives for SCRAMNet+ Systems”, Proceedings of the 16th Annual Symposium on the Principles of Distributed Computing, 1998, pp. 71-80.

10. M. Moir, “Practical Implementations of Non-Blocking Synchronization Primitives”, Proceedings of the 16th Annual ACM Symposium on the Principles of Distributed Computing, Santa Barbara, CA, August 1997, pp. 219-228.


11. “PCI/PMC Interface Overview”, Technical Note 131, Copyright 1996, Systran Corp.

12. Anthony-Trung Nguyen, Maged Michael, Arun Sharma, and Josep Torrellas. “The Augmint Multiprocessor Simulation Toolkit for Intel x86 Architectures.” Proceedings of the 1996 International Conference on Computer Design, October 1996.

13. “SCRAMNet Network PCI Bus Hardware Reference”, Document No. D-T-MR-PCI#####-A-0-A2, Copyright 1991, Systran Corp.

14. “SCRAMNet Network Programmer’s Reference Guide”, Document No. D-T-MR-PROGREF#-A-0-A6, Copyright 1997, Systran Corp.

15. “SCRAMNet VME Hardware Reference”, Document No. D-T-MR-VME#####-A-0-A2, Copyright 1994, Systran Corp., pp. F1-F2.

16. Arun Sharma, Augmint: A Multiprocessor Simulator. Master’s Thesis, University of Illinois at Urbana-Champaign, May 1996.

17. Arun Sharma, Anthony-Trung Nguyen, and Josep Torrellas. Augmint: A Multiprocessor Simulation Environment for Intel x86 Architectures. Center for Supercomputing Research and Development (CSRD) Technical Report 1463, March 1996.

18. Edward Solari and George Willse, PCI Hardware and Software, San Diego: Annabooks, March 1995, p. 434.

19. Systran Corp. World Wide Web Page. http://www.systran.com/scramnet.htm, January 1997.

20. Systran Corp. World Wide Web Page. http://www.systran.com/ftp/scramnet/snovervw.pdf

21. Jack E. Veenstra and Robert J. Fowler, MINT Tutorial and User Manual. Technical Report 452, University of Rochester, Computer Science Department, August 1994.
