virtuoso: a virtual single processor programming system for distributed real-time applications



ELSEVIER Microprocessing and Microprogramming 40 (1994) 103-115

Virtuoso: A virtual single processor programming system for distributed real-time applications

Eric Verhulst* Intelligent Systems International NV/SA, Lindestraat 9, B-3210 Linden, Belgium

Abstract

The Virtuoso programming system is a real-time programming framework that offers the same high level API across all target platforms from single processor to multiprocessor systems. A very fast nanokernel that manages a set of light context processes is at the heart of the system. These nanokernel processes combine the speed of an Interrupt Service Routine with the flexibility of a task and considerably reduce the latency of the application.

Key words: Virtuoso; Real-time; Parallel processing; Distributed processing; Portable software; Virtual single processor; Nanokernel; Microkernel; DSP

1. The challenge: keeping up with technology while avoiding the risks

Processor technology is changing very rapidly. While this opens new application possibilities, the system developer's decision on a processor type can have a dramatic impact on the life-cycle of the application. Competitive pressure favours the choice of the latest and best performing processors, implying the risk that any production delay from the processor's manufacturer can seriously jeopardize the application project. In addition, fundamental I/O bandwidth limitations force the designers to go parallel to reach a high level of performance. The solution is the use of a portable programming tool that spans all families of processors without any compromise on the processor specific features. The ideal development tool must not only provide for faster application development by giving the programmer a head start, but should also be future proof. The requirements can be split into three areas:
(1) A consistent high level API across all target processors;
(2) The utilities to debug and maintain the code;
(3) Target processor specific support.

* Tel.: +32 1662 1585, fax: +32 1662 1584, email: virtuoso@bix.com

0165-6074/94/$7.00 © 1994 Elsevier Science B.V. All rights reserved SSDI 0165-6074(93)E0083-8

104 E. Verhulst / Microprocessing and Microprogramming 40 (1994) 103-115

The first two items will be discussed in general terms, while this paper concentrates on the Virtuoso nanokernel, which was specifically designed for adequately supporting DSP target processors but has a number of interesting features that are more widely applicable. In particular any target system where hard real-time constraints have to be solved in a parallel processing context will benefit from the reduced latency.

2. The high level view: A portable set of services

2.1. A multi-tasking real-time kernel as the essential module

In embedded applications, the range of functions to be executed often varies widely. Often their execution must proceed within strict time intervals. Failure to do so is considered a system malfunction with consequences ranging from the benign to the catastrophic. A flexible solution is the use of a multi-tasking real-time kernel that manages the timely execution of the system functions by way of a priority driven preemptive scheduling algorithm. In this solution functions are mapped onto tasks as independent units of execution, while the timing requirements are mapped onto the relative priorities of the tasks. For example, if external events result in a task with a higher priority than the current one becoming runnable, the kernel will preempt the currently running task and start executing the higher priority one. This behavior also leads to a better usage of the CPU resources as it avoids any active waiting. Note that the mapping of timing constraints into priorities itself is not solved by the kernel but is left to the application developer. Other solutions exist, e.g. deadline schedulers that dynamically adapt the priority as a function of the next deadline of a task. Because of the complexity of the implementation and the associated overhead, especially in a parallel processing context, these solutions are rarely adopted. In most applications priority driven scheduling is sufficient to address periodic timing constraints. For handling aperiodic timing constraints, e.g. an exceptional event or situations like priority inversion, the Virtuoso system permits the developer to change the priority dynamically from within the application level.
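The priority driven preemptive rule described above can be sketched in C as follows. This is a minimal illustration and not Virtuoso code: the `task_t` structure, the convention that a lower number means a higher priority, and the functions `pick_highest` and `must_preempt` are all assumptions introduced for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task descriptor; not the Virtuoso data structure. */
typedef struct {
    int id;
    int priority;   /* assumption: lower value means higher priority */
    int runnable;   /* non-zero when the task is ready to run */
} task_t;

/* Return the index of the runnable task with the highest priority,
   or -1 if no task is runnable. On a tie the first one found wins. */
int pick_highest(const task_t *tasks, size_t n)
{
    int best = -1;
    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].runnable)
            continue;
        if (best < 0 || tasks[i].priority < tasks[best].priority)
            best = (int)i;
    }
    return best;
}

/* An event made task `woken` runnable. The running task `current`
   is preempted only if the woken task has a strictly higher priority. */
int must_preempt(const task_t *tasks, int current, int woken)
{
    assert(tasks != NULL && woken >= 0);
    if (current < 0)
        return 1;   /* nothing is running, so the woken task starts */
    return tasks[woken].priority < tasks[current].priority;
}
```

A kernel built along these lines would call `must_preempt` whenever an interrupt or a kernel service makes a task runnable, and switch context only when it returns non-zero.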

The actual implementation of a multi-tasking real-time kernel is most often done by executing the tasks under the supervision of a kernel or operating system. The essential activity of the kernel is to execute the priority driven scheduling algorithm and to save and restore the context of the tasks upon preemption. The scheduling is influenced by two major sources. On the one hand internal events (e.g. timer circuits) or external events (e.g. external interrupts) can preempt the currently running task. On the other hand, the tasks themselves can request services from the kernel (e.g. message exchange) that cannot be fulfilled immediately (e.g. because they require another task to reach a synchronization point). In a multiprocessor system, the preemption is also influenced by the requests generated on other processors. This means that the kernel on a multi-processor target must be designed to fulfill the real-time requirements at the system level. Hence a solution whereby single processor real-time kernels are connected using communication drivers is often not sufficient to handle hard real-time demands in a parallel processing environment.

The preemptive scheduling itself generates an overhead that often is measured in tens of microseconds. For DSP applications in particular this can be unacceptably high especially if the communication is also handled at this level. Therefore interprocessor communication of any sort must be handled with the same priority and reduced overhead used for handling interrupts.

In the next part we define the high level services offered by the Virtuoso microkernel. They are distinguished by fully distributed semantics. Adequate performance for DSP targets is obtained by the use of a lower level but very fast nanokernel that brings the overhead into the submicrosecond region. This is explained in the second part of this paper.

2.2. Classes of microkernel services

The Virtuoso programming system at the microkernel level provides the same API by way of a virtual single processor model independently of the number or type of interconnected processors that are actually being used. This covers targets from single 8-bit microcontrollers to multi 32-bit DSP systems.

The high level Virtuoso programming model is based on the concept of microkernel objects. In each class of objects, specific operations are allowed. The main objects are the tasks as these are the originators of all microkernel services. Each task has a (dynamic) priority and is implemented as a C function. Tasks coordinate using three types of objects: semaphores, mailboxes and FIFO queues. Semaphores are signalled by a task after a certain event has happened, while other tasks can wait on a semaphore to be signalled. To pass data from one task to another, the sending task can emit the data using a FIFO queue or use the more flexible mailboxes. While the first type provides for buffered communication, mailboxes always provide a synchronized service and permit the transfer of variable size data. Filtering can be performed on the desired sending or receiving task, the type of message and the size. A special direct copy service is also provided for asynchronous data transfers. Further services available with the Virtuoso microkernel are the protection of resources and the allocation of memory. The microkernel also uses timers to permit tasks to call a microkernel service with a time-out. Depending on the processor type, some microkernel calls also provide direct access to communication hardware and high precision timers. Finally, the C programmer has at his disposal a distributed standard I/O, graphics and runtime library of which some of the functions are executed by a server program on a host machine.

2.3. The object as the unit of distribution

In a traditional single processor real-time kernel, objects are identified most often by a pointer to an area in memory. This methodology cannot operate across the boundaries of a processor as a pointer is by definition a local object. Virtuoso solves this problem by a system-wide naming scheme that relates the object to a unique identifier. This identifier is composed of a node identifier part and an object identifier part. This enables the microkernel to distinguish between requested services that can be provided locally and those services that require the cooperation of a remote processor. As a result, any object can be moved anywhere in the network of processors without any changes to the application source code. A possible mapping of objects into a real topology is illustrated in Fig. 1. Note that each object, including semaphores, queues and mailboxes could reside as the only object on a node. In this context we emphasize that with Virtuoso the node identifier is nothing more than an attribute of the object.
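The naming scheme can be illustrated with a small C sketch. The paper states only that the identifier has a node part and an object part; the 16/16 bit split, the `obj_id_t` type and every function name below are assumptions made for the example, not the Virtuoso layout.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout for illustration only:
   upper 16 bits = node identifier, lower 16 bits = object identifier. */
typedef uint32_t obj_id_t;

obj_id_t make_id(uint16_t node, uint16_t object)
{
    return ((uint32_t)node << 16) | (uint32_t)object;
}

uint16_t id_node(obj_id_t id)   { return (uint16_t)(id >> 16); }
uint16_t id_object(obj_id_t id) { return (uint16_t)(id & 0xFFFFu); }

/* A service on this object can be provided locally only when the
   object lives on the node that executes the call. */
int is_local(obj_id_t id, uint16_t this_node)
{
    return id_node(id) == this_node;
}

/* Moving an object changes only its node attribute; the object part,
   and therefore the application's view of the object, stays the same. */
obj_id_t relocate(obj_id_t id, uint16_t new_node)
{
    obj_id_t moved = make_id(new_node, id_object(id));
    assert(id_object(moved) == id_object(id));
    return moved;
}
```

With such an encoding, a kernel would test `is_local` on every request and either serve it directly or hand it to the routing layer, which is why relocating an object requires no change to the application source.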

The transparent distributed operation would not work if the semantics of the microkernel services and their implementation were not fully distributed. This imposes a certain programming methodology. For example, global variables or pointers are allowed but the programmer must be aware that their use will result in non-scalable programs. Any task or data structure containing references to global pointers must be executed on the same processor. When common memory is used, the objects can be protected using a resource lock.

2.4. A multi-level approach for speed and flexibility

As any designer knows, a single tool or method cannot cover all of the different aspects of an application. In particular DSPs are increasingly used for signal processing and embedded control at the same time. This poses quite a challenge to the programming tool as it must handle timing constraints expressed in microseconds while offering the flexible semantics of a priority driven preemptive real-time kernel, the latter resulting in a much higher overhead. As a result traditional real-time kernels are rarely used for demanding DSP applications, forcing the designer to program in assembler at the interrupt level. Unfortunately, Interrupt Service Routines can be very hard to program, especially if several sources of interrupt must be handled. The reasons are manifold:
(1) With no adequate support from the hardware, interrupts are disabled when inside an ISR;
(2) An ISR must be executed until its end as it is a critical section (no wait state possible);
(3) Handling multiple interrupt sources can result in complex code that is difficult to maintain.

Fig. 1. Virtuoso conceptual model and a possible mapping into a real topology. [Legend: kernel object; processor node.]

Fig. 2. Multiple levels of context managed by the Virtuoso kernel. [Labels recovered from the figure: all registers (saved by microkernel); predefined subset of registers (saved by nanokernel); as many as ISR1 needs (saved by ISR1); as many as ISR0 needs (saved by ISR0).]

The better processors provide several priority levels for interrupt handling and permit interrupts to be nested, but it is still a hard programming job as any change of the system requirements can have effects on the temporal execution behavior of other ISRs.

The basic problem is that programming at the interrupt level provides little means for modularity. Hence, building complex multitasking applications is almost impossible. Real-time multitasking kernels on the other hand provide for modularity but impose an unacceptable overhead for handling fast interrupts at the task level.

The Virtuoso programming system solves this Gordian knot by providing an open multilevel system built around a very fast nanokernel managing a number of light tasks or processes (Fig. 2). The user can program his critical code at the level he needs to achieve the desired performance while keeping the benefits of the other levels. Internally, the Virtuoso kernel manages the processor context as a resource, only swapping and restoring the minimum of registers that is needed. The different levels are described below. ISR stands for Interrupt Service Routine. Note that only two ISR levels are discussed. This corresponds with the Virtuoso version on the Texas Instruments TMS320C30/C31 and C40 DSPs that have only one hardware level of interrupts. On some target processors (e.g. the Motorola DSPs), multilevel ISRs are supported (Fig. 3).

Fig. 3. Multi level support mechanism in Virtuoso. [Labels recovered from the figure: global interrupts enabled; global interrupts disabled; HW interrupt.]

2.4.1 Level 1: ISR0 level

This level normally only accepts interrupts from the hardware. Interrupts need only be disabled during ISR0 (less than 1 microsecond with Virtuoso on a C40). The developer can handle the interrupt completely at this level if required or pass it on to the higher levels. The latter is the recommended method as it leaves global interrupts enabled. The programmer himself is responsible for saving and restoring the context on the stack of the interrupted task.

2.4.2 Level 2: ISR1 level

The ISR1 level is invoked from within an ISR0. It is used for handling the interrupt with global interrupts enabled. An ISR1 routine must itself save and restore the context but permits interrupts to be nested. An ISR1 routine can often be replaced by a nanokernel process that is easier to program as it has all the characteristics of a task.

2.4.3 Level 3: The nanokernel level (process level)

The nanokernel level is composed of tasks with a reduced context, delivering a context switch in less than 1 microsecond on a C40. To distinguish them from the microkernel level, we have called the nanokernel tasks processes. Three types of primitives are available for synchronization and communication. Each process starts up and finishes as an assembly routine, can call C functions and leaves the interrupts enabled. Normally one will only write low level device drivers or very time critical code at this level. Because of the very low overhead and round-robin scheduling, it is also ideal for pure dataflow driven applications. One of the light tasks is the microkernel itself, which manages the (preemptive) scheduling of the microkernel tasks (see below).

Processes have the unique benefit of combining the ease of use of a task with the speed of an ISR. Besides the fact that they can communicate and synchronize, they can also be allowed to wait and deschedule, which is not feasible for an ISR. In a multi-processor system they play an essential role. Without the functionality of nanokernel processes, the designer has the option either to program at the ISR level and hence often disabling interrupts because of critical sections, or to program at the C task level but resulting in much increased latencies. The latter effect is worse in a multiprocessor system as interprocessor communication has to be handled as high priority interrupts because if not acted upon as fast as possible, it can delay the remote processor that is the source of the interrupt.

2.4.4 Level 4: The microkernel level (task level)

This is the standard C level with over 70 microkernel services. This level is fully preemptive and priority driven and each task has a complete context. It provides a high level framework for building the application as explained in the previous paragraph. Most real-time kernels only provide a single ISR level and the C task level as this is sufficient for supporting applications using standard microprocessors and microcontrollers. The latter is also the case for ports of Virtuoso to this class of processors. It must be noted that in general nanokernel services take about 10 to 20 times less time than the equivalent services at the microkernel level. This is not only due to the difference in number of registers that have to be swapped but mainly because of the much richer semantic context of the microkernel services (see below for details).

3. Description of the nanokernel

3.1. Nanokernel processes and channels

The nanokernel units of execution can be considered as light tasks, that is, tasks with a smaller context than the microkernel tasks. To avoid any confusion, the following terminology is introduced:
(1) Processes for designating the nanokernel tasks;
(2) Channels for designating the interprocess communication objects.

The light context results in a much smaller overhead when the nanokernel has to switch between two processes than at the microkernel level. The light context can be decomposed into several components:
(1) A small number of registers that have to be saved over a context switch;
(2) A minimum semantic context.

The small number of registers means that less time is needed to save and restore the context but also that the code is written in assembly, as programming in C necessitates saving the whole context (i.e. all registers).

The minimum semantic context means that the nanokernel processes are characterized by simple but very efficient services as compared with the microkernel level. The most important restrictions are:
(1) No preemption, but interruptible by interrupt service routines;
(2) Strict FIFO scheduling;
(3) All nanokernel services are synchronous;
(4) Only one waiting process allowed when synchronizing with another process;
(5) No time-outs;
(6) No distributed semantics.

As this results in an overhead reduction by a factor of at least 5, the restrictions are not important if the microkernel level is present and if nanokernel processes are used in an appropriate way.

3.2. Nanokernel channels

Nanokernel processes can synchronize using three types of channels. They are characterized by a pointer to a waiting process and the relevant data structure. The latter can be of the following types: (1) A counting semaphore; (2) A linked list; (3) A stack.

In assembly (TMS320C40), the channel types are described as follows:

* C_CHAN ; counting semaphore based channel
CH_PROC .set 0 ; waiting process or NULL
CH_NSIG .set 1 ; number of events

* L_CHAN ; linked list based channel
CH_PROC .set 0 ; waiting process or NULL
CH_BASE .set 1 ; base of stack
CH_NEXT .set 2 ; current stack pointer, ascending empty

3.3. Nanokernel services

The nanokernel processes have a simpler scheduling mechanism and set of services than the microkernel tasks. Nanokernel processes are never preempted by another nanokernel process (and hence are by definition critical sections with respect to other processes). Nanokernel processes only deschedule voluntarily upon issuing a kernel service. They are executed in pure FIFO order. Note however that they can be interrupted by an ISR level routine but will themselves preempt any microkernel task when becoming executable. Hence, one can consider the nanokernel level as a set of high priority processes while the microkernel tasks have a lower priority. Note that the microkernel itself is a nanokernel process. This facilitates its implementation as it is automatically a critical section.
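The FIFO, non-preemptive scheduling rule can be sketched as a simple ready queue. The following C code is an illustration under assumptions, not the Virtuoso nanokernel, which is written in assembly: the `ready_queue_t` type, the fixed bound `MAX_PROCS` and the `rq_*` function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 8   /* arbitrary bound chosen for the sketch */

/* Circular buffer of process identifiers, oldest first. */
typedef struct {
    int    slots[MAX_PROCS];
    size_t head;    /* index of the process that runs next */
    size_t count;   /* number of executable processes queued */
} ready_queue_t;

void rq_init(ready_queue_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* A process became executable: it runs after all earlier ones. */
void rq_put(ready_queue_t *q, int proc)
{
    assert(q->count < MAX_PROCS);
    q->slots[(q->head + q->count) % MAX_PROCS] = proc;
    q->count++;
}

/* Take the process that has been queued longest, or -1 when the queue
   is empty (the point at which a microkernel task could resume). */
int rq_get(ready_queue_t *q)
{
    int proc;
    if (q->count == 0)
        return -1;
    proc = q->slots[q->head];
    q->head = (q->head + 1) % MAX_PROCS;
    q->count--;
    return proc;
}

/* Voluntary yield: the current process goes to the back of the queue
   and the next process in line is returned. */
int rq_yield(ready_queue_t *q, int current)
{
    rq_put(q, current);
    return rq_get(q);
}
```

Because a process leaves the CPU only through calls such as `rq_yield`, no locking between processes is needed in this model, which mirrors the remark that each process is a critical section with respect to the others.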

Most nanokernel services are assembly routines. As parameters are passed using registers, no general syntax can be provided as it is processor dependent. As they start up and terminate in assembly, a good knowledge of the target processor is required.

The semantics of the nanokernel services are quite different from the microkernel services. We summarize the most noticeable differences:
(1) All services are local (i.e. no distributed execution);
(2) Nanokernel processes are scheduled FIFO;
(3) Only one waiting process is allowed;
(4) Each nanokernel process is a critical section with respect to other processes.

3.3.1 Process management

The actual syntax might be different depending on the target processor as the implementation depends on the registers and instructions available. The descriptions below are therefore generic.

init_process (void *stack, void entry (void), int param1, int param2, ...);
    /* Set up a nanokernel process */
    /* C function called from microkernel level */

start_process (void *stack);
    /* Starts up a nanokernel process */
    /* C function called from microkernel level */

nanok_yield
    /* Yield CPU to another nanokernel process */

3.3.2 ISR management

end_isr0
    /* Terminates a level 0 ISR */

end_isr1
    /* Terminate a level 1 ISR and awake nanokernel conditionally */

set_isr1
    /* Enter level 1 ISR */

3.3.3 Semaphore based services

prhi_sig
    /* Signal and increment a semaphore */

prhi_wait
    /* Wait on semaphore to be signalled */

3.3.4 Stack based services

prhi_psh
    /* Push data onto a stack */

prhi_popw
    /* Pop data from stack */
    /* Wait if stack empty */

prhi_pop
    /* Same as above but with no waiting */

3.3.5 Linked list based services

prhi_put
    /* Insert data onto linked list */

prhi_getw
    /* Remove data from linked list */
    /* Wait if list is empty */

prhi_get
    /* Same as above but with no waiting */
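To make the semantics of a semaphore based channel with a single waiting process more concrete, the following C sketch models the two operations. It is an assumption-laden model and not the assembly implementation: the `sem_chan_t` type and the `sem_chan_*` names are hypothetical, and the exact behavior of `prhi_sig` and `prhi_wait` on a given target may differ.

```c
#include <assert.h>

#define NO_PROC (-1)

/* Model of a counting semaphore channel: one optional waiting process
   and a count of events that have not yet been consumed. */
typedef struct {
    int waiter;   /* waiting process, or NO_PROC */
    int nsig;     /* number of pending events */
} sem_chan_t;

void sem_chan_init(sem_chan_t *ch)
{
    ch->waiter = NO_PROC;
    ch->nsig = 0;
}

/* Signal: hand the event to the waiting process if there is one,
   otherwise count it. Returns the process to make executable,
   or NO_PROC when nobody was waiting. */
int sem_chan_signal(sem_chan_t *ch)
{
    int woken = ch->waiter;
    if (woken != NO_PROC) {
        ch->waiter = NO_PROC;
        return woken;
    }
    ch->nsig++;
    return NO_PROC;
}

/* Wait: returns 1 when an event was pending and the caller continues,
   0 when the caller must deschedule until the channel is signalled. */
int sem_chan_wait(sem_chan_t *ch, int self)
{
    if (ch->nsig > 0) {
        ch->nsig--;
        return 1;
    }
    assert(ch->waiter == NO_PROC);   /* only one waiting process allowed */
    ch->waiter = self;
    return 0;
}
```

The single `waiter` field is what keeps the operation cheap: there is no queue of waiting processes to search or sort, at the price of the restriction listed above.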

3.4. Transitions between the Virtuoso kernel levels

As each level has its own set of services, an interface mechanism has been implemented that permits passing from one level to another. While it is perfectly possible that a lower priority level awakens a high priority one, from the application side one only needs to wait on the higher priority levels to have reached a point of synchronization. In fact whenever a microkernel task requests a service, the microkernel (running at higher priority) is made executable from within the task. Thus this mechanism is hidden from the user as it is part of the implementation.

Fig. 4. An execution trace of a hypothetical Virtuoso program segment.

The whole mechanism will be illustrated by showing the handling of an interrupt on which a microkernel task is waiting. In practice, one can use a subset to address the application.

An ISR0 routine, triggered by the interrupt, enters the ISR1 level by calling set_isr1. This will globally enable the interrupts and increment an interrupt counter. The ISR1 routine can signal a nanokernel semaphore and terminate. When a microkernel task was waiting on this signal (with the KS_EventW(IRQ) call) the microkernel will then become executable and decrement the interrupt counter. The waiting microkernel task is then made executable. This also shows that a nanokernel process can further process the interrupt independently without any microkernel task.
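The sequence just described can be followed step by step in a small state model. The C sketch below is a simulation written for illustration only: the `irq_sim_t` structure and the `sim_*` functions are hypothetical and merely mirror the order of events in the text; they are not the Virtuoso services themselves.

```c
#include <assert.h>

typedef struct {
    int interrupts_enabled; /* global interrupt enable flag */
    int irq_count;          /* interrupt counter incremented on ISR1 entry */
    int kernel_signalled;   /* set when ISR1 has signalled the microkernel */
    int task_waiting;       /* a microkernel task waits for the event */
    int task_executable;    /* the waiting task has been made executable */
} irq_sim_t;

void sim_init(irq_sim_t *s, int task_waiting)
{
    s->interrupts_enabled = 1;
    s->irq_count = 0;
    s->kernel_signalled = 0;
    s->task_waiting = task_waiting;
    s->task_executable = 0;
}

/* Hardware interrupt accepted: ISR0 runs with global interrupts disabled. */
void sim_isr0_entry(irq_sim_t *s)
{
    s->interrupts_enabled = 0;
}

/* ISR0 enters the ISR1 level: interrupts are enabled again and the
   interrupt counter is incremented. */
void sim_enter_isr1(irq_sim_t *s)
{
    assert(!s->interrupts_enabled);
    s->interrupts_enabled = 1;
    s->irq_count++;
}

/* ISR1 signals the semaphore on which the microkernel waits, then ends. */
void sim_isr1_signal(irq_sim_t *s)
{
    s->kernel_signalled = 1;
}

/* The microkernel runs: it consumes one counted interrupt and makes the
   waiting task executable. Returns 1 if a task was made executable. */
int sim_microkernel(irq_sim_t *s)
{
    if (!s->kernel_signalled || s->irq_count == 0)
        return 0;
    s->irq_count--;
    if (s->irq_count == 0)
        s->kernel_signalled = 0;
    if (!s->task_waiting)
        return 0;
    s->task_waiting = 0;
    s->task_executable = 1;
    return 1;
}
```

Calling the four `sim_*` steps in order reproduces the path from hardware interrupt to runnable task, and shows that global interrupts stay disabled only between `sim_isr0_entry` and `sim_enter_isr1`.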

3.5. An execution trace illustrated

Fig. 4 shows a hypothetical program segment that illustrates the interaction between the different levels in more detail. This example is inspired by the TMS320C40 multiprocessor version of Virtuoso. The order of magnitude of the timestep is indicated in microseconds. As can be seen, any lower level can preempt an activity executing at any higher level and has an effective higher priority.

The processor accepts an interrupt (1). This can be an external interrupt, a comport interrupt or a timer interrupt. The interrupt is passed on to a higher level for further processing. This disables interrupts for less than one microsecond on a C40.

An ISR enters ISR1 level (2). This can only be to further process an interrupt accepted by the ISR0 level. On processors with support for multi-level interrupts an ISR executing at level 1 can be preempted by an interrupt of a higher priority.

The microkernel is invoked (3). This can happen as a result of a signal coming from an ISR0, a microkernel service request from a task, a task entering a wait state or an event raised by an ISR1 or a nanokernel process. The microkernel is a nanokernel process that waits on any of these events to happen.

A nanokernel process is executed (4). In this example an ISR0 could have triggered a comport receiver driver that passed on the packet to a comport transmitter driver to forward the packet to another processor.

4. Virtuoso system support for debugging and maintaining code

Most of the tools that are available for the particular compiler/assembler and target processor can be reused without any problem. However it is also important to use tools that permit the definition of the different objects used in the application at a higher level without forcing the user to be aware of all the small details of the implementation. For this reason Virtuoso comes with a system generation utility. The objective of this utility is twofold.

First of all it helps the programmer to structure his application. In the system definition file, the programmer enumerates the microkernel objects with their symbolic names and with the required attributes. For example, he will tell the system that CONIDRV is the symbolic name for a task that has the C function conidrv() as entry point. This task has a stack size of 512 bytes, runs on processor 4, is part of the I/O taskgroup and requires no FPU. From here on the programmer only has to remember the symbolic name CONIDRV in order to use this task in his program. The system generation utility generates all include files automatically. These include files are the actual C source code that enables the microkernel and the application tasks to use the defined objects. The routing tables generated permit the definition of networks with heterogeneous types of comport drivers, support unidirectional links and exploit communication paths of the same length.

Secondly, the system generation utility is a maintenance tool as well. For example, when any of the attributes (e.g. the node identifier) is changed, the user only has to invoke the system generation utility to automatically generate a new executable program. Note that in the case of Virtuoso this also includes the routing of microkernel services as the application program's logic is independent of the actual underlying topology. As a result, the programmer can fully concentrate on his application.

During development, the use of a debugger is useful to identify programming errors. Therefore a task level debugger is provided. This debugger runs as a separate task that is only awakened when the user invokes it. As such, the debugger only uses some memory while the only impact on the system's performance is the saving of critical information by the microkernel. The latter operation can be disabled when the debugger is not being used. When invoked, the debugger suspends the application, including all tasks located on other processors. The user is presented with a menu that permits inspection of the status of all microkernel related objects, such as tasks, semaphores, queues, mailboxes, etc. The task level debugger works fully distributed, permitting the developer to jump from one processor to another.

A tracing monitor that is tightly integrated with the debugger permits tracing the execution history of the program. At each microkernel event the absolute time is recorded together with the active task, the time interval from the previous event and the event or requested microkernel service. On some processors the time unit is 100 ns, permitting a very precise verification of the timing and synchronization of the application. While the tracing activity takes up to a few microseconds for each entry, the user has the capability to selectively alter the level of details traced. When only the microkernel context switches are traced, the impact on the system is negligible.
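A trace entry of the kind described above could be kept in a fixed-size ring buffer. The C sketch below is a hypothetical layout: the field names, the buffer depth `TRACE_DEPTH` and the `trace_*` functions are assumptions for illustration and do not describe the format used by the Virtuoso tracing monitor.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_DEPTH 4   /* small depth chosen so wrap-around is easy to see */

/* One record per microkernel event, with the four items named in the text. */
typedef struct {
    uint32_t time;    /* absolute time in timer ticks */
    uint32_t delta;   /* ticks elapsed since the previous event */
    int      task;    /* task active when the event occurred */
    int      event;   /* event or requested microkernel service */
} trace_entry_t;

typedef struct {
    trace_entry_t entries[TRACE_DEPTH];
    unsigned      next;       /* slot that the next record will use */
    unsigned      stored;     /* valid records, at most TRACE_DEPTH */
    uint32_t      last_time;  /* time of the previous record, 0 at start */
} trace_t;

void trace_init(trace_t *t)
{
    t->next = 0;
    t->stored = 0;
    t->last_time = 0;
}

/* Record an event; the oldest record is overwritten when the buffer is full. */
void trace_record(trace_t *t, uint32_t now, int task, int event)
{
    trace_entry_t *e = &t->entries[t->next];
    assert(now >= t->last_time);
    e->time = now;
    e->delta = now - t->last_time;
    e->task = task;
    e->event = event;
    t->last_time = now;
    t->next = (t->next + 1) % TRACE_DEPTH;
    if (t->stored < TRACE_DEPTH)
        t->stored++;
}

/* Return the i-th most recent record (0 is the newest). */
const trace_entry_t *trace_recent(const trace_t *t, unsigned i)
{
    assert(i < t->stored);
    return &t->entries[(t->next + TRACE_DEPTH - 1 - i) % TRACE_DEPTH];
}
```

Storing the interval alongside the absolute time costs one extra word per record but lets a debugger display event spacing without recomputing it from neighbouring records.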

Finally the user can also ask the microkernel to provide him with the current CPU workload. The latter is useful for runtime load balancing.

5. Life cycle support across different technologies

One of the central ideas behind Virtuoso is processor and topology independence. This was achieved by the use of ANSI C and the virtual single processor model. While the use of a high level language together with the system generation utility separates the application from the physical target hardware, Virtuoso went further by applying similar principles to its own design. When Virtuoso is ported to a new processor target, only the processor specific operations such as task switching, interrupt handling and bootloading have to be redesigned as the major part is written in optimized ANSI C. Hence, Virtuoso has been ported to most popular processors, often in a few weeks. The result is that Virtuoso is available with the same API on 8-bit microcontrollers, standard 8-bit, 16-bit and 32-bit CISC and RISC microprocessors and word-oriented DSPs. The virtual single processor model also transparently supports systems with different types of interprocessor communication media (point to point links, common memory and even LANs). In this perspective systems can be built from different types of processors using each of them for a particular function. Virtuoso provides the glue that ties all processors together. The first version of Virtuoso with this functionality supports, in a fully transparent way, mixed networks of transputers and C40 DSPs.

A particular distinction can be made between system services and application specific compute tasks. Virtuoso follows this line of separation of functions by providing microkernel services on each node in the system, while typical standard I/O services are performed using a server task running on a host machine (e.g. a PC or UNIX workstation). The distributed protocol ensures that the programmer need not be aware of this separation while maximum performance is obtained for the computational tasks. The Virtuoso model permits cross development on the host in a first step (e.g. the PC), and then downloading of application specific tasks to the compute nodes in a second step.

For multi-processor targets, when the invoked service is located on the same processor as the calling task, there is not much difference with the single processor operation. When the microkernel detects that the required service requires the cooperation of another processor, a command packet is built that is passed on to the routing layer. The routing layer (implemented at the nanokernel level for minimum latency) handles all messages in order of priority while extra handshaking is performed to ensure the atomicity of the microkernel services. For maximum performance, the data is packetized and sent in parallel over all communication paths with the same minimum cost (the so-called 'fat links' technique). The hard real-time characteristics are guaranteed by using priority ordering at all times. Note that the packetizing is also necessary to minimize the effects of CPU monopolization during intensive communication phases of the application.
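The idea of packetizing a transfer and spreading the packets over all paths of equal minimum cost can be sketched as follows. This C code is only an illustration: the cost table, the round-robin assignment of packets to links and all function names are assumptions, since the paper does not specify how the router distributes packets.

```c
#include <assert.h>
#include <stddef.h>

/* Number of packets needed to carry `bytes` of data when each packet
   holds at most `packet_size` bytes of payload. */
size_t packet_count(size_t bytes, size_t packet_size)
{
    assert(packet_size > 0);
    return (bytes + packet_size - 1) / packet_size;
}

/* Collect in `out` the indices of all links whose cost to the destination
   equals the minimum cost. Returns the number of indices written. */
size_t min_cost_links(const int *cost, size_t nlinks, size_t *out)
{
    size_t n = 0;
    int best;
    if (nlinks == 0)
        return 0;
    best = cost[0];
    for (size_t i = 1; i < nlinks; i++)
        if (cost[i] < best)
            best = cost[i];
    for (size_t i = 0; i < nlinks; i++)
        if (cost[i] == best)
            out[n++] = i;
    return n;
}

/* Link used for packet number `seq` when the packets are spread
   round-robin (an assumption) over the n equal-cost links. */
size_t link_for_packet(const size_t *links, size_t n, size_t seq)
{
    assert(n > 0);
    return links[seq % n];
}
```

Small packets also bound the time any single transfer occupies a link, which is the property that lets higher priority packets overtake a long low priority transfer.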

The message based protocol of Virtuoso is powerful, partly because of its portability but also because of its simplicity and elegance. It is also very secure as it isolates all memory operations from one task to another. This is especially true for common memory systems where this kind of potentially erroneous operation is left to the kernel. As such, distributed as well as common memory systems can be supported in the same straightforward way. One of the kernel services permits a kind of virtual common memory operation by copying directly from a source to a destination pointer located on different processors. The data copy itself is an operation of the router and requires no intervention of the kernel except on the source or the destination processor. This provides for a minimum hop-delay and is less intrusive on the application.

Currently supported processors include INMOS T2xx, T4xx, T8xx, Texas Instruments TMS320C30/C31/C40, Motorola 68xxx, Intel 80x86 (real mode), Motorola 68HC11, 68HC16, MIPS R3000, Motorola 96002, 56002, Analog Devices 21020, AT&T 32C (port in progress). The INMOS transputer and C40 versions make it possible to build fully transparent heterogeneous applications.

6. Performance figures on multi-processor targets

It is essential to understand the impact of distributing the application across several processors. If, for example, a single processor has to handle 16 I/O channels, each requiring some control algorithm to be performed on it, it is clear that there is an upper limit to the number of samples that can be handled per unit of time. In the case of the INMOS T800, the theoretical upper limit can be estimated at 500 Ksamples/s. If however each sample needs an additional processing algorithm that takes 500 µs, the upper limit drops to around 1.5 Ksamples/s. With Virtuoso it is very simple to move the processing tasks to another processor with no or only minor changes to the source code. To make this approach viable, we need to be sure that the communication that is introduced has acceptable performance limits. We will illustrate this with a few examples. The processors used are INMOS transputers running at 25 MHz or C40s at 40 MHz.

Fig. 5. Virtuoso test set-up: 4 processors in a pipeline.

Fig. 6. Two-way synchronization benchmark. [Diagram: one task uses KS_Signal(S2)/KS_Wait(S1), the other KS_Wait(S2)/KS_Signal(S1).]

The first example is a dynamic workload balancing example running on hardware organized as in Fig. 5. This is not an optimal topology but it is ideal for testing worst-case overheads. The system consists of a pipeline of 5 tasks. The first task generates a data packet at regular intervals, which is processed by the remaining tasks. Each packet requires about 4 ms of processing time. The last task also counts the number of packets processed. With all five tasks on one processor (an INMOS transputer running at 25 MHz), this configuration has a maximum processing rate of around 220 packets/s. Virtuoso has a built-in workload monitor. The main control program is written in such a way that when the workload reaches 95%, it suspends one of the pipeline tasks on the first processor and starts up an identical copy of it on a secondary processor. To achieve this, only the KS_Suspend() and KS_Resume() kernel calls are required, as the tasks still use the same queues and the kernel takes care of redirecting the dataflow. When all five tasks have been moved to different secondary processors, we measure a maximum processing rate of 840 packets/s. Note that this example uses the distributed graphics server, which adds overhead to the system for displaying the workload graphically on the screen.
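The control decision described above can be sketched as follows. This is a hedged Python model, not Virtuoso code: the `Task` record and the `suspend`, `resume` and `rebalance` functions are hypothetical stand-ins for the workload monitor and for the KS_Suspend() and KS_Resume() kernel calls, whose real signatures are not given in the text.

```python
from dataclasses import dataclass

LOAD_THRESHOLD = 0.95  # the 95% workload level mentioned in the text


@dataclass
class Task:
    name: str
    processor: int
    running: bool


def suspend(task: Task) -> None:
    """Hypothetical stand-in for the KS_Suspend() kernel call."""
    task.running = False


def resume(task: Task) -> None:
    """Hypothetical stand-in for the KS_Resume() kernel call."""
    task.running = True


def rebalance(load: float, local: Task, remote_copy: Task) -> bool:
    """Move work off the loaded processor once the threshold is reached.

    Returns True if the task was migrated. The queues are not touched:
    both copies use the same queues, so no rewiring is modelled here.
    """
    if load >= LOAD_THRESHOLD and local.running:
        suspend(local)
        resume(remote_copy)
        return True
    return False
```

The measured rates quoted in the text, 840 packets/s distributed against 220 packets/s on one processor, correspond to a speed-up of about 3.8.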

The second example (Fig. 6) is a set of benchmarks running on the same hardware configuration. Again, this simple set-up is not an optimal topology, but it illustrates very well the effect of the hop-delay encountered in more complex networks.

Microkernel services for the C40 single processor version with no resulting context switch:

enqueue 1 word: 8 µs
signal semaphore: 10 µs

Microkernel services for the C40 single processor version with resulting context switch (measures 4 kernel services + two context switches):

signal/wait round-trip time: 27 µs
enqueue/dequeue round-trip time: 32 µs
send/receive message round-trip time: 36 µs

In the third example, two tasks synchronize bidirectionally with acknowledge, using KS_Signal(S2)/KS_Wait(S1) and KS_Wait(S2)/KS_Signal(S1) according to the diagram in Fig. 6. This test involves 4 microkernel calls, two context switches and a total of 6 network packets. Each microkernel call causes a context switch. Measuring the round-trip time in the first signalling task gives the following times.


Test                                   INMOS T800 (25 MHz)   TI TMS320C40 (40 MHz)
Both tasks local (0 hops)              146 µs                76 µs
Second task on processor 2 (1 hop)     264 µs                104 µs
Second task on processor 3 (2 hops)    372 µs                146 µs
Second task on processor 4 (3 hops)    480 µs                188 µs
'Hopdelay'                             54 µs                 21 µs
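The two-way synchronization pattern measured in this table can be reproduced in miniature on any desktop system. The sketch below is an analogy only: it uses Python threads and `threading.Semaphore` in place of Virtuoso tasks and the KS_Signal()/KS_Wait() services, and the names `first_task`, `second_task` and `run_benchmark` are made up for illustration. Timings obtained this way are not comparable with the figures in the table.

```python
import threading
import time

S1 = threading.Semaphore(0)
S2 = threading.Semaphore(0)
ROUND_TRIPS = 1000


def second_task() -> None:
    """Waits on S2 and acknowledges on S1, once per round trip."""
    for _ in range(ROUND_TRIPS):
        S2.acquire()   # plays the role of KS_Wait(S2)
        S1.release()   # plays the role of KS_Signal(S1)


def first_task() -> float:
    """Signals S2, waits for the acknowledge on S1, and returns the
    average round-trip time in seconds."""
    start = time.perf_counter()
    for _ in range(ROUND_TRIPS):
        S2.release()   # plays the role of KS_Signal(S2)
        S1.acquire()   # plays the role of KS_Wait(S1)
    return (time.perf_counter() - start) / ROUND_TRIPS


def run_benchmark() -> float:
    """Run both tasks to completion and return the average round trip."""
    peer = threading.Thread(target=second_task)
    peer.start()
    average = first_task()
    peer.join()
    return average
```

As in the text, the round trip is timed in the first signalling task, so each measurement covers two signal operations and two wait operations.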

It must be noted that the 'Hopdelay' effectively measures the minimum communication overhead for implementing a distributed microkernel service. It consists of the following operations needed for retransmitting a 52-byte command packet: (1) transfer using DMA; (2) raising of the interrupt when the DMA transfer terminates; (3) starting up the receiver process; (4) receiver calls the router function; (5) starting up the sender process; (6) starting up the DMA engine; (7) retransmission of the packet.
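The 'Hopdelay' row can be cross-checked against the round-trip times in the table. The short Python sketch below does this arithmetic. It rests on one assumption that the text does not state explicitly: each additional hop is crossed once in each direction per round trip, so the one-way delay is half of the round-trip increment.

```python
# Round-trip times in microseconds from the table, indexed by hop count 0..3.
ROUND_TRIP_US = {
    "T800 (25 MHz)": [146, 264, 372, 480],
    "C40 (40 MHz)": [76, 104, 146, 188],
}


def per_hop_increments(times: list) -> list:
    """Extra round-trip time added by each additional hop."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


def one_way_hop_delay(times: list) -> float:
    """Half of the increment between the two longest routes.

    Assumption: the extra hop is traversed once per direction, so the
    one-way cost is half of the round-trip increment.
    """
    return (times[-1] - times[-2]) / 2


for name, times in ROUND_TRIP_US.items():
    print(name, per_hop_increments(times), one_way_hop_delay(times))
```

Under this assumption the result is 54 µs for the T800 and 21 µs for the C40, matching the tabulated values. The first increment, from local to one hop, differs from the later ones on both processors.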

With the 21 µs hopdelay for the C40, this basically means that the system can handle about 50000 microkernel services/second (1/21 µs ≈ 47600 per second).

The following example, tested on a 40 MHz C40, illustrates the performance at the nanokernel level. Two processes successively signal and wait on each other using an intermediate counting semaphore (Signal - context switch - Wait - Signal - context switch - Wait). Round-trip time measured: 5775 nanoseconds, or less than one microsecond per operation. This time must be compared with the equivalent timing of 27 microseconds at the microkernel level. The other nanokernel service times are of the same order of magnitude.

The last example measures the data throughput rate between tasks on different processors:

On transputer: 4.6 MB/s between tasks on two different processors using 3 links. With one link the limit is 1.76 MB/s.

On TMS320C40: 29.7 MB/s between two 40 MHz C40s with two links. In this test the internal bus was actually saturated, as all data and code were on the same bus. With one link the limit is 17.6 MB/s.
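How well the throughput scales with the number of links follows directly from these figures. The Python sketch below only restates the measured numbers; the `link_scaling` helper and the notion of "fraction of linear scaling" are introduced here for illustration.

```python
# Measured throughput in MB/s from the text:
# (links used, aggregate rate, single-link rate).
THROUGHPUT = {
    "transputer": (3, 4.6, 1.76),
    "TMS320C40": (2, 29.7, 17.6),
}


def link_scaling(links: int, aggregate: float, single: float) -> tuple:
    """Return (speed-up over one link, fraction of ideal linear scaling)."""
    speedup = aggregate / single
    return speedup, speedup / links


for target, (links, aggregate, single) in THROUGHPUT.items():
    speedup, efficiency = link_scaling(links, aggregate, single)
    print(f"{target}: {speedup:.2f}x over one link, {efficiency:.0%} of linear")
```

The transputer reaches roughly 2.6 times the single-link rate on three links, and the C40 roughly 1.7 times on two links, the latter being limited by the saturated internal bus noted above.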

6.1. A note on the INMOS transputer version

The attentive reader will certainly have noticed a similarity between the Virtuoso nanokernel and the high priority FIFO queue of the transputer. On the transputer, the corresponding ISR0, ISR1 and nanokernel levels are in fact implemented mostly in hardware. In addition, interprocessor communication is reduced to a single instruction with automatic use of DMA. As a result, the development of Virtuoso was greatly simplified compared with the TMS320C40 version. The hardware support is also reflected in the timings. While the C40 presents a tenfold increase in link bandwidth, the set-up times are roughly equivalent if one takes account of the difference in clock frequency.

7. Conclusion

With the Virtuoso programming system, the developer uses a family of real-time kernels that provide an identical high level API across all processors. Thanks to the use of portable C, it shields the developer from technology changes during the life cycle of the application. Moreover, because Virtuoso is a fully distributed real-time kernel, it also provides a virtual single processor model that greatly facilitates the programming of parallel and distributed systems. For maximum performance, in particular on DSP targets, four-level support was implemented based on a nanokernel.



Eric Verhulst. Professional experience: 1981-1983: Coordinator between the Belgian Army and industry for the Improved HAWK conversion program. Commander of an Improved HAWK direct support unit. 1984: Head of the instruction section for opto-electronics at the Logistics School of the Belgian Army. 1985-1988: Analyst at the EDP center of the Army. Responsible for the computerized logistic system of the Army. 1988-1989: Researcher at the Signal and Image Processing Lab at the Royal Military Academy, Brussels. Specialised in parallel processing. Since June 1989: Full-time general manager of Intelligent Systems International. Since May 1986: Founding and start-up of Intelligent Systems International, a company specialising in parallel processing. This is backed by a broad range of activities, ranging from conceptual research and applications research to acquiring the necessary know-how and managerial skills. ISI was initially started as a part-time activity, but engaged in commercial activities from the beginning. In June 1989, this became a full-time activity. ISI became a limited company in December 1989. Its first product is TRANS-RTXC, an industry first: a truly distributed real-time kernel for the transputer and other processors. The current version is commercialised under the Virtuoso name.
