[ieee comput. soc 12th international workshop on rapid system protyping. rsp 2001 - monterey, ca,...

A Tool box to Map System Level Communications on HW/SW Architectures *

Denis Hommais, FrCdCric PCtrot, and Ivan AugC Equipe ASIMLIP6

Universitt Pierre et Marie Curie Paris. France

Abstract

We present a hardware and software tool box that allows to niap an initial specification done in term of tasks comniu- nicating through FIFOs onto a HWBW architecture.

Comrnunication templates are used to characterize the way data are exchanged between tasks realized either in software or in hardware. From these templates, we define the specifications -in terms of resources necessary to handle the communications- of a hardware niodule and a set of +

software drivers. These Inte f a c e Modules and their softvare counterparts

allow simple Virtual Component reuse since they not only implement protocol compatibility through the use of the VCI/OCB standard but also system level conimunications through semantics widely accepted in the design coinmu- n ity.

The intetjtiace modules have been used in the COSY project for the iniplementation of a video decoder by Philips starting from a system-level description and performing conimunication rejinement using Cadence VCC and Philips’ own conimunication schemes.

1. Introduction

Communicating sequential processes[ 1 ;2] are often used to describe the behavioral specifications of systems. The use of this formalism for the description of embedded systems is presented in [ 1, 3,4] and several HW/SW Codesign frameworks support this semantict5, 6,7].

The starting point of our work is therefore the specification of an application as a set of tasks communicating through FIFOs. Each task uses blocking primitives to read and write data through mono-directional, bounded, lossless channels organized as FIFOs. This model is an extention of the Kahn processes network[2] model, in the which the FIFOs are infinite (and therefore the writes non-blocking). Kahn processes networks have the property that the order of the data within each channel doesn’t depend on the order

of evaluation of the processes. Having bounded FIFOs pre- serves this property, but 1) makes the write primitive blocking and 2) may lead to deadlocks[8].

The application specification relies on classical read(channe1, buffer, n) and write(channe1, buffer, n) primitives, where channel is a point-to- point link identifier and buffer is an ordered set of n items. Since the primitives are blocking ones, the content of the buffer may not be altered during their execution. The process that writes into the FIFO is called the producer and the one that reads from the FIFO the consumer.

This paper deals with the mapping of the communications from such a specification to an actual hard- warekoftware implementation on the target architecture of Fig. 1. The architecture is symmetric multi-processor

........................................

I SMP Kemel with pthreads

-./ Coprocessor I Coprocessor I

-A-

Figure l . Typical target system

(SMP), and the software runs on top of an SMP kernel implementing the POSIX threads. The master and slave interfaces are ad-hoc hardware modules that implements hardware support for several communication types.

Based on a general FIFO model[9], we describe some representative communications templates to perform data exchanges efficiently, for a producer in SW or HW to a consumer in SW or HW. The templates are then analyzed to determine the respective responsibilities of the software and the hardware to actually perform the communication at _ _ minimal hardware cost.

‘This work has been funded in part by the COSY 25443 Esprit Project.

1074-6005/01 $10.00 0 2001 IEEE 77 ‘ I

2. FIFO Communications

FIFO communication means performing 4 actions: 1. Data transfer between both channels ends. The process

knows an upper bound on the number of items that can be transfered by reading the FIFO status, thus no lock of the FIFO data is necessary to perform the transfer,

2 . FIFO status shared variable update by the reader and the writer. This is a variable shared by the producer and the consumer, therefore concurrent accesses shall not occur,

3. Suspending a task when it cannot perform the required operation due to lack of full or empty FIFO slots,

4. Resuming a task when either data or slots are available allowing a blocked operation to resume. The way these actions are performed clearly depends on

the HW or SW nature of the processes and on their implementation. The task of communication synthesis as we define it consists in mapping a given channel onto one predefined communication template.

The general approach is inspired of the one presented in [ 10, 113, that however stays in the generality. Our contri- bution is to formalize more clearly what data transfers are made from in a multi-tasks, multi-processor environment, and to present practical templates for them.

The read and write primitives perform inter-process communications. They can be implemented using the POSIX threads API, and therefore be used as is either for the specification or for the implementation of the SW/SW communications on the target architecture.

3. Communication Mapping

In software, communication mapping is handled by the implementation of the read and write primitives on top of a multi-tasking kernel that supports the POSIX threads. Ac- cessing a FIFO buffer for reading or-writing is a simple memory copy. However updating its status must be performed under the control of a mutex that guarantees exclu- sive access to the variable. To access the shared variable, a mutex-lock must be performed. If the mutex is already locked, then the calling thread is blocked until i t is unlocked by an other thread using mu tex-unl ock.

If data (resp. FIFO slots) are not available, the task is suspended waiting for a condition to become true. Depend- ing on the hardware or software nature of the task it com- municates with, either conditions (cond with cond-wai t) or POSIX semaphores (sem with sem-wai t) are used. The condition is changed by the producer (resp. consumer) thread once it has accessed the FIFO with either cond-signal or semjost. The way the copy is performed depends on the template that is used to implement the communication.

In hardware, we have chosen to use a communication protocol as close as possible to the read and write primitives, because these are used in the task behavioral specification. This necessitate to define a hardware interface module that performs data transfer from a coprocessor that uses

the vector protocol presented in Fig. 2 through the interconnect and provides mechanisms to support synchronization between the producer and the consumer. This choice is dic- tated by the following reasons:

0 The tool that synthesizes the task as a coprocessor doesn’t have to know about the bus protocol, the mas- terness/slaveness of the VC, the interrupts handling, the system address map, ..., that are system and communication implementation related,

0 The user that wants to try other communications means, like an other bus standard, a switch, or even simply an other manner of using the communication resources doesn’t have to express these details in the behavior of the component. This allows easy exploration of the communication means without the need of modifying the be- haviors and resynthesizing the coprocessors,

0 Unlike a FIFO protocol in the which the items are pro- duced or consumed one by one, the vector protocol spec- ifies the number of items to be exchanged during a transfer, and this information can be used to optimize the transfer at system level.

CLK 1 REQ

LENGTH

DIN ( M) DI x x DZ x w )

STEP

DONE

Figure 2. Timing diagram of a vector transfer

This vectorprotocol is as follows. The coprocessor sets the number of items to transmit on the LENGTH lines and as- serts the REQ signal. This is the hardware counterpart of the call of the software primitive. The STEP signal vali- dates the data present on the DATA lines. When all items have been transfered, the DONE signal is asserted. This is the hardware counterpart of returning from the primitive. This protocol can be implemented with a two states FSM for each channel. To handle this protocol, the coprocessor must have an internal buffer. The absence of address (or index) lines avoids dependencies regarding the buffer implementation, that can be random access or not.

All the details and complexity of the communication protocols are hidden in a hardware interface module and its software drivers. The module has a VCI interface to guar- antee interconnect level compatibility, and can be either initiator or target of a VCI transfer.

VCI is a necessary but not sufficient condition to IP reuse, because at this level it is already necessary to have a common high-level communication setup between the VCs. One way of ensuring that is to define a set of communication templates and to build a programmable Hardware Interface Module to support them.

78

Figure 3. Mapping threads and communication onto an architecture

The Fig. 3 illustrates the mapping process. The tasks are mapped either on software -no processor is specified since it is an SMP- or on hardware. The communications are tagged with a predefined template that instanciates pre- existing components, depending on the peak throughput, la- tency, periodicity, burstiness, etc, of the transfer.

4. Communication Templates

We present a few distinctive communication templates to outline the resources needed by the Interface Module. All these templates share a common concern: ensuring proper data exchange while minimizing tasks context switches, bus usage and interrupt generation.

4.1. SW to SW

A channel for SW/SW communication is initialized usign the channelInitSS1 ( s l o t s , size), where s lo t s is the number of slots of the FIFO, and size the size of an item in bytes. For this template, the FIFO is in memory. This requires a kernel implementing the POSIX threads to run on the processors. As - almost - all current kernels are POSIX compliant, this is solely a compilation of the initial primitives for the target processor.

4.2. SW to HW and vice-versa

A channel between a SW producer and a HW consumer is created with channelInitSH1 (slots, size, addr, line), where the two first parameters are as above, a d d r is the physical address of the FIFO, address from which other FIFO resources can be known, such as the number of items in the physical FIFO, etc, and l i n e that is the

memory

Figure 4. Software/Software communication

index of the interruption line to be used to make one coprocessor aware of a transfer related information. The FIFO can either be in memory or in the interface. We present here the case where the FIFO is in the interface, and we assume that the producer is a SW task and the consumer a HW coprocessor. The opposite transaction is symmetrical.

The physical FIFO depth is not related to the size param- eter of the primitive. Data is split into chunks that can be accepted by the interface. If the FIFO is in the interface,

Figure 5. Software/Hardware communication

the status is implemented in the interface itself and updated by the hardware. It is automatically incremented when a datum is written by in the FIFO and decremented when the coprocessor reads a data using the vector protocol.

We need a mechanism that can stop the writing task when the FIFO is full. This allows other software tasks to run while the producing task waits for slots to be available in the FIFO. Conversely, when some data are consumed by the

~~

Procedure 1 SW/HW communication: Producer CHANNELWRITE(chunne l . buf; s i z e ) n = sire p = depth - s t a t u s for ever do

m = m i n ( n , p ) fori = 0 tom - 1 do

end for n = n - m if n = 0 then

break end if threshold = m i n ( m , depth) mask = mask V itline atomically sEM-WAIT(channe1)

f i f o = buf!fIi]

end for return size

coprocessor, we need a mechanism to wake up the produc-

79

ing task so it can be rescheduled by the system. In the full software implementation of the task, we use a semaphore to stop and restart the task. The writing task is in software, so it can execute a sem-wait to stop itself waiting for the consumer to empty the FIFO. This is illustated by the code of Procedure 1 . To wake up the task when slots are vacant, we need to execute a semjost. To do so, the interface module implements a programmable threshold that generates an interrupt when the FIFO contains at least as many empty slots as the threshold value. When the interrupt is raised, the handler executes a semqost. Note that the primitives used in the interrupt handler shall not block the task currently be- ing executed, as this could cause deadlocks. The interrupt handling routine is presented in Procedure 2.

Procedure 2 SW/HW communication: Interrupt handler THRESHOLDINTERRUPTHANDLER(channe1) mask = maskA -itline atomically SEM-POST(channe!)

4.3. HW to HW

The HW/HW template that we present here also has the FIFO in the interface, which is therefore a slave. We assume that the master is the producer and the slave the consumer for the sake of the explanation, but the template is completely symmetric.

1 Copro

Figure 6. Hardware/Hardware communication

To start a transfer, the CPU configures the master with the slave address and the length of the transfer. It also ask the master to emit an interrupt once the transfer is over. It then starts the master interface.

Let us now assume that the producer sends many data and that the consumer reads slowly. The simple solution is to have the slave emits wait acknowledges. This, however, waste bandwidth and may even lead to a timeout. To avoid timeout, one can issue retracts. However, the bus will be very congested then, since many masters will compete for the bus only to see if they can resume their transfers. This waste bus bandwidth for the other masters that have indeed data to transfer.

An other solution is to have the master check the number of empty slots and send only this amount of data. This needs an other master to fetch the FIFO status and clever hardware to build packets from them. This also need some way of

asking the master to poll the slave status, either using an internal timer or a CPU request. This clearly isn’t viable for high performance communications.

The last solution is to use split transactions. Either the bus standard supports it -AMBA does- or it doesn’t -like the PI-Bus- in which case we propose is a simple form of split transaction implemented in the bus wrappers. We index p the actors on the producer side and c the ones on the consumer side. The master, is coupled to a slave,, and the slave, is coupled to a master,. Master, tries to write into a full FIFO. Slave, issues a split acknowledge and master,, goes into a frozen state, waiting for a wakeup signal to become true. When the condition for which slave, sent a split is no longer true, master, writes into slave, at a specific address. This activates the wakeup signal, and master, can resume its transfer. Note that master, doesn’t know how many slots are available in the FIFO, but since the coprocessor on top of slave, makes vectorized read requests, the adequate number of data will be transfered at once.

This handshake is quite sophisticated, but it is necessary to both avoid the use of the CPU when a FIFO is full, and avoid wasting bandwidth.

5. Hardware Interface Architecture

Two considerations have been retained for the architecture of the interface module:

1. The module must be usable on target busses with different protocols and different size and timing constraints,

2. The module must be able to cope with different communication strategies,

3. The interface module must be modular to grow with the number of channel needed by the coprocessor plugged on top of it.

The Fig. 7 shows the internal architecture of the Inter- face Module with its bus wrappers. The interface module has two VCI interfaces. One is a target to act as a slave on the bus and one is an initiator for bus master access. Be- hind the VCI interface, three kinds of modules are needed to support the communication templates: a slave module, a master module and an interrupt source.

From the tasks point of view, a task has always the initiative of a data transfer, because the task call a primitive when it needs to read or write data. The vector protocol also follows this semantic. Between the coprocessor and the interface module, the coprocessor always has the initiative of the transfer, but its initiative is bounded to reading or writing a given internal FIFO of the interface module.

The slave module contains the FIFO of the communication channel. Depending on the direction of the channel, the coprocessor reads (or writes) in the FIFO using the vector protocol. The slave module FIFO can be accessed through the bus slave wrapper and the VCI interface, by any master on the bus.

80

I I

t IT

Figure 7. Hardware Interface Module architecture

Two other resources used by the communication templates are addressed in the slave module: the FIFO status register that give the number of filled slots in the FIFO and the threshold interrupt generator that raises an interrupt when the number of filled slots is equal to a programmed value.

Vector

Virtual Component

Figure 8. Implementation of the split transaction in the wrappers.

The master module is used when the internal FIFO is not the target of the primitive: the interface module must fetch or store data in a memory or in another interface module as a master on the system bus. The coprocessor does not have any knowledge on address map. The master module must generate addresses to access the memory or a FIFO in another interface module. To do so, the master module

provides a programmable address generator. This address generator can be configured by the software to address a FIFO in an interface module or a circular buffer, that is a FIFO in memory. The parameters of this address generator are the first address of the transfer, the address of the last item (for a circular buffer), the increment to add to the current address to get the next one, the total length of the transfer to access only a finite number of data. It can also generate, upon request, an interrupt to inform the processor that the transfer is over.

The support for the split transaction is the responsibility of the wrappers, as illustrated in Fig. 8, that uses the naming convention of the previous section.

The interrupt source receives the interrupt request of the other submodules to send only one request to the processor. When the processor receives the interrupt, i t reads a vector in the interrupt source. This vector is the one that corre- sponds to the active interrupt with the highest priority. A vector is a 32 bits register that can be initialized by the processor at boot time. In the interrupt source, the interrupts can be individually masked. The processor can get the status of each interrupt line and an acknowledge can be send to the interrupt requester when the vector is read.

6. Results

The Interface Module can be parameterized in term of number of submodules: up to 8 slave modules and up to 2 master modules. Each submodule can be parameterized independently in term of data size, FIFO depth and interrupt generation.

The Interface Module has been described in C for the cycle truehit true CASS[ 121 simulator. The choice of com- piled C simulation is motivated by the speed of execution, the ease of debug and the accuracy of the evaluation of each communication template.

For actual hardware, a generic synthesizable RT level VHDL model has been designed. The synthesis of a module with the following configuration : 2 slave FIFO (twice 22 slots) and 2 masters (twice 12 slots) needs 11000 gates, 0.7 mm2 in 18pm, and has a propagation time of 8ns.

The fairly automated level of communication mapping allows to quickly map a behavior onto an architecture.

The interface module and the software drivers have been used to implement a system realizing the decoding of a flow of jpeg pictures in real time. Using the high performance cycle true hardware software CASS simulator[ 131, a cross- compiler for the target processor and a POSIX kernel, it is easy to evaluate the performances of each possible implementation of the system (see Fig. 9).

Fig. 9 4 4 presents the task graph of the jpeg decoder and Figures 9.(b) to 9.(f) possible implementations. The vcims is the interface module subject of this paper, and the mw and sw are respectively VCUPI-Bus master wrappers and VCUPI-Bus slave wrappers. All other hardware modules are PI-Bus compatible.

81

,.... . ~ . . ~ . .....

Figure 9. Implementations of the JPEG specification

All the implementations have been simulated at the cycle level. As a result the full software implementation (a) has to run at 1.2 GHz whereas the mostly hardware implementation (f) has to run at 16 MHz to decode 25 images per seconds.

7. Conclusion

We have presented hardware and software components to be used to perform the mapping of a FIFO communication on a HW/SW architecture. A few templates based on these components have illustrated their practical use.

During the COSY project, these hardware modules have been used by Philips, with Philips own templates, as target for the communications mapping by Cadence VCC. We also have developed C simulation models for Philips TSS [ 141 simulator so to allow cycle true simulation of the resulting system.

This approach can be used for system with heteroge-

References

[ I ] C. A. R. Hoare. Ci~ninriinicuting Sequentiul Proce.wes. Prentice Hall, 1985. [2] Gilles Kahn. The semantics of a simple language for parallel programming. In

Injoriimtion Pri~cessing 74, pages 47 1475 . Stockolm. Sweden, August 1974. [3] Jean-Paul Calvez. EnrbeddedReul-The Swteni.s John Wiley & Sons, 1993. [41 Daniel D Gajrki, Frank Vahid, Sanjiv Narayan. and J i e Gong. Spec;ficrrri,,n

uiid Design of Embedded Sysrenis Prentice Hall, 1994. [ 5 ] E.A. Lee J.T. Buck, S. Ha and D.G. Messerschmitt Ptolemy: A framework

for simulating and prototyping heterogeneous systems. Intrrnuriiinul Jiiurnul uf Cuniputer Simulution. 4 . April 1994.

[6] F. Balann, E. Sentovich, M. Chiodo, P. Gusto. H. Hsieh. B. Tabbara, A. Ju- recska. L. Lavagno, C. Passerone. K. Suzuki. and A. Sangiovanni-Vincentelli. Hurdwure-Software Co-desrgn of Enibrdded Sv.vteni.\ - The POUS upprouch. Kluwer Academic Publishers, 1997.

Cadence adds system-level design tool to eda flow. EETimes, January 2000. http: //www.cadence.com/technology/ hwsw/ciertovcc/articles/.

[8] Thomas M. Parks. BoundedScheduling of Proces.\ Networks. PhD thesis, BSE Pnnceton University, 1995.

[9] J-Y. Brunel, W. M. Kruijtzer, H. 1. H. N. Kenter, F Pitrot. L. Pasquier, E. A. de Kock, and W. J. M. Smits. Cosy communication ip's. In 37th Design Au- roniarion Conference, pages Pages 406-409, Los Angeles, CA, June 2000.

[IO] Daniel D. Gajski. Frank Vahid. Sanjiv Narayan, and Jie Gong. Specificurion and D e s i p i~EnibeddrdS~.steni.s, chapter 8. pages 355-370 In [41, 1994

[7] Michael Santarini.

neOuS processor architectures, such as the Ones using a DSP. In that new communication have to be de- fined to communicate with the DSPs, and the task are then

[ I I 1 Asawaree Kalavade. System-Lrvrl Codesign i)/ Mixed Hurdwure-S@iure SKY- tems. PhD thesis, Electronics Research Laboratory, Berkeley. 1995.

[I?] Frederic Pitrot, Denis Hommais, and Alain Greiner. Cycle precise core based hardwarelsoftware system simulation with predictable event propagation. In rhr . . _ 23"' Euronricru Conference, pages 182-1 87, Budapest, Hungary, September 1997. IEEE. Denis Hommais and FrCderic Pitrot. Efficient combinational loops handling for cycle precise simulation of system on a chip. In 24th Eummicm. pages pages 51-54, Vesteras, Sweden, August 1998. IEEE.

[I41 Frans Theeuwen. System simulation at phiiips research. Talk at the MEDEA Workshop on System Simulation, May 1998.

considered, from the the general purpose processor point of to the hardware Ones: it is the DSP respon-

sibility to schedule multiple DSP mapped tasks, and when

coprocessor, it uses those new communication templates.

[ 131

such a task needs to exchange data with the processor or a

82

[ieee comput. soc 12th international workshop on rapid system protyping. rsp 2001 - monterey, ca,...

Documents