
University of Applied Sciences Berne (Biel School of Engineering)

VLSI System Design

Connection and Communication Models

Berner Fachhochschule

Hochschule für Technik und Informatik

Dr. Marcel Jacomet

v1.0, 28 May 2006

1. Connection and Communication Models

1.1. Interprocess Communication

Before starting to discuss the hardware connection architectures needed for data exchange and communication, the needs at higher abstraction levels are presented first. Any multitasking or distributed system needs a mechanism for communication between tasks or processes. In embedded computing systems the real-time operating system (RTOS) provides the interprocess communication mechanism as part of the process abstraction. In a system-on-chip, a process might be a piece of software running on a microprocessor, or it might be an algorithm implemented directly in a hardware unit (for example an FSM-D architecture model) due to high speed requirements. In the following text the term processing element (PE) is used to cover both implementation possibilities of a process.

1.1.1. Blocking vs. Non-Blocking Communication

In general, a process can initiate a communication in one of two ways:

• blocking
• non-blocking

After initiating a blocking communication, the initiating process goes into the waiting state until it receives a response. Non-blocking communication allows the initiating process to continue execution after sending the communication request. Both types of communication are useful as we will see in the following text.
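The following minimal C sketch contrasts the two styles; the primitives send_request(), response_ready() and read_response() are hypothetical, introduced only to make the control flow concrete.

extern void send_request(int pe, int msg);   /* hypothetical primitives */
extern int  response_ready(int pe);
extern int  read_response(int pe);

/* Blocking: the initiating process waits for the response. */
int blocking_call(int pe, int msg)
{
    send_request(pe, msg);
    while (!response_ready(pe))   /* waiting state until response arrives */
        ;
    return read_response(pe);
}

/* Non-blocking: the initiating process continues immediately and
   polls response_ready() later. */
void non_blocking_call(int pe, int msg)
{
    send_request(pe, msg);
}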

1.1.2. Shared Memory Communication Model

We distinguish between two major styles of interprocess communication:

• shared memory
• message passing

The two are logically equivalent – given one, you can build an interface that implements the other. However, some application programs may be easier to write using one rather than the other. Depending on the system-on-chip hardware platform, one may be easier to implement than the other. Figure 1 illustrates how shared memory communication works in the example of a bus-based system. Two processing elements communicate through a shared memory location. Both processing elements PE1 and PE2 have been designed to know the address of the shared location. If PE1 wants to send data to PE2, it writes to the shared location. PE2 then reads the data from that location.
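A minimal C sketch of this mechanism, assuming both processing elements see the same address space; the address and function names are illustrative only.

#define SHARED_ADDR 0x20000000u   /* hypothetical shared location */
volatile unsigned *shared = (volatile unsigned *)SHARED_ADDR;

void pe1_send(unsigned value)
{
    *shared = value;      /* PE1 writes to the shared location */
}

unsigned pe2_receive(void)
{
    return *shared;       /* PE2 reads the data from that location */
}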


Figure 1: Shared memory communication model implemented in a bus architecture.

Let us consider the situation of Figure 1, where two processing elements want to communicate through a shared memory block. There must be a mechanism to tell one processing element when the data from the other processing element are ready. This might be implemented by a flag, an additional shared data location. The flag may have the values “not busy” or “busy writing”, and others like “data ready”. Processing element PE1, for example, would write the data and then set the flag to “busy writing”. If the flag is used only by PE1, then the flag can be implemented using a standard memory write operation. If the same flag is used for bidirectional signaling between PE1 and PE2, care must be taken. Consider the following situation:

• PE1 reads the flag location and sees that it is “not busy”.
• PE2 reads the flag location and sees that it is “not busy”.
• PE1 sets the flag location to “busy writing” and writes data to the shared location.
• PE2 erroneously sets the flag to “busy writing” and overwrites the data left by PE1.

The above scenario is caused by a critical timing race between the two processing elements. To avoid such problems, the processing element bus must support an atomic test-and-set operation, which is available on a number of microprocessors. The test-and-set operation first reads a location and then sets it to a specified value. It returns the result of the test. If the location was already set, then the additional set has no effect but the test-and-set instruction returns a false result. If the location was not set, the instruction returns true and the location is in fact set. The bus supports this as an atomic operation that cannot be interrupted.
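Written as ordinary C for clarity, the semantics of test-and-set described above look as follows; keep in mind that on real hardware the whole body executes as one uninterruptible bus operation, which plain C code cannot by itself guarantee.

#include <stdbool.h>

/* Returns true if the location was acquired, false if it was already
   set - matching the convention used in the text above. The read and
   the write must execute atomically on the bus. */
bool test_and_set(volatile bool *location)
{
    bool was_set = *location;   /* test: read the current value */
    *location = true;           /* set the location in any case */
    return !was_set;            /* true only if it was not yet set */
}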

1.1.3. Semaphore

A semaphore is a language level synchronization construct. It can be implemented by a test-and-set operation. Let’s assume that the system provides one semaphore that is used to guard access to a block of protected memory. Any processing element that wants to access the memory must use the semaphore to ensure that no other process is actively using it.

Figure 2: Access to protected memory area by semaphores.

As shown in Figure 2, the semaphore operations are by tradition named P() (from the Dutch word proberen: to test), which gains access to the protected memory, and V() (from the Dutch word verhogen: to increment), which releases it. The P() operation uses a test-and-set to repeatedly test a location that holds a lock on the memory block. The P() operation does not exit until the lock is available. Once it is available, the test-and-set atomically sets the lock. Thereafter the processing element can work on the protected memory block. The V() operation resets the lock, allowing other processes access to the protected memory region by using the P() operation.

/* non-protected memory access is here */
P();   /* wait for semaphore (wait) */
…      /* do protected memory access here */
V();   /* release semaphore (signal) */
/* non-protected memory access is here */
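A minimal sketch of P() and V() built on the test_and_set() routine sketched earlier; the busy-wait loop is the simplest possible implementation and stands in for whatever waiting mechanism an RTOS would actually use.

static volatile bool lock = false;   /* false: free, true: held */

void P(void)   /* wait: spin until the lock is acquired */
{
    while (!test_and_set(&lock))
        ;      /* repeatedly test the lock location */
}

void V(void)   /* signal: release the lock */
{
    lock = false;
}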

1.1.4. Message Passing Communication Model

The message passing model complements the shared memory communication model. As shown in Figure 3, each communicating entity has its own message send/receive unit. The message is not stored on the communication link, but rather at the endpoints of the sender and receiver. In contrast, shared memory communication can be seen as a memory block used as a communication device, in which all the data are stored in the communication link (the shared memory).

Figure 3: Message passing communication model.

Applications in which units operate relatively autonomously are natural candidates for message passing communication. For example, a car control system has a microcontroller per device – gear unit, speed unit, left front light, back right window, and so on. The devices must communicate relatively infrequently; furthermore, their physical separation is large enough that we would not naturally think of them as sharing a central pool of memory. Passing communication packets among the devices is a natural way to describe the coordination between them. Message passing is the natural implementation of communication in many control applications that do not normally operate with external memory.
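A minimal C sketch of endpoint-buffered message passing between two processing elements; the mailbox type and the function names are illustrative only, and the non-blocking behavior is just one of several possible designs.

#include <string.h>
#include <stdbool.h>

typedef struct {
    unsigned char data[64];   /* payload stored at the endpoint */
    volatile bool full;       /* set by sender, cleared by receiver */
} mailbox_t;

bool send(mailbox_t *mb, const void *msg, unsigned len)
{
    if (mb->full)             /* previous message not yet consumed */
        return false;
    memcpy(mb->data, msg, len);
    mb->full = true;          /* signal the receiver */
    return true;
}

bool receive(mailbox_t *mb, void *msg, unsigned len)
{
    if (!mb->full)            /* nothing to receive yet */
        return false;
    memcpy(msg, mb->data, len);
    mb->full = false;         /* mailbox free for the next message */
    return true;
}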

The message passing communication model will be discussed in more detail in the client-server model chapter on page 32. Specifically, unbuffered and buffered message passing as well as reliable message passing methods are presented there.

1.1.5. Signals

The two major forms of interprocess communication mechanisms are shared memory and message passing. However, some operating systems (e.g., Unix) support another very simple communication mechanism – the signal. A signal is simple because it does not pass data beyond the existence of the signal itself. A signal is analogous to a hardware interrupt, but it can be implemented entirely in software. A software-implemented signal is generated by a process and transmitted to another process by the operating system. Some signals are used to abstract CPU exceptions to the operating system, some relate to operating system services, and several are ways to terminate a process. Signals normally have a fairly limited repertoire of functionality – it would be difficult to build a rich set of communication processes based on such “error” or “process termination” signals (example: kill(0,sigabrt) in Unix).
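As a small, standard POSIX illustration (not part of the lecture material), a process can install a handler and send a signal to itself:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_usr1(int signum)
{
    /* Only async-signal-safe calls belong in real handlers;
       printf is used here purely for illustration. */
    printf("received signal %d\n", signum);
}

int main(void)
{
    signal(SIGUSR1, on_usr1);    /* register the handler */
    kill(getpid(), SIGUSR1);     /* send the signal to this process */
    return 0;
}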


1.1.6. Data Dependencies in Communication

We almost always need to pass data among processes or processing elements. There are two distinct cases to consider: processes that communicate at the same data rate, and processes that communicate at different rates.

Figure 4: Communicating processes with identical data rates, resulting in data dependencies among processes.

When communicating processes run at the same data rate, we express their relationship using data dependencies (see Figure 4). Before a process or processing element can become ready, all the processes on which it depends must complete and send their data to it. The data dependencies define a partial ordering of process execution – P1 and P2 can execute in any order (or in interleaved fashion) but must complete before P3, and P3 must complete before P4. All processes must complete before the end of the period. The data dependencies must form a directed acyclic graph (DAG) – a cycle in the data dependencies is difficult to interpret in a periodically executed system.

A set of processes with data dependencies is known as a task graph. Although the terminology for elements of a task graph varies from author to author, we will consider a component of the task graph (a set of nodes connected by data dependencies) as a task and the complete graph as the task set.
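To make the partial ordering concrete, the following C sketch derives one legal execution order for the task graph of Figure 4 by topological sort (Kahn's algorithm); the dependency matrix and the run_task() launcher are illustrative only.

#define N 4
extern void run_task(int id);   /* hypothetical task launcher */

int dep[N][N] = {               /* dep[i][j] = 1: task i before task j */
    {0, 0, 1, 0},               /* P1 -> P3 */
    {0, 0, 1, 0},               /* P2 -> P3 */
    {0, 0, 0, 1},               /* P3 -> P4 */
    {0, 0, 0, 0},
};

void schedule(void)
{
    int indeg[N] = {0}, done[N] = {0};
    for (int i = 0; i < N; i++)          /* count incoming dependencies */
        for (int j = 0; j < N; j++)
            indeg[j] += dep[i][j];
    for (int k = 0; k < N; k++)          /* pick the N tasks in order */
        for (int i = 0; i < N; i++)
            if (!done[i] && indeg[i] == 0) {
                run_task(i);             /* all predecessors completed */
                done[i] = 1;
                for (int j = 0; j < N; j++)
                    indeg[j] -= dep[i][j];
                break;
            }
}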

Communication among processes that run at different rates cannot be represented by data dependencies, because there is no one-to-one relationship between the data coming out of the source process and the data going into the destination process. Nevertheless, communication among processes of different rates is very common. Figure 5 illustrates the communication required among three elements of an MPEG audio/video decoder. Data come into the decoder in the system format, which multiplexes audio and video data. The system decoder process demultiplexes the audio and video data and distributes them to the appropriate processes. Multirate communication is necessarily one-way – for example, the system process writes data to the video process, but a separate communication mechanism must be provided for communication from the video process back to the system process.

Figure 5: Communication among processes at different rates.

1.1.7. Deadlocks

In multitasking or distributed systems, several processes may compete for a finite number of resources. If resources are not available at the time of the request, the requesting process enters a wait state. It may happen that waiting processes never change state again, because the resources they have requested are held by other waiting processes. This situation is called a deadlock. Perhaps the best illustration of a deadlock can be drawn from a law passed by the Kansas legislature in the United States early in the twentieth century. It said, in part: “When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone.”

Deadlocks can be described more precisely in terms of a directed graph. A directed edge from process Pi to resource Rj (Pi -> Rj) signifies that process Pi has requested an instance of resource type Rj and is currently waiting for that resource. A directed edge from resource type Rj to process Pi (Rj -> Pi) signifies that an instance of resource type Rj has been allocated to process Pi. Given this definition of a resource allocation graph, it can easily be shown that, if the graph contains no cycles, then no process in the system is deadlocked. If, on the other hand, the graph contains a cycle, then a deadlock may exist. If each resource type has exactly one instance, then a cycle implies that a deadlock has occurred. On the other hand, if each resource type has several instances, then a cycle does not necessarily imply a deadlock. Figure 6 illustrates this concept with a deadlock example. If process P3 and its edges were not present, there would be no deadlock: process P2 would eventually release resource R1, as it does not have to wait for any requested resource, and this would break the cycle in the resource allocation graph. Suppose now that, in addition to all other requests and allocations, process P3 requests an instance of resource R2. Since no resource instance is currently available, this request brings all processes into a deadlock situation.

Figure 6: Resource allocation graph with a deadlock.
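A cycle in a resource allocation graph can be found mechanically; the following C sketch (illustrative only) detects one by depth-first search over an adjacency matrix in which processes and resources are numbered alike.

#define NODES 6
int edge[NODES][NODES];   /* request edges Pi -> Rj and allocation edges Rj -> Pi */

/* state: 0 = unvisited, 1 = on the current path, 2 = finished */
static int dfs(int v, int *state)
{
    state[v] = 1;
    for (int w = 0; w < NODES; w++)
        if (edge[v][w]) {
            if (state[w] == 1)                  /* back edge: cycle found */
                return 1;
            if (state[w] == 0 && dfs(w, state))
                return 1;
        }
    state[v] = 2;
    return 0;
}

int has_cycle(void)   /* returns 1 if a deadlock may exist */
{
    int state[NODES] = {0};
    for (int v = 0; v < NODES; v++)
        if (state[v] == 0 && dfs(v, state))
            return 1;
    return 0;
}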

In principle, there are four methods for dealing with the deadlock problem:

• Ignoring: We can ignore the problem altogether and pretend that deadlocks never occur in the system.

• Detection: We can allow the system to enter a deadlock state and then recover.

• Prevention: We can use a protocol to ensure that the system will never enter a deadlock state.

• Avoidance: We can avoid deadlocks by allocating resources carefully.

All four are potentially applicable to multitasking or distributed systems. Ignoring the deadlock problem entirely is very popular and is the solution used in most operating systems. Deadlock detection and recovery is also very popular, primarily because prevention and avoidance are so difficult. Deadlock avoidance is in fact never used in distributed systems, nor even in single-processor systems, as no practicable solutions are known. For more details on deadlock detection and prevention algorithms have a look at the literature, for example Distributed Operating Systems [4].


1.2. Basic Principles used in Communication

Several basic principles used in communication models are described in this chapter. Some of the described principles are not restricted to communication models but can also be found in many other domains. Bus arbitration algorithms as described in the following text, for example, are also used for process scheduling in real-time operating systems.

1.2.1. The Four-Cycle-Handshake

The basic building block of most communication protocols is the four-cycle handshake. The handshake ensures correct communication between two devices, signaling that one is ready to transmit and the other is ready to receive. The handshake uses a pair of dedicated wires: req (request for a data transfer) and ack (acknowledge). Let’s assume that device 1 always initiates the communication; device 1 is therefore the master device and device 2 is always the slave device responding to the master’s request (see Figure 7).

Figure 7: Handshake driven communication in fixed master/slave configuration.

The master device 1 is able to initiate a data write or a data read operation. Figure 8 illustrates both the data write and the data read communication initiated by the master device. If the communication is done in one direction only, the rw (read/write) signal is not necessary. Note that at the end of the handshake communication, both handshake signals must again be set to their inactive state.
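The master side of a write can be sketched in C as a polled sequence over the two handshake wires; the volatile variables stand in for physical wires, and a real implementation would of course be a hardware FSM rather than software.

volatile int req, ack;           /* the two dedicated handshake wires */
volatile unsigned data_bus;

void master_write(unsigned value)
{
    data_bus = value;            /* drive valid data onto the bus */
    req = 1;                     /* request phase: signal data valid */
    while (!ack)                 /* acknowledge phase: slave stores the data */
        ;
    req = 0;                     /* end phase: withdraw the request */
    while (ack)                  /* wait until ack returns to inactive */
        ;
}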


Figure 8: The four-cycle handshake, with request, store, acknowledge, and end phases on the signals rw, req, ack, and data. Write data transfer from device 1 to device 2 (top), read data transfer from device 2 to device 1 (bottom).

1.2.2. Bus Allocation and Bus Arbitration Algorithms

In complex systems-on-chip, different resources can be made available to potential users. Sharing resources is a common technique in hardware as well as in software algorithms. Typical examples of such sharable resources in a hardware system are memories, processing elements, input-output devices and buses. Let us call the units able to request resources masters, and the sharable resources slaves. Accessing a resource is done by some sort of communication between master and slave. Even the communication channel can be a sharable resource itself, so in a first step the requesting master unit needs access to the communication channel before it can access the resource itself.

The efficiency of the bus allocation system is determined mainly by the intended application of the bus system. In order to judge as simply as possible which bus systems are suitable for which applications, the literature includes a method of classifying bus allocation procedures. Generally we distinguish between the following classes:

• Allocation on a fixed time schedule. Allocation is made sequentially to each participant for a maximum duration, regardless of whether this participant needs the bus at this moment or not (examples: token slot or token passing).

• Bus allocation on the basis of need. The bus is allocated to one participant on the basis of outstanding transmission requests, i.e. the allocation system only considers participants wishing to transmit (examples: Carrier Sense Multiple Access, CSMA; Carrier Sense Multiple Access with Collision Detection, CSMA/CD; Carrier Sense Multiple Access with Arbitration on Message Priority, CSMA/AMP; flying master; round robin; bitwise arbitration).


Bus access can be further classified into centralized bus access control and decentralized bus access control depending on whether the control mechanisms are present in the system only once (centralized) or more than once (decentralized). A communication system with a designated node (inter alia for centralized bus access control) must provide a strategy to take effect in the event of a failure of the master node. This concept has the disadvantage that the strategy for failure management is difficult and costly to implement and also that the takeover of the central node by a redundant node can be very time-consuming.

For bus allocation on the basis of need, we can further distinguish between the following bus access methods:

• Non-destructive bus access. With methods of this type the bus is allocated to one and only one node, either immediately or within a specified time following a single bus access (by one or more nodes). This ensures that each bus access by one or more nodes leads to an unambiguous bus allocation (examples: token slot, token passing, round robin, bitwise arbitration).

• Destructive bus allocation. Simultaneous bus access by more than one node causes all transmission attempts to be aborted, and therefore there is no successful bus allocation. More than one bus access may be necessary in order to allocate the bus at all, the number of attempts before bus allocation succeeds being a purely statistical quantity (examples: CSMA/CD, Ethernet).

Bus Arbitration Algorithms: If several masters want to have access to a slave, a so-called arbitration method may be established to assign the communication channel to a master, as a sharable resource can be allocated to no more than one requesting unit at a time. Bus arbitration is the method of granting the bus to a requesting master. Different arbitration priority schemes can be used; the two most often implemented schemes are:

• rate-monotonic arbitration
• round robin arbitration

Fixed-priority or rate-monotonic arbitration is one of the oldest arbitration policies (and, as rate-monotonic scheduling, one of the oldest scheduling policies) and is still very widely used. It is a static arbitration algorithm because it assigns fixed priorities to the requesting masters. The arbiter grants the bus to the requesting master with the highest priority. It turns out that these fixed priorities are sufficient to schedule the bus access in many situations.

Round robin arbitration is a rotating priority scheme. It is a dynamic arbitration algorithm, as the priorities change dynamically after each new bus grant (see Figure 9). In the round robin algorithm, the requestor that was granted the bus most recently receives the lowest priority, the requestor next to it receives the highest priority, and the remaining requestors receive successively lower priorities based on their position. Due to the rotating priority scheme, the round robin algorithm is often called a fair arbitration scheme.
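A minimal C sketch of such a round-robin arbiter; the request encoding as a bit mask and the function names are illustrative.

#define MASTERS 4
static int last_granted = MASTERS - 1;   /* master 0 starts with highest priority */

/* req: bit i set means master i is requesting the bus.
   Returns the granted master, or -1 if nobody requests. */
int arbitrate(unsigned req)
{
    for (int i = 1; i <= MASTERS; i++) {
        int m = (last_granted + i) % MASTERS;   /* rotate the priority order */
        if (req & (1u << m)) {
            last_granted = m;                   /* m now gets lowest priority */
            return m;
        }
    }
    return -1;                                  /* bus stays idle */
}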

Figure 9: Round robin arbitration example.


1.3. The Bus-Based Architecture

1.3.1. Embedded Computing System

We concentrate on embedded computing platforms created using instruction-set architecture (ISA) processors or microprocessors (CPUs), application-specific processing elements implemented as FSM-D architectural models, I/O devices and memory components. The processor is an important element of the embedded computing system, but it cannot do its job without memories, I/O devices and the application-specific processing elements. We need to understand how to interconnect processors and devices using the processor bus. Luckily there are many similarities between the platforms required for different applications, so we can extract some generally useful interconnect principles by examining a few basic concepts.

1.3.2. Typical Processor Bus

The bus is the mechanism by which a processing element communicates with other processing elements, with memory and with devices. A bus is, at a minimum, a collection of wires, but the bus also defines a protocol by which the processing elements, memories and devices communicate.

Microprocessor buses are based on the handshake for communication between the CPU and the other system components. The fundamental bus operations are reading and writing. Figure 10 shows the structure of a typical bus that supports both read and write operations.

Figure 10: A typical microprocessor bus configuration.

The timing diagram for such a typical bus configuration is illustrated in Figure 11. Note that the handshake signals address enable (addr en) and data ready (data rdy) implement the four-cycle handshake protocol. The read/write signal (r/w) selects between the bus operations reading and writing. In a system-on-chip environment the system clock may be available to all sub-blocks, which is indicated by the clock line of the bus. By using the system clock, the data transfer speed on a bus can be increased, as the processing element and each attached device can react immediately on the active edge of a system-wide clock.


Figure 11: Timing diagram for a typical microprocessor bus: a read cycle with a wait state and a write cycle. The address and data buses may go tristate during their idle time.

Beyond the standard read and write operations, a burst transfer operation can also be implemented. In a burst transfer the processing element sends one address but receives a sequence of data values.

Some buses provide disconnected transfers. In these buses, the request and response are separated. A first operation requests the transfer. The bus can then be used for further operations. The transfer is completed later, when the data are ready.

Some buses have data bundles that are smaller than the natural word size of the processing element. Using fewer data lines reduces the cost of the chip. Such buses are easiest to design when the CPU is natively byte-addressable. A more complicated protocol hides the smaller data size from the instruction execution unit in the CPU: byte addresses are sent sequentially over the bus, receiving one byte at a time, and the bytes are assembled inside the CPU’s bus logic before being presented to the CPU proper.
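The byte-assembly step can be sketched as follows; bus_read_byte() is a hypothetical accessor standing for one 8-bit bus transfer, and little-endian assembly is assumed.

extern unsigned char bus_read_byte(unsigned addr);   /* one bus transfer */

unsigned read_word(unsigned addr)
{
    unsigned word = 0;
    for (int i = 0; i < 4; i++)        /* four sequential byte addresses */
        word |= (unsigned)bus_read_byte(addr + i) << (8 * i);
    return word;                       /* assembled in the bus logic */
}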

1.3.3. Processor Bus with DMA

In the typical processor bus described above, the processor is the bus master and all attached devices are slaves. In such a configuration, the processing element is in the middle of every read and write transaction. However, there are certain types of data transfers in which the processing element does not need to be involved. For example, a high-speed I/O device may want to transfer a block of data into memory. Such a situation may arise in an MP3 player when downloading compressed music files through a USB 2.0 I/O device directly to the player’s static memory. While it is certainly possible to write a tiny program for the processor that alternately reads the device and writes to memory, it would be much faster to eliminate the processor’s involvement and let the device and the memory communicate directly. This capability requires that some unit other than the processor be able to control operations on the bus.

Direct memory access (DMA) is a bus operation that allows both read and write operations not controlled by the processor. A DMA transfer is controlled by the DMA controller, which requests control of the bus from the processor. After gaining control, the DMA controller performs read and write operations directly between devices and memory without involvement of the processor.


Figure 12: A bus with DMA controller.

Figure 12 shows the configuration of a bus with a DMA controller. The DMA controller requires two additional control signals: bus request and bus grant. With the bus request, the DMA controller asks the processor for ownership of the bus. With the bus grant, the processor signals that the bus has been granted to the DMA controller. The processor will finish all pending bus transactions before granting control of the bus to the DMA controller. When it does grant the bus, it stops driving the other bus signals like r/w, addr en, and addr. Upon becoming bus master, the DMA controller has control of all these bus signals. Once the DMA controller is bus master, it can perform read and write operations using the same bus protocol as any processor-driven bus transaction. Memory and I/O devices do not know whether a data transfer is performed by the processor or by a DMA controller. After the transaction is finished, the DMA controller returns the bus to the processor by de-asserting the bus request, causing the processor to de-assert the bus grant. Note that systems with DMA controllers implement the handshake method for the data transfer and in addition use the handshake method to handle bus ownership.

The processor actually initiates a DMA transfer by setting the starting address and length registers appropriately and then writing the status register to set its start-transfer bit. After the DMA operation is complete, the DMA controller typically interrupts the processor to tell it that the transfer is done. During the DMA transfer, the processor is not allowed to use the bus. But the processor may still continue to do some useful work for quite some time if it has enough instructions and data in its cache memory and registers. However, to prevent the processor from idling too long, most DMA controllers implement modes that occupy the bus for only a few cycles at a time. For example, the transfer may be made 16, 32 or 256 words at a time. After each block transfer, the DMA controller returns control of the bus to the processor and goes to sleep for a preset period, after which it requests the bus again for the next block transfer.
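From the processor's point of view, starting a DMA transfer then looks roughly like the following C sketch; the register addresses and bit layout are invented for illustration, as they depend entirely on the DMA controller at hand.

#include <stdint.h>

#define DMA_START_ADDR (*(volatile uint32_t *)0x40000000u)  /* hypothetical map */
#define DMA_LENGTH     (*(volatile uint32_t *)0x40000004u)
#define DMA_STATUS     (*(volatile uint32_t *)0x40000008u)
#define DMA_START_BIT  (1u << 0)

void dma_transfer(uint32_t src, uint32_t nwords)
{
    DMA_START_ADDR = src;             /* set the starting address register */
    DMA_LENGTH     = nwords;          /* set the length register */
    DMA_STATUS    |= DMA_START_BIT;   /* set the start-transfer bit */
    /* The DMA controller now requests the bus; completion is
       typically signaled back to the processor by an interrupt. */
}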

1.3.4. System Bus Configuration

A bus configuration with a processor and an additional DMA controller already represents a more complicated system with two bus masters. A complex embedded system-on-chip often has more than one bus. Bus hierarchies are implemented to cope with system-on-chip requirements: different processing elements and devices have different speed requirements. As shown in Figure 13, high-speed devices may be connected to a high-performance bus, while lower-speed devices are connected to a low-speed bus. A small logic block known as a bridge allows the buses to connect to each other. Three reasons for introducing such system bus configurations are summarized below:

• High-speed buses may provide wider data connections.


• A high-speed bus usually requires more expensive circuitry and connectors. The cost of low-speed devices can be held down by using a lower-speed, lower-cost bus.

• The bridge may allow the buses to operate independently, thereby providing some parallelism in I/O and other data transfer operations.

A new element interfacing the two buses has to be introduced, the so-called bus bridge. The bridge acts as a slave on the fast bus and as a master on the slow bus. The bridge takes the commands from the fast bus, on which it is a slave, and issues those commands on the slow bus. It also returns the results from the slow bus to the fast bus – for example, it returns the results of a read on the slow bus to the fast bus.

Figure 13: Typical system bus configuration.

Let’s assume a single write operation handled by the bus bridge. The bridge first reads the data from the fast bus and sets up the handshake for the slow bus. Operations on the slow and fast bus of the bridge should be overlapped as much as possible to reduce the latency of bus-to-bus transfers.

The bridge serves as a protocol translator between the two buses as well. If the two buses are very close in protocol operation and speed, a simple state machine may be enough for the implementation. If there are larger differences in protocol and timing between the two buses, the bridge may need to use registers to hold some data values temporarily.

Currently there exists no system bus standardization. However, some company system bus definitions are widespread and freely available. A rather simple system bus architecture is the Avalon™ bus from Altera, used for their Nios™ processor. Furthermore, hierarchical bus architectures are the AMBA™ bus from ARM, used for their ARM™ processors, and the CoreConnect™ bus from IBM, used for their PowerPC processor families. The hierarchical bus systems AMBA™ and CoreConnect™ have very similar features and are widely used by other companies. In addition to the system bus specification, the companies provide bus model toolkits to assist in system-on-chip design.

These system buses define on-chip communication architectures and protocols for designing high-performance embedded systems-on-chip. The advantages of using such system bus “standards” are technology independence, the high reusability of peripherals and system macro-cells (processing elements) and, last but not least, the modular system design approach supported by standardized bus architectures. In the following text the AMBA bus architecture is presented as an example of the existing hierarchical bus systems used in sophisticated system-on-chip designs.

1.3.5. AMBA™ Bus Architecture

The AMBA bus supports CPUs, memories, and peripherals integrated in a system-on-chip environment. Similar to the typical system bus configuration shown in Figure 13, the AMBA bus architecture also includes two buses. The AMBA advanced high-performance bus (AHB) is optimized for high-speed transfers and is directly connected to processing elements, like CPUs. It supports several high-performance features:

• pipelined operation
• burst transfers
• split transactions
• multiple bus masters

The AMBA bus architecture offers a second high-performance bus, the AMBA advanced system bus (ASB). The ASB is very similar to the AHB, but does not support split transactions. A bridge can be used to connect the AHB or the ASB to the AMBA advanced peripheral bus (APB). The APB is designed to be simple and easy to implement; it is also supposed to consume relatively little power. The APB assumes that all peripherals act as slaves, simplifying the logic required in both the peripherals and the bus controller. It also does not perform pipelined operations, which simplifies the bus logic. The APB supports the following features:

• low power
• latched address and control
• simple interface
• suitable for many peripherals

The APB provides the basic peripheral macro-cell communications infrastructure as a secondary bus from the higher bandwidth pipelined main system bus. Such peripherals typically:

• have interfaces which are memory-mapped registers
• have no high-bandwidth interfaces
• are accessed under programmed control.

An AMBA-based microcontroller typically consists of a high-performance system backbone bus (AHB or ASB), able to sustain the external memory bandwidth, on which the processing elements (CPU, etc.), on-chip memory and other DMA devices reside. This bus provides a high-bandwidth interface between the elements that are involved in the majority of transfers. Also located on the high-performance bus is the aforementioned bridge to the lower-bandwidth APB, where most of the peripheral devices in the system are located (see Figure 14).

Figure 14: A typical AMBA-based embedded microcontroller system.

The external memory interface is application-specific and may only have a narrow data path, but may also support a test access mode which allows the internal AMBA AHB, ASB and APB modules to be tested in isolation with system-independent test sets.

The AMBA bus architecture may have several dozen up to a few hundred signals. Therefore a clear signal naming convention has been introduced. A lower case ‘n’ in the signal name indicates that the signal is active low; otherwise signal names are always all upper case. Test signals have the prefix ‘T’ regardless of the bus type. All signals belonging to the AHB bus are prefixed with the letter ‘H’, and all signals belonging to the APB bus are prefixed with the letter ‘P’. The ASB signals use different prefixes: all ASB signals are prefixed with ‘B’, with the exception of ‘A’, which prefixes the unidirectional signals between an ASB bus master and the arbiter, and ‘D’, which prefixes the unidirectional ASB decoder signals. In the following text only a subset of all AMBA signals is mentioned, as far as they are needed for the discussion of the AMBA bus functional operation. A complete specification can be found in the AMBA Specification (revision 2.0) documents from ARM [1].

1.3.6. Introduction to the AMBA™ AHB Bus

An AMBA AHB design may contain one or more bus masters; typically a system would contain at least a processor and a test interface. However, it would also be common for a DMA controller, a DSP (digital signal processor) or any general processing element to be included as a bus master. The external memory interface, the APB bridge and any internal memory are the most common AHB slaves. Any other peripheral in the system could also be included as an AHB slave; however, low-bandwidth peripherals typically reside on the APB. A typical AHB system design contains the following components (very similar for ASB systems):

• AHB master: A bus master is able to initiate read and write operations by providing an address and control information. Only one bus master is allowed to actively use the bus at any one time.

• AHB slave: A bus slave responds to read and write operations within a given address-space range. The bus slave signals back to the active master the success, failure or waiting of the data transfer.

• AHB arbiter: The bus arbiter ensures that only one bus master at a time is allowed to initiate data transfers. Even though the arbitration protocol is fixed, any arbitration algorithm, such as highest priority or fair access can be implemented depending on the application requirements.

• AHB decoder: The AHB decoder is used to decode the address of each transfer and provide a select signal for the slave that is involved in the transfer. A single centralized decoder is required in all AHB implementations.

The AMBA AHB bus protocol is designed to be used with a central multiplexor interconnection scheme. Using this scheme, all bus masters drive out the address and control signals indicating the transfer they wish to perform, and the arbiter determines which master has its address and control signals routed to all of the slaves. A central decoder is also required to control the read data and response signal multiplexor, which selects the appropriate signals from the slave that is involved in the transfer. Figure 15 illustrates the structure required to implement an AMBA AHB design with three masters and four slaves. The AMBA AHB has the following bus characteristics:

• single cycle bus master handover
• single clock edge operation
• non-tristate implementation
• wider data bus configurations (64/128 bits)


Figure 15: AMBA AHB bus interconnection scheme.

The AHB bus is a synchronous bus using a bus clock signal HCLK (signal source is the clock generator). An active low bus reset signal HRESETn (signal source is the reset controller) is used to reset the system and the bus; this is the only active low signal. Before an AMBA AHB transfer can commence, the bus master must be granted access to the bus. A granted bus master starts an AHB bus transfer by driving the address and control signals. These signals provide information on the address, direction and width of the transfer, as well as an indication if the transfer forms part of a burst. Two different forms of bursts are allowed, incrementing and wrapping bursts, the latter wrapping at particular address boundaries.

A write data bus is used to move data HWDATA[31:0] (signal source is the master) from the master (see Figure 16, AHB master interface) to a slave (see Figure 17, AHB slave interface); a read data bus is used to move data HRDATA[31:0] (signal source is the slave) from slave to master. The transfer direction is defined by the HWRITE (signal source is the master) control signal. Every transfer consists of an address and control cycle and one or more cycles for the data (see Figure 18). The address HADDR[31:0] (signal source is the master) cannot be extended and therefore all slaves must sample the address during this time. The data, however, can be extended by the slave using the HREADY signal (signal source is the AHB slave) or by the master (see the explanations below). It is recommended, but not mandatory, that slaves do not insert more than 16 wait states, to prevent any single access from locking the bus for a large number of clock cycles.

Figure 16: AHB master interface.

Figure 17: AHB slave interface.


Figure 18: Basic data transfer on AMBA AHB bus (pipelined operation). The timing diagram does not distinguish between read and write operations in order to illustrate both in the same figure.

Every transfer can be classified into one of four different types indicated by HTRANS[1:0] (signal source is the AHB master). IDLE (value “00”) indicates that no data transfer is required. BUSY (value “01”) indicates that the bus master is delaying an ongoing burst of transfers (the address and control signals must reflect the next transfer in the burst). NONSEQ (value “10”) indicates the first transfer of a burst or a single transfer (address and control signals are unrelated to the previous transfer). SEQ (value “11”) finally indicates the remaining transfers in a burst (address and control are related to the previous transfer). Figure 19 illustrates a burst transfer example.

Figure 19: Data transfer example on the AMBA AHB bus. In cycle C2 the master inserts a wait state, in cycle C6 the slave inserts a wait state.

• Cycle C1: The first transfer is the start of a burst and therefore HTRANS[1:0] shows NONSEQUENTIAL.

• Cycle C2: As an example, the master is unable to perform the second transfer of the burst and therefore the master uses the BUSY transfer to insert a wait state.

• Cycle C4: The master performs the third transfer (addr 0x28) immediately, but this time the slave is unable to complete and uses HREADY to insert a single wait state.

Four-, eight- and sixteen-beat bursts are defined in the AHB protocol, as well as undefined-length bursts and single transfers, defined by the signal HBURST[2:0] (signal source is the master). Both incrementing and wrapping bursts are supported in the protocol. Bursts must not cross 1 kB address boundaries.

• Incrementing bursts access sequential locations and the address of each transfer in the burst is just an increment of the previous address. The burst length is unspecified (indicated by INCR, value “001”). Single transfers are indicated by SINGLE (value “000”).

• For wrapping bursts, if the start address of the transfer is not aligned to the total number of bytes in the burst (size x beats), then the address of the transfers in the burst will wrap when the boundary is reached. The burst operations are identified by WRAP4 (value “010”), WRAP8 (value “100”) and WRAP16 (value “110”) for 4-, 8- and 16-beat wrapping bursts, and by INCR4 (value “011”), INCR8 (value “101”) and INCR16 (value “111”) for 4-, 8- and 16-beat incrementing bursts.

The total amount of data transferred in a burst is calculated by multiplying the number of beats by the amount of data in each beat, as indicated by HSIZE[2:0] (value “000” for byte, value “001” for half-word, value “010” for word, up to “111” for 128 bytes). All transfers within a burst must be aligned to the address boundary equal to the size of the transfer. For example, word transfers must be aligned to word address boundaries, that is HADDR[1:0] = 0.
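The burst arithmetic can be made concrete with a small C sketch; the helper below computes the address of each beat, wrapping at the total-size boundary for WRAP bursts (e.g. a WRAP4 word burst starting at 0x38 produces 0x38, 0x3C, 0x30, 0x34). The total size is assumed to be a power of two, as it is for all defined burst types.

/* start: first address; beat: 0..beats-1; bytes_per_beat: from HSIZE;
   beats: from HBURST; wrapping: 1 for WRAP4/8/16 bursts. */
unsigned beat_addr(unsigned start, unsigned beat,
                   unsigned bytes_per_beat, unsigned beats, int wrapping)
{
    unsigned total = beats * bytes_per_beat;    /* total bytes in the burst */
    unsigned addr  = start + beat * bytes_per_beat;
    if (wrapping)                               /* wrap at the total-size boundary */
        addr = (start & ~(total - 1)) | (addr & (total - 1));
    return addr;                                /* INCR bursts just increment */
}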

There are certain circumstances under which a burst will not be allowed to complete, and therefore it is important that any slave design which makes use of the burst information can take the correct course of action if the burst is terminated early. If HTRANS[1:0] indicates that a NONSEQUENTIAL or IDLE transfer occurs, then a new burst has started and the previous one must have been terminated. Figure 20 illustrates a very short incrementing burst of half-word transfers followed by a four-beat wrapping burst of word transfers.

Figure 20: Half-word incrementing burst followed by a four-beat wrapping burst of word-size transfers.

The protection control signals, HPROT[3:0] (signal source is the master), provide additional information about a bus access and are primarily intended for use by any module that wishes to implement some level of protection. The signals indicate whether the transfer is an opcode fetch or a data access, and whether it is a privileged-mode or a user-mode access.

During every transfer the slave shows its status using the response signals HRESP[1:0] (signal source is the AHB slave), which take the value OKAY for normal processing, ERROR for unsuccessful processing, and RETRY or SPLIT, both indicating that the transfer cannot complete immediately but that the bus master should continue to attempt the transfer. In normal operation a master is allowed to complete all the transfers in a particular burst before the arbiter grants another master access to the bus. However, in order to avoid excessive arbitration latencies, it is possible for the arbiter to break up a burst; in such cases the master must re-arbitrate for the bus in order to complete the remaining transfers in the burst.

After a master has started a transfer, the slave determines how the transfer should proceed. The HREADY signal is used to extend the transfer, and this works in combination with the response signals HRESP[1:0], which provide the status of the transfer. The slave can complete the transfer in a number of ways. It can:

• complete the transfer immediately
• insert one or more wait states to allow time to complete the transfer
• signal an error to indicate the transfer has failed
• delay the completion of the transfer, but allow the master and slave to back off the bus, leaving it available for other transfers

The slave signals the successful completion of a transfer with HREADY high and an OKAY response on HRESP[1:0]. As already seen in Figure 18, the slave can insert wait states into a transfer with HREADY low. The ERROR response is used by the slave to indicate some form of error condition with the associated transfer. Typically this is used for a protection error, such as an attempt to write to a read-only memory location. The SPLIT and RETRY response combinations allow slaves to delay the completion of a transfer but free up the bus for use by other masters. These response combinations are usually only required by slaves that have a high access latency and can make use of these response codes to ensure that other masters are not prevented from accessing the bus for long periods of time. All responses require at least two cycles, except the OKAY response, which needs only one cycle. The two-cycle response is required because of the pipelined nature of the bus. To complete a transfer with any of these responses, the slave drives HRESP[1:0] to indicate ERROR, RETRY or SPLIT while driving HREADY low to extend the transfer for an extra cycle. In the final cycle HREADY is driven high to end the transfer, while HRESP[1:0] remains driven to indicate ERROR, RETRY or SPLIT. Figure 21 illustrates a transfer with an ERROR and a RETRY response. In the illustrated example, the slave is not yet ready to decide what kind of response it should issue, so HREADY low in combination with an OKAY response is inserted before starting with the ERROR response. The difference between SPLIT and RETRY is the way the arbiter allocates the bus after their occurrence. After a RETRY the arbiter will continue to use the normal priority scheme and therefore only masters having a higher priority will gain access to the bus. After a SPLIT transfer the arbiter will adjust the priority scheme so that any other master requesting the bus will get access.

Figure 21: Example transfers with the slave issuing an ERROR and a RETRY response.

A simple address decoder (see Figure 22) is used to provide a select signal, HSELx (signal source is the decoder), for each slave on the bus. The select signal is a combinatorial decode of the high-order address signals. A slave must sample the address and control signals and HSELx only when HREADY is active. The minimum address space that can be allocated to a single slave is 1 kB. All bus masters are designed such that they will not perform incrementing transfers over a 1 kB boundary, which ensures that a burst never crosses an address decode boundary. In cases where a system design does not contain a completely filled memory map, an additional default slave should be implemented to provide a response when any nonexistent address location is accessed. The default slave should provide an ERROR response when a nonexistent address location is addressed during a NONSEQUENTIAL or SEQUENTIAL transfer attempt. IDLE or BUSY transfers to nonexistent locations should result in a zero wait state OKAY response.
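The combinatorial decode can be sketched as a simple C function; the memory map (one 1 kB region per slave) and the one-hot HSELx encoding are invented for illustration.

#include <stdint.h>

uint32_t decode_hsel(uint32_t haddr)
{
    switch (haddr >> 10) {       /* 1 kB granularity: drop the low 10 bits */
    case 0x0: return 1u << 0;    /* HSEL0: internal memory  */
    case 0x1: return 1u << 1;    /* HSEL1: APB bridge       */
    case 0x2: return 1u << 2;    /* HSEL2: external memory  */
    default:  return 1u << 3;    /* HSEL3: default slave    */
    }
}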

Figure 22: AHB decoder interface.

Typically the default slave functionality will be implemented as part of the central address decoder. Figure 23 shows a typical address decoding system and the slave select signals.

Figure 23: Central address decoder connection scheme.

The arbitration mechanism is used to ensure that only one master has access to the bus at any one time. The arbiter performs this function by observing a number of different requests to use the bus and deciding which is currently the highest priority master requesting the bus. The arbiter also receives requests from slaves that wish to complete SPLIT transfers.

A bus master uses the HBUSREQx signal (signal source is the AHB master) to request access to the bus during any cycle. The arbiter uses an internal priority algorithm to decide which master will be the next to gain access to the bus. The arbiter grants bus access to a master by issuing an HGRANTx signal (signal source is the AHB arbiter). The number of the granted master is also issued on HMASTER[3:0], which can be used to control the central address and control multiplexors. If the master requires locked access, then it must also assert the HLOCKx signal (signal source is the master) to indicate to the arbiter that no other masters should be granted the bus during the transfer.

Normally the arbiter will only grant the bus to a different master when a burst is completing. However, if required, the arbiter can terminate a burst early to allow a higher-priority master access to the bus. When a master is granted the bus and is performing a fixed-length burst, it is not necessary to continue requesting the bus in order to complete the burst. Figure 24 illustrates the arbitration process with the bus handover after a burst. Even if no master requests the bus (see the first two cycles), the arbiter will assign a bus master, which then has to issue IDLE transfers.


Figure 24: Bus arbitration process with handover after a burst.

Both the SPLIT and RETRY transfer responses must be used with care to prevent bus deadlock. A single transfer can never lock the AHB, as every slave must be designed to finish a transfer within a predetermined number of cycles. However, it is possible for a deadlock to occur if a number of different masters attempt to access a slave which issues SPLIT and RETRY responses in a manner the slave is unable to deal with. To prevent deadlocks, the slave must be able to withstand a request from every master in the system, up to a maximum of 16. A slave which issues RETRY responses must be accessed by only one master at a time. This is not enforced by the bus protocol and must be ensured by the system architecture. The interface of the AHB arbiter is shown in Figure 25.

Figure 25: AHB arbiter interface.

1.3.7. Introduction to the AMBA™ AHB-Lite Bus

The AHB-Lite bus architecture (see Figure 26) is a subset of the full AHB specification and is intended for use in designs where only a single bus master is used. AHB-Lite simplifies the AHB specification by removing the protocol required for multiple bus masters, which includes the request/grant protocol to the arbiter and the split/retry responses from slaves. Masters designed to the AHB-Lite interface specification can be significantly simpler in terms of interface design compared to a full AHB master. Any master that is already designed to the full AHB specification can be used in an AHB-Lite system without modification. The majority of AHB slaves can be used interchangeably in either an AHB or an AHB-Lite system, because AHB slaves that use neither the split nor the retry response are automatically compatible with both the full AHB and the AHB-Lite specification.


Figure 26: AHB-Lite single master bus system.

1.3.8. Introduction to the AMBA™ ASB Bus

The advanced system bus (ASB) specification defines a high-performance bus that can be used in the design of high-performance 16- and 32-bit embedded microcontrollers. The ASB supports the efficient connection of processors, on-chip memories and off-chip external memory interfaces with low-power peripheral macro-cell functions. The bus also provides the test infrastructure for modular macro-cell test and diagnostics. A typical AMBA ASB-based embedded microcontroller system looks very similar to the AHB-based system already illustrated in Figure 14. The bus functionality of an ASB-based system bus is slightly limited compared to an AHB-based system bus. The main functional difference is that split transfers are not possible on an ASB-based system bus. In addition, the maximum transfer size is limited to word size (32 bits). In contrast to the AHB bus, the ASB bus uses tristate signals. For a complete description of the ASB system bus architecture refer to the AMBA specification manual [1].

1.3.9. Introduction to the AMBA™ APB Bus

The APB is part of the AMBA hierarchy of buses and is optimized for minimal power consumption and reduced interface complexity. The APB appears as a local secondary bus that is encapsulated as a single AHB or ASB slave device. The APB provides a low-power extension to the system bus which builds on AHB or ASB signals directly. The APB bridge appears as a slave module which handles the bus handshake and control signal retiming on behalf of the local peripheral bus. By defining the APB interface from the starting point of the system bus, the benefits of the system diagnostics and test methodology can be exploited. The APB should be used to interface any peripherals which are low-bandwidth and do not require the high performance of a pipelined bus interface. An APB implementation typically contains a single APB bridge which is required to convert AHB or ASB transfers into a suitable format for the slave devices on the APB. The bridge provides latching of all address, data and control signals, as well as a second level of decoding to generate the slave select signals for the APB peripherals. All other modules on the APB are APB slaves. All APB slaves have the following interface specifications:

• address and control valid throughout the access (non-pipelined)
• zero-power interface during non-peripheral bus activity
• timing can be provided by decode with strobe timing (un-clocked interface)
• write data valid for the whole access (allowing a glitch-free transparent latch implementation)

For convenience, a typical AMBA-based system is again illustrated in Figure 27. As can be seen, the APB bus consists of only two different kinds of elements, the APB bridge and the APB slaves. Compared to the high-performance system buses AHB and ASB, the APB bus is very simple to implement.


Figure 27: Typical AMBA-based microcontroller system architecture.

The activity of the APB bus can be represented with only three states: IDLE, SETUP and ENABLE. IDLE is the default state of the bus when no transfer is in progress. When a transfer is required, the bus moves into the SETUP state, where the appropriate select signal PSELx (signal source is the APB bridge) is asserted. The bus remains in the SETUP state for only one cycle and always moves on to the ENABLE state. In the ENABLE state, the strobe signal PENABLE (signal source is the APB bridge) is asserted. During SETUP and ENABLE the address PADDR[31:0] (signal source is the APB bridge), the write signal PWRITE (signal source is the APB bridge) and the select signals PSELx remain stable. Again, the ENABLE state lasts for only one cycle. If no other bus transfer is required, the bus returns to the IDLE state.
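The three-state activity described above maps naturally onto a small finite-state machine. The following C sketch is purely illustrative - the apb_signals_t structure and the transfer_pending flag are assumptions of the sketch, not AMBA signal definitions - but it captures the transition rules just described:

```c
#include <stdbool.h>

typedef enum { IDLE, SETUP, ENABLE } apb_state_t;

typedef struct {
    bool psel;      /* PSELx: slave select, driven by the bridge */
    bool penable;   /* PENABLE: strobe, asserted in ENABLE only  */
} apb_signals_t;

/* One clock cycle of a toy APB bridge controller model. */
apb_state_t apb_step(apb_state_t state, bool transfer_pending,
                     apb_signals_t *sig)
{
    switch (state) {
    case IDLE:                        /* default state, no transfer */
        sig->psel = sig->penable = false;
        return transfer_pending ? SETUP : IDLE;
    case SETUP:                       /* lasts exactly one cycle    */
        sig->psel = true;             /* assert select, strobe low  */
        sig->penable = false;
        return ENABLE;                /* always moves to ENABLE     */
    case ENABLE:                      /* lasts exactly one cycle    */
        sig->penable = true;          /* assert strobe              */
        return transfer_pending ? SETUP : IDLE;
    }
    return IDLE;
}
```

Note that a pending transfer moves the machine from ENABLE directly back to SETUP, matching the rule that SETUP and ENABLE each last exactly one cycle.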

The basic write transfer (write direction APB bridge to slave) is shown in Figure 28. It can be seen that after a completed transfer the bus returns to the IDLE state. To reduce power consumption, the address and write signals do not change after a transfer until the next access occurs.

Figure 28: Basic APB write transfer.

For a read transfer, the timing of the address, write and strobe signals is the same as for a write transfer. In the case of a read, the slave must provide the data PRDATA[31:0] (signal source is the APB slave) during the ENABLE cycle. The data is sampled on the rising edge of the clock PCLK (signal source is the APB bridge) at the end of the ENABLE cycle (see Figure 29).


Figure 29: Basic APB read transfer.

The APB bridge is the only bus master on the AMBA APB. In addition, the APB bridge is also a slave on the higher-level system bus. Figure 30 shows the APB signal interface of the APB bridge. The bridge latches the address and holds it valid throughout the transfer, and generates the strobe signal PENABLE.

Figure 30: APB bridge interface diagram.

The APB slaves have a simple, yet flexible, interface. The exact implementation of the interface depends on the design style employed, and many different options are possible. Figure 31 shows the signal interface of an APB slave. For a write transfer the data can be latched on either a rising edge of PCLK or a rising edge of PENABLE, in both cases while PSEL is high.

Figure 31: APB slave interface diagram.

The timing diagram in Figure 32 shows a read transfer burst from the AHB bus to the APB slave. The transfer starts on the AHB at time C1. In very high clock frequency systems it may become necessary for the bridge to register the read data at the end of the ENABLE cycle and then for the bridge to drive this back to the AHB bus master in the following cycle. Although this will require an extra wait state for peripheral bus read transfers, it allows the AHB to run at a higher clock frequency, thus resulting in an overall improvement in system performance.

It is recommended that the APB is implemented with separate read and write data buses, which allows the use of either a multiplexed-bus or OR-bus scheme to interconnect the various slaves on the APB. If a tri-state bus is used, the read and write data buses may be combined into a single bus, as read data and write data never occur simultaneously.


Figure 32: Burst of read transfers from the AHB to the APB bus.

1.3.10. AMBATM Test Interface

The AMBA test philosophy allows individual modules in the system to be tested in isolation. Each module is designed so that it can be tested using only transfers from the bus and does not rely on the interaction of any other system element. It is therefore necessary to have access to those inputs and outputs of the peripheral that are not directly connected to the bus; this access is provided by a test harness (see Figure 33).

Figure 33: Module test harness.

A low gate-count test interface controller (TIC) bus master is required in the system to allow externally applied test vectors to be converted into internal bus transfers (see Figure 34). The TIC uses a three-wire handshake mechanism to control the application of test vectors, and the data path of the external bus interface (EBI) is used to provide a high-speed 32-bit parallel vector interface. To support this method of test vector application, a 32-bit bidirectional port must be available during test access. For a system with an external data bus interface of 32 bits this is straightforward. 16-bit and 8-bit data bus designs require, for example, 16 or 24 address lines to be configured as bidirectional test port signals for test mode access.


Figure 34: External test bus interface.

The test bus is used as an input to apply address, control and write vectors. For read vectors the test bus is used as a device output. The TIC is a bus master that accepts test vectors from the external test bus TBUS[31:0] and initiates bus transfers. The TIC latches address vectors and, when required, increments the address to allow read and write bursts of the test vectors.

Figure 35 shows the sequence of events when applying a set of write test vectors. Initially an address vector is applied and this is followed by a write test vector. For a detailed description of the test controller see [1].

Figure 35: Example of a write test vector transfer.

1.4. The Switching-Based Architectures

A non-hierarchical bus, and even a hierarchical bus, has limited available bandwidth. Since all devices connect to the bus, communications can interfere with each other. To build an embedded system with a large number of processing elements or with a need for high bandwidth, a different method is needed to connect the processing and other elements with each other. At the opposite end of the generality spectrum from the bus are the switching-based architectures. The crossbar shown in Figure 36 is one of their representatives.

Figure 36: Switching-based architecture: the crossbar.

Each processing and memory element is connected to the crossbar switch. A crossbar not only allows any input to be connected to any output, it also allows all combinations of input/output connections to be made. At every intersection is an electronic crosspoint switch that can be opened and closed in hardware. When a PEout (a microprocessor or processing element) wants to access a particular PEin (a memory, I/O device, microprocessor or general processing element), the crosspoint switch connecting them is closed momentarily to allow the access to take place. The virtue of the crossbar switch is that many PEouts can access PEins at the same time, although if two PEouts try to access the same PEin simultaneously, one of them has to wait. Thus, for example, we can simultaneously connect PEout1 to PEin4, PEout2 to PEin3, PEout3 to PEin2 and PEout4 to PEin1, or any other combination. Multicast connections can also be made from one PEout to several PEins. The major drawback of the crossbar switch is its expense: with n PEouts and n PEins, n² crosspoint switches are needed. For large n this can be prohibitive. As a result, people have looked for, and found, alternative switching architectures that require fewer switches.
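The cost and the blocking behavior of the crossbar can be made concrete with a toy model: represent the n × n crosspoint matrix as a boolean array (n² switches) and close a switch only if the requested PEin column is still free. All names and the fixed size N below are assumptions of the sketch:

```c
#include <stdbool.h>

#define N 4   /* n PEouts and n PEins -> n*n crosspoint switches */

/* closed[o][i] == true means the crosspoint between PEout o and
 * PEin i is closed. One column per PEin, so a PEin can only be
 * claimed by one PEout at a time, while multicast from one PEout
 * to several PEins is allowed.                                   */
static bool closed[N][N];

/* Try to connect PEout o to PEin i; fails if PEin i is in use. */
bool connect(int o, int i)
{
    for (int k = 0; k < N; k++)
        if (closed[k][i])         /* PEin already taken:        */
            return false;         /* the requester has to wait  */
    closed[o][i] = true;
    return true;
}

void disconnect(int o, int i) { closed[o][i] = false; }
```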

Figure 37: An omega switching architecture.

Many other architectures have been designed that provide varying amounts of parallel communication at varying hardware costs. Figure 37 shows an example of a so-called omega switching architecture. The crossbar is a direct network in which messages go from source to destination without passing through any memory element.

Most connection architectures are blocking, meaning that there are some combinations of sources and destinations for which messages cannot be delivered simultaneously. A non-hierarchical bus is a maximal blocking connection architecture since any message on the bus blocks messages from any other node.

In general, network connection architectures differ from bus-based architectures in how they implement communication protocols. Both need handshaking to ensure that processing elements do not interfere with each other. But in most network connection architectures, only the basic communication functions are implemented in hardware, while many other protocol-intensive operations are implemented in software.

1.5. Introduction to Networks

There are several reasons to build network-based embedded systems. When the processing tasks are physically distributed, it may be necessary to put some of the computing power near where the events occur. Consider, for example, an automobile: the short time delays required for tasks such as engine control generally mean that at least parts of those tasks are performed physically close to the engine. Data reduction is another important reason for distributed processing. It may be possible to perform some signal pre-processing on captured data to reduce its volume; for example, local fingerprint feature recognition after an optical fingerprint capture would significantly reduce the data stream. Reducing the data on a separate local processing element may significantly reduce the load on the processing element that makes use of that data. Modularity is another motivation for network-based design. For instance, when a large system is assembled out of existing components, those components may use a network port as a clean interface that does not interfere with the internal operation of the component in the way that using the local system bus would. Distributed embedded system design is another example of hardware/software co-design, since we must design the network topology as well as the software running on the network nodes (see Figure 38).

Figure 38: Example of distributed embedded system.

The most important difference between a distributed system and a SoC is the interprocess communication. In a SoC most interprocess communication implicitly assumes the existence of shared memory. A typical example is the producer-consumer problem, in which one processing element writes into a shared memory and another processing element reads from it. Even the most basic form of synchronization, the semaphore, requires that one word (the semaphore variable itself) is shared. In a distributed system there is no shared memory whatsoever, so the entire nature of interprocess communication must be rethought.

1.5.1. OSI Network Abstraction Model

To make it easier to deal with the numerous levels and issues involved in communication, the International Organization for Standardization (ISO) has developed a reference model that clearly identifies the various levels involved. The model gives the abstraction levels standard names and points out which level should do which job. This model is called the Open Systems Interconnection Reference Model [3], usually abbreviated as the OSI model. Although it is not intended to give a full description of this model and all its implications here, a short introduction will be helpful to understand interprocess communication in a broader context.

The OSI model is designed to allow open systems to communicate. An open system is one that is prepared to communicate with any other open system by using standard rules that govern the format, contents, and meaning of the messages sent and received. These rules are formalized in what are called protocols. Basically, a protocol is an agreement between the communicating parties on how communication is to proceed.

The OSI model distinguishes between two general types of protocols:

• With connection-oriented protocols, before exchanging data, the sender and receiver first explicitly establish a connection, and possibly negotiate the protocol they will use. When they are done, they must release the connection. The telephone is an example of a connection-oriented protocol.

• With connectionless protocols, no setup in advance is needed. The sender just transmits the first message when it is ready. Dropping a letter in a mailbox is an example of a connectionless communication.

In the OSI model, communication is divided into seven layers, as shown in Figure 39. Each layer deals with one specific aspect of communication. In this way, the problem can be divided into manageable pieces, each of which can be solved independently of the others. Each layer provides an interface to the one above it. The interface consists of a set of operations that together define the service the layer is prepared to offer its users.

Figure 39: The OSI model layers.

In the OSI model, when a process A on machine #1 wants to communicate with a process B on machine #2, it builds a message and passes the message to the application layer on its machine. This layer might be a library procedure, for example, but it could also be implemented in some other way (e.g. inside the RTOS, on an external processing element, etc.). The application layer software then adds a header to the front of the message and passes the resulting message across the layer 6/7 interface to the presentation layer. The presentation layer in turn adds its own header and passes the result down to the session layer, and so on. Some layers add not only a header to the front, but also a trailer to the end. When it reaches the bottom, the physical layer actually transmits the message, which by now might look as shown in Figure 40.


Figure 40: A typical message as it appears on the network.

When the message arrives at machine #2, it is passed upward, with each layer stripping off and examining its own header. Finally the message arrives at the receiver, process B, which may reply to it using the reverse path. Each layer has its own protocol, which can be changed independently of the others. It is precisely this independence that makes layered protocols attractive: each one can be changed as technology improves, without affecting the others. The collection of protocols used in a particular system is called a protocol suite or protocol stack.
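The header-wrapping mechanism can be illustrated with a small toy program; the layer tags are placeholders, not real protocol headers (the physical layer adds no header in this sketch):

```c
#include <stdio.h>
#include <string.h>

/* Going down the stack, each layer prepends its own header, so
 * the data link header ends up outermost on the wire; the
 * receiver strips the headers in reverse order.                */
static const char *hdr[] = { "APP|", "PRES|", "SESS|",
                             "TRANS|", "NET|", "DLL|" };

int main(void)
{
    char frame[256] = "hello";           /* message from process A */
    char tmp[256];

    for (int l = 0; l < 6; l++) {        /* layer 7 down to layer 2 */
        snprintf(tmp, sizeof tmp, "%s%s", hdr[l], frame);
        strcpy(frame, tmp);
    }
    printf("on the wire: %s\n", frame);  /* DLL|NET|...|APP|hello   */

    for (int l = 5; l >= 0; l--)         /* receiver: strip headers */
        memmove(frame, frame + strlen(hdr[l]),
                strlen(frame) - strlen(hdr[l]) + 1);
    printf("delivered:   %s\n", frame);  /* hello                   */
    return 0;
}
```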

The seven layers of the OSI model, shown in Figure 39, are intended to cover a broad spectrum of networks and their use. Some networks may not need the service of one or more layers. However, any data network should fit into the OSI model. The OSI layers from the lowest to the highest level of abstraction can briefly be described as follows:

1. The Physical Layer: The physical layer is concerned with transmitting the 0s and 1s. How many volts to use for 0 and 1, how many bits per second can be sent, and whether transmission can take place in both directions simultaneously are key issues in the physical layer. In addition, the size and shape of the network connector are of concern here. The physical layer protocol deals with standardizing the electrical, mechanical, and signaling interfaces so that when one machine sends a 0 bit it is actually received as a 0 bit and not a 1 bit.

2. The Data Link Layer: The physical layer just sends bits. As long as no errors occur, all is well. However, real communication networks are subject to errors, so some mechanism is needed to detect and correct them. This mechanism is the main task of the data link layer. What it does is group the bits into units, sometimes called frames, and see that each frame is correctly received. The data link layer does its work by putting a special bit pattern at the start and end of each frame to mark them, as well as computing a checksum by adding up all the bytes in the frame in a certain way. The data link layer appends the checksum to the frame. When the frame arrives, the receiver re-computes the checksum from the data and compares the result to the checksum following the frame. If they agree, the frame is considered correct and is accepted. If they disagree, the receiver asks the sender to retransmit it. Frames are assigned sequence numbers (in the header), so everyone can tell which is which.
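A minimal sketch of the checksum part of this mechanism, assuming a simple additive checksum appended as the last byte of the frame (a real data link layer would typically use a CRC):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Simple additive checksum over a frame's payload bytes. */
uint8_t frame_checksum(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];               /* wraps around modulo 256 */
    return sum;
}

/* Receiver side: recompute and compare with the appended byte.
 * On mismatch the receiver asks the sender to retransmit.      */
bool frame_ok(const uint8_t *frame, size_t len_with_checksum)
{
    size_t payload = len_with_checksum - 1;
    return frame_checksum(frame, payload) == frame[payload];
}
```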

3. The Network Layer: On a local area network (LAN), there is usually no need for the sender to locate the receiver. It just puts the message out on the network and the receiver takes it off. A wide area network (WAN), however, consists of a large number of machines, each with some number of lines to other machines, rather like a large-scale map showing major cities and the roads connecting them. For a message to get from the sender to the receiver it may have to make a number of hops, at each one choosing an outgoing line to use. The question of how to choose the best path is called routing, and is the primary task of the network layer. Two network-layer protocols are in widespread use, one connection-oriented (X.25, favored by operators of public networks such as telephone companies) and one connectionless (the internet protocol). The internet protocol (IP) can send messages (IP packets) without any setup, and each IP packet is routed to its destination independently of all others.

4. The Transport Layer: Packets can be lost on the way from the sender to the receiver. Although some applications can handle their own error recovery, others prefer a reliable connection. The job of the transport layer is to provide this service. The idea is that the session layer should be able to deliver a message to the transport layer with the expectation that it will be delivered without loss. Upon receiving a message from the session layer, the transport layer breaks it into pieces small enough for each to fit in a single packet, assigns each one a sequence number, and then sends them all. The discussion in the transport layer header concerns which packets have been sent, which have been received, how many more the receiver has room to accept, and similar topics. The official ISO transport protocol has five variants. The best-known transport protocol, however, is TCP (transmission control protocol), which is widely used in the TCP/IP combination at universities and on most UNIX systems.

5. The Session Layer: The session layer is essentially an enhanced version of the transport layer. It provides dialog control, to keep track of which party is currently talking, and it provides synchronization facilities. The latter are useful to allow users to insert checkpoints into long transfers, so that in the event of a crash it is only necessary to go back to the last checkpoint, rather than all the way back to the beginning.

6. The Presentation Layer: Unlike the lower layers, which are concerned with getting the bits from the sender to the receiver reliably and efficiently, the presentation layer is concerned with the meaning of the bits. Most messages consist of structured data and may contain information such as people's names, addresses, amounts of money, and so on. In the presentation layer it is possible to define records containing fields like these and then have the sender notify the receiver that a message contains a particular record in a certain format. This makes it easier for machines with different internal representations to communicate.

7. The Application Layer: The application layer is really just a collection of miscellaneous protocols for common activities such as electronic mail, file transfer, and connecting remote terminals to computers over a network. The best known of these are the X.400 electronic mail protocol and the X.500 directory server.

Although it may seem that embedded systems would be too simple to require use of the OSI model, the model is in fact quite useful. Even relatively simple embedded networks provide physical, data link, and network services. An increasing number of embedded systems provide Internet service that requires implementing the full OSI protocol stack.


1.5.2. The Client-Server Model

At first glance, layered protocols along the OSI lines look like a fine way to organize distributed embedded systems. In effect, a sender sets up a connection with the receiver and then pumps the bits in, which arrive without error, in order, at the receiver. The problem with the OSI model is that every time a message is sent it must be processed by about half a dozen layers, each one generating and adding a header on the way down or removing and examining a header on the way up. All of this work takes time. In WANs, where the number of bits/sec is typically fairly low, this overhead may be acceptable. However, for LAN-based distributed embedded systems, the protocol overhead can be substantial. So much time is wasted running protocols that the effective throughput over the LAN is often only a fraction of what the LAN could provide. As a consequence, LAN-based distributed embedded systems often do not use layered protocols at all, or if they do, they use only a subset of the entire protocol stack.

The idea behind the client-server model is to structure the operating system as a group of cooperating processes, called servers, that offer services to users, called clients. The client and server machines normally run the same microkernel, with both the clients and servers running as user processes. A machine or processing element may run a single process, or it may run multiple clients, multiple servers, or a mixture of the two. To avoid the considerable overhead of connection-oriented protocols such as OSI or TCP/IP, the client-server model is usually based on a simple, connectionless request/reply protocol. The client sends a request message to the server asking for some service (e.g., reading a block of a file). The server does the work and returns the requested data or an error code indicating why the work could not be performed.

The primary advantage of the client-server model is its simplicity. The client sends a request and gets an answer. No connection has to be established before use or torn down afterwards. The reply message serves as the acknowledgement of the request. From the simplicity comes another advantage: efficiency. The protocol stack is shorter and thus more efficient. Assuming that all the machines or processing elements have an identical client-server model implementation, only three levels of the OSI protocol are needed, as shown in Figure 41. The physical and data link protocols take care of getting the packets from the client to the server and back. These are always handled by the hardware. No routing is needed and no connections are established, so layers 3 and 4 are not needed. Layer 5 is the request/reply protocol. It defines the set of legal requests and the set of legal replies to these requests. There is no session management because there are no sessions. The upper layers are not needed either.

Figure 41: The client-server model and its comparison to the OSI protocol.

Due to this simple structure, the communication services provided by the microkernel can, for example, be reduced to two system calls, one for sending messages and one for receiving them. This simplified case represents the implementation of a message passing communication model by means of the client-server model. The necessary system calls can be invoked through simple library procedures or even be implemented directly in hardware by finite-state machines. A possible syntax might be: send(dest,&mptr) and receive(addr,&mptr). The former sends the message pointed to by mptr to a process identified by dest and might cause the caller to be blocked until the message has been sent. The latter too may cause the caller to be blocked until a message arrives. When one does, the message is copied to the buffer pointed to by mptr and the caller is unblocked. The addr parameter specifies the address to which the receiver is listening. Many variants of these two procedures and their parameters are possible, even non-blocking implementation variants.
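A minimal sketch of such a pair of blocking primitives in C; the declarations and the echo-server loop are illustrative assumptions, not the API of any particular microkernel:

```c
#include <stddef.h>

typedef struct {                 /* illustrative fixed-size message */
    unsigned char payload[64];
    size_t        len;
} message_t;

/* Blocking primitives as sketched in the text: both calls may
 * suspend the caller until the transfer completes. Bodies would
 * be provided by the microkernel (or a hardware FSM).           */
int send(int dest, const message_t *mptr);   /* block until sent     */
int receive(int addr, message_t *mptr);      /* block until received */

/* A trivial server loop built on these two calls only. */
void server_loop(int my_addr, int client_addr)   /* client assumed known */
{
    message_t req, reply;
    for (;;) {
        receive(my_addr, &req);     /* wait for a request        */
        reply = req;                /* echo service, for example */
        send(client_addr, &reply);  /* reply unblocks the client */
    }
}
```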

1.5.3. Message Passing Communication in Client-Server Model

Figure 42: Un-buffered message passing in client-server model.

Just as system designers have the choice between blocking and non-blocking primitives, they also have the choice between buffered and un-buffered send/receive primitives. A call receive(addr,&mptr) tells the microkernel of the machine on which it is running that the calling process is listening to address addr and is prepared to receive one message sent to that address (see Figure 42). A single message buffer, pointed to by mptr, is provided to hold the incoming message. When the message comes in, the receiving microkernel copies it to the buffer and unblocks the receiving process. The address is used to refer to a specific process. This scheme works fine as long as the server calls receive before the client calls send. The call to receive is the mechanism that tells the server's microkernel which address the server is using and where to put the incoming message. The problem arises when the send is issued before the receive. The server's microkernel then does not know which of its processes, if any, is using the address in the newly arrived message, nor does it know where to copy the message. One implementation strategy is to discard the message, let the client time out, and hope the server calls receive before the client retransmits. This approach is easy to implement, but the client may need several attempts before succeeding; worse, it may even conclude that the server has crashed.

The second approach to dealing with this problem is to have the receiving microkernel keep incoming messages around for a little while, just in case an appropriate receive is done shortly. Although this method reduces the chance that a message will have to be thrown away, it introduces the problem of storing and managing prematurely arriving messages. Buffers are needed and have to be allocated, freed, and generally managed. A conceptually simple way of dealing with this buffer management is to define a new data structure called a mailbox (see Figure 43). A process that is interested in receiving messages tells the microkernel to create a mailbox for it, and specifies an address to look for in network packets. Henceforth, all incoming messages with that address are put in the mailbox. The call to receive now just removes one message from the mailbox, or blocks (assuming blocking primitives) if none is present. In this way, the microkernel knows what to do with incoming messages and has a place to put them. This technique is frequently referred to as a buffered primitive (see Figure 43). At first glance, mailboxes appear to eliminate the race conditions caused by messages being discarded and clients giving up. However, mailboxes are finite and can fill up. When a message arrives for a mailbox that is full, the microkernel once again is confronted with the choice of either keeping it around for a while, hoping that at least one message will be extracted from the mailbox in time, or discarding it. These are precisely the same choices we had in the un-buffered case. Although we have perhaps reduced the probability of trouble, we have not eliminated it, nor have we even managed to change its nature.
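A conceptual mailbox can be sketched as a bounded ring buffer; the fact that mailbox_put can fail is exactly the "mailbox full" dilemma discussed above. Sizes and names are assumptions of the sketch:

```c
#include <stdbool.h>

#define MAILBOX_SLOTS 8    /* mailboxes are finite and can fill up */

typedef struct { unsigned char data[64]; } message_t;

typedef struct {
    message_t slot[MAILBOX_SLOTS];
    int       head, tail, count;    /* simple ring buffer */
} mailbox_t;

/* Called by the microkernel for each incoming packet addressed
 * to this mailbox; returns false when the box is full.          */
bool mailbox_put(mailbox_t *mb, const message_t *m)
{
    if (mb->count == MAILBOX_SLOTS)
        return false;               /* discard, or keep and hope */
    mb->slot[mb->tail] = *m;
    mb->tail = (mb->tail + 1) % MAILBOX_SLOTS;
    mb->count++;
    return true;
}

/* Called by receive(): removes one message, or reports "empty",
 * upon which a blocking receive would suspend the caller.       */
bool mailbox_get(mailbox_t *mb, message_t *m)
{
    if (mb->count == 0)
        return false;
    *m = mb->slot[mb->head];
    mb->head = (mb->head + 1) % MAILBOX_SLOTS;
    mb->count--;
    return true;
}
```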


Figure 43: Buffered message passing using mailbox in client-server model.

In some systems, another option is available: do not let a process send a message if there is no room to store it at the destination. To make this scheme work, the sender must block until an acknowledgment comes back saying that the message has been received. If the mailbox is full, the sender can be backed up and retroactively suspended as though the scheduler had decided to suspend it just before it tried to send the message. When space becomes available in the mailbox, the sender is allowed to try again.

So far we have assumed that when a client sends a message, the server will receive it. As usual, reality is more complicated than our abstract model. Messages can get lost, which affects the semantics of the message passing model. Suppose that blocking primitives are being used. When a client sends a message, it is suspended until the message has been sent. However, when it is restarted, there is no guarantee that the message has been delivered. The message might have been lost. Three different approaches to this problem are possible.

The first one is just to redefine the semantics of send to be unreliable. The system gives no guarantee about messages being delivered. Implementing reliable communication is entirely up to the users. The post office works this way. When you drop a letter in a mailbox, the post office does its best to deliver it, but it promises nothing.

The second approach is to require the kernel on the receiving machine to send an acknowledgement back to the kernel of the sending machine. Only when this acknowledgement is received will the sending microkernel unblock the client process. The acknowledgement goes from microkernel to microkernel; neither the client nor the server ever sees an acknowledgement. Just as the request from client to server is acknowledged by the server's microkernel, the reply from the server back to the client is acknowledged by the client's microkernel. Thus a request and reply now take four messages, as shown in Figure 44. Comparing the reliable message passing model to the OSI layered model now shows that four layers are realized.

Figure 44: Reliable message passing. Individually acknowledged messages. Comparison to OSI communication model.

The third approach is to take advantage of the fact that client-server communication is structured as a request from the client to the server followed by a reply from the server to the client (see Figure 45). In this method, the client is blocked after sending a message, assuming blocking primitives are used. The server's microkernel does not send back an acknowledgement as seen before. Instead, the reply itself acts as the acknowledgement. Thus the sender remains blocked until the reply comes in. If it takes too long, the sending microkernel can resend the request to guard against the possibility of a lost message. Although the reply functions as an acknowledgement for the request, there is no acknowledgement for the reply. Whether this omission is serious or not depends on the nature of the request. If, for example, the client asks the server to read a block of a file and the reply is lost, the client will just repeat the request and the server will send the block again. No damage is done and little time is lost. On the other hand, if the request requires extensive computation on the part of the server, it would be a pity to discard the answer before the server is sure that the client has received the reply. For this reason, an acknowledgement from the client's microkernel is sometimes used. Until this packet is received, the server's reply is not considered complete and the server remains blocked (still assuming blocking primitives are used). In any event, if the reply is lost and the request is retransmitted, the server's microkernel can see that the request is an old one and just send the reply again without waking up the server.
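The client side of this scheme reduces to "send, wait with timeout, retransmit". The sketch below assumes hypothetical primitives send() and try_receive() (a receive with timeout) and a request id that lets the server detect retransmissions; none of these names come from a real RTOS:

```c
#include <stdbool.h>

typedef struct { unsigned id; unsigned char data[60]; } message_t;

/* Placeholder microkernel primitives (illustrative signatures). */
bool send(int dest, const message_t *m);
bool try_receive(int addr, message_t *m, unsigned timeout_ms);

/* Reply-as-acknowledgement: the client stays blocked in this
 * loop until a matching reply arrives or it gives up.          */
bool request_reply(int server, int my_addr,
                   const message_t *req, message_t *reply)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        send(server, req);                     /* (re)send request   */
        if (try_receive(my_addr, reply, 100))  /* reply acknowledges */
            return reply->id == req->id;       /* match request id   */
        /* timeout: request or reply assumed lost, retransmit */
    }
    return false;                              /* give up            */
}
```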

Figure 45: Reliable message passing. Reply being used as acknowledgement of the request.

1.5.4. The I2C Bus

Many different networks have seen widespread use in distributed embedded systems over the years. Several interconnect networks have been developed especially for distributed embedded systems as well as for systems-on-chip. A selection of these buses is briefly described in the following text.

The I2C bus, developed and patented by Philips, allows integrated circuits to communicate directly with each other via a simple bidirectional 2-wire bus (see Philips Application Note AN422 [5]). The comprehensive family of CMOS and bipolar ICs incorporating the on-chip I2C interface offers many advantages to designers of digital control for industrial, consumer and telecommunications equipment. A typical system configuration is shown in Figure 46.

Figure 46: Structure of an I2C bus system.

Interfacing the devices in an I2C-based system is very simple because they connect directly to the two bus lines: a serial data line (SDA) and a serial clock line (SCL). A prototype system or a final product version can easily be modified or upgraded by “clipping” or “unclipping” ICs or on-chip bus interfaces to or from the bus. The simplicity of designing with the I2C bus does not reduce its effectiveness; it is a reliable, multi-master bus with integrated addressing and data-transfer protocols. In addition, the I2C-bus compatible bus interfaces and ICs provide cost reduction benefits to equipment manufacturers, among them smaller IC packages and a minimization of PCB traces and glue logic.

Physical Layer:

Both I2C-bus lines SDA and SCL are connected to a positive supply via a pull-up resistor, and remain at a logic high level when the bus is not busy. Each device is recognized by a unique address - whether it is a microcomputer, LCD driver, memory or keyboard interface - and can operate as either a transmitter or a receiver, depending on the function of the device. A device generating a message or data is a transmitter, and a device receiving the message or data is a receiver. Obviously, a passive function like an LCD driver can only be a receiver, while a processing element, microcontroller or memory can both transmit and receive data.

Figure 47: Electrical interface to the I2C bus.

The basic electrical interface to the bus is shown in Figure 47. The bus does not define particular voltages to be used for high or low levels, so that either bipolar or CMOS circuits can be connected to the bus. Both bus signals use open-drain (for CMOS) or open-collector (for bipolar) outputs. A pull-up keeps the default state of the signal high, and transistors are used in each bus device to pull down the signal when a “0” is to be transmitted. The open-drain/open-collector signaling allows several devices to drive the bus simultaneously without causing electrical damage.

Care has to be taken when using the I2C bus to connect chips with different power supplies. With a single MOS-FET, a bi-directional level shifter circuit can be realized to connect devices with different supply voltages, e.g. 5 Volt and 3.3 Volt, to one I2C-bus system. The level shifter can also isolate a bus section of powered-down devices from the I2C-bus, allowing the powered part of the I2C-bus to operate normally (see Philips Application Note AN97055 [6]).

Figure 48: Bi-directional level shifter circuit connects two different voltage sections of an I2C bus system.

In the bus system of Figure 48 the left section has pull-up resistors and devices connected to a 3.3 Volt supply, the right section has pull-up resistors and devices connected to a 5 Volt supply. The level shifter for each bus line is identical and consists of one discrete N-channel enhancement MOS-FET. The gates have to be connected to the lowest supply voltage VDD1, the sources to the bus lines of the “lower voltage” section, and the drains to the bus lines of the “higher voltage” section. The diode between drain and substrate is inherent to the MOS-FET, formed by the drain-substrate p-n junction.

For the level shift operation three states have to be considered:

• State 1: No device is pulling down the bus line, and the bus line of the “lower voltage” section is pulled up by its pull-up resistors to 3.3 V. The gate and the source of the MOS-FET are both at 3.3 V, so its VGS is below the threshold voltage and the MOS-FET is not conducting. This allows the bus line of the “higher voltage” section to be pulled up by its pull-up resistor to 5 V. The bus lines of both sections are thus “high”, but at different voltage levels.

• State 2: A 3.3 V device pulls down the bus line to a “low” level. The source of the MOS-FET also becomes “low”, while the gate stays at 3.3 V. VGS rises above the threshold and the MOS-FET starts conducting. Now the bus line of the “higher voltage” section is also pulled down to a “low” level by the 3.3 V device via the conducting MOS-FET. The bus lines of both sections thus become “low” at the same voltage level.

• State 3: A 5 V device pulls down the bus line to a “low” level. Via the drain-substrate diode of the MOS-FET, the “lower voltage” section is initially pulled down until VGS passes the threshold and the MOS-FET starts conducting. Now the bus line of the “lower voltage” section is pulled down further to a “low” level by the 5 V device via the conducting MOS-FET. The bus lines of both sections thus become “low” at the same voltage level.

The three states show that the logic levels are transferred in both directions of the bus system, independent of the driving section. States 2 and 3 perform the “wired-AND” function between the bus lines of both sections, as required by the I2C-bus specification. Supply voltages other than 3.3 V for VDD1 and 5 V for VDD2 can be applied; e.g. 2 V for VDD1 and 10 V for VDD2 is feasible. In normal operation VDD2 must be equal to or higher than VDD1. If it is necessary also to isolate the “higher voltage” section when it is powered off, the level shifter circuit can be extended as shown in Figure 49.

The pull-up resistors to VDD3 are not necessary for proper operation and may have a high resistance value. They can be added to prevent the MOS-FET drains from floating at a “high” level. VDD3 is preferably connected to the highest supply voltage. Further sections with a higher, lower or equal supply voltage can be added by connecting them via additional MOS-FETs to the common drain terminals, in the same way as the other sections. Every section is isolated from the rest of the bus system when its supply voltage is switched off, while level shifting between all other sections remains operational.


Figure 49: Extended level shifter circuit for I2C bus with power off capability at the 5V section.

Data Link Layer:

Master and slaves: When a data transfer takes place on the bus, a device can be either a master or a slave. The device which initiates the transfer, and generates the clock signals for this transfer, is the master. At that time any addressed device is considered a slave. It is important to note that a master can be either a transmitter or a receiver: a master microcontroller may send data to a RAM acting as a transmitter, and then interrogate the RAM for its contents acting as a receiver - in both cases performing as the master initiating the transfer. In the same manner, a slave can be both a receiver and a transmitter. The I2C is a multi-master bus: it is possible to have, in one system, more than one device capable of initiating transfers and controlling the bus (see Figure 46). A processing element or microcontroller may act as a master for one transfer, and then be the slave for another transfer initiated by another processor on the network. The master/slave relationships on the bus are not permanent and may change on each transfer. As more than one master may be connected to the bus, it is possible that two devices will try to initiate a transfer at the same time. Obviously, in order to eliminate bus collisions and communications chaos, an arbitration procedure is necessary.

Bus Arbitration: The I2C design has an inherent arbitration and clock synchronization procedure relying on the wired-AND connection of the devices on the bus. In a typical multi-master system, a microcontroller program should allow it to gracefully switch between master and slave modes and preserve data integrity upon loss of arbitration. The I2C bus arbitrates on each message. When sending, devices listen to the bus as well. If a device is trying to send a logic “1” but hears a logic “0”, it immediately stops transmitting and gives the other sender priority. In many cases, arbitration will be completed during the address portion of a transmission, but arbitration may continue into the data portion. If two devices are trying to send identical data to the same address, then of course they never interfere and both succeed in sending their message.
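Because of the wired-AND bus, arbitration reduces to each transmitter comparing what it drives with what it reads back. A hedged sketch of this per-bit decision, with the low-level line access functions as placeholders for platform-specific pin control:

```c
#include <stdbool.h>

/* Placeholder helpers: drive SDA (open-drain: a '0' is driven
 * actively, a '1' means releasing the line) and read it back.  */
void sda_drive(bool bit);
bool sda_read(void);

/* Returns true if this device kept the bus for this bit, false
 * if it lost arbitration and must stop transmitting.           */
bool arbitrate_bit(bool my_bit)
{
    sda_drive(my_bit);
    bool bus = sda_read();          /* listen while sending      */
    if (my_bit && !bus)             /* sent '1' but heard '0':   */
        return false;               /* other sender has priority */
    return true;
}
```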

Bit and byte transfer: One data bit is transferred during each clock pulse (see Figure 50). The data on the SDA line must remain stable during the “high” period of the clock pulse in order to be valid. Changes in the data line at this time will be interpreted as control signals.

• A “high”-to-“low” transition of the data line (SDA) while the clock signal (SCL) is “high” indicates a Start condition.

• A “low”-to-“high” transition of the SDA while SCL is “high” defines a Stop condition.

The bus is considered to be busy after the Start condition and free again a certain time interval after the Stop condition. The Start and Stop conditions are always generated by the master. The number of data bytes transferred between the Start and Stop conditions from transmitter to receiver is not limited. Each byte, which must be eight bits long, is transferred serially with the most significant bit first, and is followed by an acknowledge bit. The clock pulse related to the acknowledge bit is generated by the master. The device that acknowledges has to pull down the SDA line during the acknowledge clock pulse, while the transmitting device releases the SDA line (“high”) during this pulse (see Figure 50). Figure 51 illustrates the byte transfer of Figure 50 in more detail by identifying the SDA and SCL signal line drivers.

Figure 50: Transmitting a single byte on the I2C bus with start and stop conditions.

Figure 51: The SDA and SCL line drivers are identified for the above byte transfer.

A slave receiver must generate an acknowledgement after the reception of each byte, and a master must generate one after the reception of each byte clocked out of the slave transmitter. If a receiving device cannot receive the data byte immediately, it can force the transmitter into a wait state by holding the clock line (SCL) “low”. When designing a system, it is necessary to take into account cases where an acknowledge is not received. This happens, for example, when the addressed device is busy in a real-time operation. In such a case the master, after an appropriate time-out, should abort the transfer by generating a Stop condition, allowing other transfers to take place. These “other transfers” could be initiated by other masters in a multi-master system, or by this same master.

There are two exceptions to the “acknowledge after every byte” rule. The first occurs when the master is a receiver: it must signal the end of data to the transmitter by NOT signaling an acknowledgement on the last byte that has been clocked out of the slave. The acknowledge-related clock pulse, generated by the master, still takes place, but the SDA line is not pulled down. To indicate that this is an active and intentional lack of acknowledgement, this special condition is termed a “negative acknowledge” (see NAck in Figure 52). The second exception is that a slave will send a negative acknowledge when it can no longer accept additional data bytes. This occurs after an attempted transfer that cannot be accepted. The bus design also includes special provisions for interfacing to microprocessors which implement all of the I2C communication in software only - the so-called “Slow Mode”. When all of the devices on the network have built-in I2C hardware support, the “Slow Mode” is irrelevant.
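A bit-banged master transmitting one byte, MSB first, followed by sampling the acknowledge bit, might look as follows; the pin helpers are placeholders for whatever GPIO access a concrete platform provides:

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder pin helpers for a bit-banged open-drain master. */
void scl(bool level);               /* drive/release SCL */
void sda(bool level);               /* drive/release SDA */
bool sda_in(void);                  /* sample SDA        */

/* Transmit one byte, MSB first, then sample the acknowledge
 * bit. Returns true on Ack (SDA pulled low by the receiver),
 * false on negative acknowledge (NAck).                      */
bool i2c_write_byte(uint8_t byte)
{
    for (int i = 7; i >= 0; i--) {
        scl(false);
        sda((byte >> i) & 1);       /* change data while SCL low */
        scl(true);                  /* data must be stable here  */
    }
    scl(false);
    sda(true);                      /* release SDA for the ack   */
    scl(true);                      /* ack clock pulse (master)  */
    bool ack = !sda_in();           /* receiver pulls SDA low    */
    scl(false);
    return ack;
}
```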


Figure 52: Typical bus transactions on the I2C bus.

Addressing: Each device on the bus has its own unique address. Before any data is transmitted on the bus, the master transmits the address of the slave to be accessed for this transaction. A well-behaved slave with a matching address, if it exists on the network, should of course acknowledge the master’s addressing. The addressing is done by the first byte transmitted by the master after the Start condition. An address on the network is seven bits long, appearing as the most significant bits of the address byte. The last bit is a direction (R/W) bit: a zero indicates that the master is transmitting (Write) and a one indicates that the master requests data (Read). A complete data transfer, comprised of an address byte indicating a Write and two data bytes, is shown in Figure 52. When an address is sent, each device in the system compares the first seven bits after the Start with its own address. If there is a match, the device considers itself addressed by the master and sends an acknowledgement. The device can also determine whether in this transaction it is assigned the role of slave receiver or slave transmitter, depending on the R/W bit. Each node of the I2C network has a unique seven-bit address. The address of a processing element or microcontroller is usually fully programmable, while peripheral devices usually have fixed and programmable address portions. In addition to the “standard” addressing discussed here, the I2C bus protocol allows for “general call” addressing.
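The address byte is thus simply the 7-bit address shifted left by one, with the R/W bit as least significant bit; a one-line helper makes this explicit (the example address 0x50 is purely illustrative):

```c
#include <stdint.h>

#define I2C_READ  1u   /* master requests data  */
#define I2C_WRITE 0u   /* master transmits data */

/* The address byte sent after the Start condition: the 7-bit
 * slave address occupies the most significant bits, the R/W
 * direction bit is the least significant bit.                */
static inline uint8_t i2c_addr_byte(uint8_t addr7, uint8_t rw)
{
    return (uint8_t)((addr7 << 1) | (rw & 1u));
}
/* Example: i2c_addr_byte(0x50, I2C_WRITE) yields 0xA0. */
```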

Transfer Formats: When the master is communicating with one device only, data transfers follow the format of Figure 52, where the R/W bit can indicate either direction. After completing the transfer and issuing a Stop condition, a master that would like to address some other device on the network can of course start another transaction, issuing a new Start. Another way for a master to communicate with several different devices is by using a “repeated start”: after the last byte of the transaction has been transferred, including its acknowledge (or negative acknowledge), the master issues another Start, followed by an address byte and data - without issuing a Stop. The master may communicate with a number of different devices, combining Reads and Writes. After the last transfer takes place, the master issues a Stop and releases the bus. Possible data formats are demonstrated in Figure 52. Note that the repeated start allows both a change of slave and a change of direction without releasing the bus. In a single-master system, the repeated start mechanism may be more efficient than terminating each transfer with a Stop and starting again. In a multi-master environment, the determination of which format is more efficient can be more complicated, since a master using repeated starts occupies the bus for a long time and thus prevents other devices from initiating transfers.

Application interface: The I2C interface on a processing element or microprocessor can be implemented with varying percentages of the functionality in software or hardware. I2C bus slaves are normally implemented in hardware, as otherwise the processing elements would be busy polling for data and would thus lose processing power.


The I2C specification only defines the physical and data link layers as described above. Any further layers, as we have learned from the OSI model, are not defined in the I2C specification. It is up to the designer to design and implement the application layer and the necessary amount of OSI layer functionality in between. Generally the I2C is meant to be a simple and cheap interface for low data throughput. Typical applications where the I2C interface is popular are initialization and command word exchange between different processing or peripheral elements like MP3 encoders, AD converters, small EEPROMs, intelligent sensors and so on.

1.5.5. The CAN Bus

Developed by Bosch in Germany [7],[8], the CAN bus (Controller Area Network) was originally designed for automotive electronics. As digital electronics were introduced into automotive components, not only did the individual components get smarter, but the need for them to communicate in order to execute their functions grew. Today, CAN is also used in applications other than automotive systems and has become a general industrial bus. Some users, for example in the field of medical engineering, opted for CAN because they have to meet particularly stringent safety requirements. Similar problems are faced by manufacturers of other equipment with very high safety or reliability requirements (e.g. robots, lifts and transportation systems). CAN networks can be used as an embedded communication system for microcontrollers as well as an open communication system for intelligent devices. The CAN serial bus system is increasingly being used in industrial field bus systems because of its low cost, its ability to function in a difficult electrical environment, its high degree of real-time capability and its ease of use. CAN also provides a good example of a distributed network for applications such as automobiles. A typical system configuration using the CAN bus is shown in Figure 53.

Figure 53: Structure of a CAN bus system.

The CAN bus uses a simple two-wire differential serial bus system, operates in noisy electrical environments with a high level of data integrity, and its open architecture and user-definable transmission medium make it extremely flexible. Capable of high-speed (1 Mbit/s) data transmission over short distances (40 m) and low-speed (5 kbit/s) transmission at lengths of up to 10 km, the multi-master CAN bus is highly fault tolerant, with powerful error detection and handling designed in.

CAN is a serial bus system with multi-master capabilities, that is, all CAN nodes are able to transmit data and several CAN nodes can request the bus simultaneously. The serial bus system with real-time capabilities is the subject of the ISO 11898 international standard and covers the lowest two layers of the ISO/OSI reference model. The CAN 2.0A specification describes the standard (11-bit) message format; CAN 2.0B additionally describes the extended (29-bit) message format. In CAN networks there is no addressing of subscribers or stations in the conventional sense; instead, prioritized messages are transmitted.
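Since a '0' is dominant on the bus, bitwise wired-AND arbitration over the identifier field lets the frame with the numerically lowest identifier - i.e. the highest priority - survive intact. A toy simulation of this process for 11-bit (CAN 2.0A) identifiers, with all names being assumptions of the sketch:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 8   /* n must not exceed MAX_NODES */

/* Bit-by-bit arbitration over 11-bit identifiers: the bus level
 * is the wired-AND of all driven bits, so a node that sends a
 * recessive '1' while the bus shows a dominant '0' drops out.
 * Returns the index of the winning node.                       */
int can_arbitrate(const uint16_t id[], int n)
{
    bool sending[MAX_NODES];
    for (int i = 0; i < n; i++) sending[i] = true;

    for (int bit = 10; bit >= 0; bit--) {
        int bus = 1;                           /* recessive default */
        for (int i = 0; i < n; i++)            /* wired-AND         */
            if (sending[i] && !((id[i] >> bit) & 1))
                bus = 0;
        for (int i = 0; i < n; i++)            /* losers drop out   */
            if (sending[i] && ((id[i] >> bit) & 1) && bus == 0)
                sending[i] = false;
    }
    for (int i = 0; i < n; i++)                /* lowest identifier */
        if (sending[i]) return i;              /* wins arbitration  */
    return -1;
}
```

Because identifiers are unique throughout the network, exactly one transmitter survives and its message is sent without any corruption or delay caused by the losers.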

The CAN protocol is based on a layered structure according to the OSI model (see Figure 54). The lowest layer is the physical layer, which defines how signals are actually transmitted. Within the CAN specification the physical layer is not defined, so that transmission medium and signal-level implementations can be optimized for their application. The data link layer is described by the so-called CAN transfer and CAN object layers. The transfer layer represents the kernel of the CAN protocol. It presents messages received to the object layer and accepts messages to be transmitted from the object layer. The transfer layer is responsible for bit timing and synchronization, message framing, arbitration, acknowledgement, error detection and signaling, and fault confinement. The object layer is concerned with message filtering as well as status and message handling.

Figure 54: Layered structure of a CAN node.

Physical Layer:

As shown in Figure 55, each node on the CAN bus has its own electrical drivers and receivers that connect the node to the bus in a wired-AND fashion. The driving circuits pull the bus down to '0' whenever any node drives a '0', making a '0' dominant over a '1'. As the CAN specification does not define the physical layer, application-domain-specific electrical characteristics are found in practice. Figure 55 illustrates a typical 2-wire differential CAN bus realization used for long distances or high-speed requirements. For low-speed requirements or in electrically quiet environments, even single-line CAN buses can be found, as long as a common ground is available (see Figure 56).

Figure 55: Physical and electrical organization of a differential line CAN bus.


Figure 56: Physical and electrical organization of a single line CAN bus for low speed requirements in low-noise environments.

Data Link Layer:

Principles of data exchange: When data are transmitted by CAN, no nodes are addressed; instead, the content of the message (e.g. rpm or engine temperature) is designated by an identifier that is unique throughout the network. The identifier defines not only the content but also the priority of the message. This is important for bus allocation when several nodes are competing for bus access. If the CPU of a given node wishes to send a message to one or more nodes, it passes the data to be transmitted and their identifiers to the assigned CAN controller ("make ready"). This is all the CPU has to do to initiate data exchange. The message is constructed and transmitted by the CAN controller. As soon as the CAN controller receives the bus allocation ("send message"), all other nodes on the CAN network become receivers of this message ("receive message"). Each node in the CAN network, having received the message correctly, performs an acceptance test to determine whether the data received are relevant for that node ("select"). If the data are of significance for the node concerned, they are processed ("accept"); otherwise they are ignored (see Figure 57).

Figure 57: Broadcast transmission and acceptance filtering by CAN nodes.

A high degree of system and configuration flexibility is achieved as a result of the content-oriented addressing scheme. It is very easy to add nodes to the existing CAN network without making any hardware or software modifications to the existing nodes, provided that the new nodes are purely receivers. Because the data transmission protocol does not require physical destination addresses for the individual components, it supports the concept of modular electronics and also permits multiple reception (broadcast, multicast) and the synchronization of distributed processes: measurements needed as information by several controllers can be transmitted via the network, in such a way that it is unnecessary for each controller to have its own sensor.
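Acceptance filtering is typically implemented with an identifier/mask pair per node. The following C sketch shows the idea; the type name, helper function, and filter values are illustrative assumptions, not taken from any particular CAN controller.

#include <stdbool.h>
#include <stdint.h>

/* Acceptance filter sketch: a received message is relevant for a node
 * if the identifier bits selected by 'mask' match the node's 'filter'.
 * All values are illustrative only. */
typedef struct {
    uint16_t filter;   /* expected identifier bits              */
    uint16_t mask;     /* '1' bits take part in the comparison  */
} can_accept_t;

static bool accept(const can_accept_t *f, uint16_t id)
{
    return (id & f->mask) == (f->filter & f->mask);
}

/* Example: accept the whole (made-up) identifier group 0x120..0x12F. */
static const can_accept_t engine_group = { 0x120, 0x7F0 };
/* accept(&engine_group, 0x123) -> true; accept(&engine_group, 0x200) -> false */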


Figure 58: Principle of non-destructive bitwise arbitration.

Non-destructive bitwise arbitration: For the data to be processed in real time they must be transmitted rapidly. This not only requires a physical data transfer path of up to 1 Mbit/s but also calls for rapid bus allocation when several nodes wish to send messages simultaneously. In real-time processing the urgency of messages to be exchanged over the network can differ greatly: a rapidly changing quantity (e.g. engine load) has to be transmitted more frequently and therefore with less delay than other quantities (e.g. engine temperature) which change relatively slowly. The priority at which a message is transmitted, compared with another less urgent message, is specified by the identifier of the message concerned. The priorities are laid down during system design in the form of corresponding binary values and cannot be changed dynamically. The identifier with the lowest binary number has the highest priority. Bus access conflicts are resolved by bitwise arbitration on the identifiers involved, with each node observing the bus level bit for bit. In accordance with the wired-AND mechanism, by which the dominant state (logical '0') overwrites the recessive state (logical '1'), the competition for bus allocation is lost by all those nodes that transmit recessive but observe dominant. All "losers" automatically become receivers of the message with the highest priority and do not reattempt transmission until the bus is available again (see Figure 58).
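To make the wired-AND arbitration mechanism concrete, the following C sketch simulates several nodes driving their 11-bit identifiers onto the bus, most significant bit first; a node that sends recessive but reads dominant drops out. The identifiers, helper names, and the eight-node limit are illustrative assumptions.

#include <stdio.h>

#define ID_BITS 11

/* Simulate non-destructive bitwise arbitration on a wired-AND bus:
 * each active node drives one identifier bit per bit time; a node that
 * sends recessive ('1') but reads dominant ('0') has lost and drops out. */
static int arbitrate(const unsigned ids[], int n)
{
    int active[8];                      /* illustrative limit of 8 nodes */
    for (int i = 0; i < n; i++) active[i] = 1;

    for (int bit = ID_BITS - 1; bit >= 0; bit--) {
        unsigned bus = 1;               /* recessive unless someone pulls down */
        for (int i = 0; i < n; i++)
            if (active[i])
                bus &= (ids[i] >> bit) & 1u;        /* wired-AND */
        for (int i = 0; i < n; i++)
            if (active[i] && ((ids[i] >> bit) & 1u) != bus)
                active[i] = 0;          /* sent recessive, read dominant: lost */
    }
    for (int i = 0; i < n; i++)
        if (active[i]) return i;        /* lowest identifier wins */
    return -1;
}

int main(void)
{
    unsigned ids[] = { 0x65A, 0x3D2, 0x65B };       /* made-up identifiers */
    printf("winner: node #%d\n", arbitrate(ids, 3)); /* node #1 (0x3D2) */
    return 0;
}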

In order to process all transmission requests of a CAN network while complying with latency constraints at as low a data transfer rate as possible, the CAN protocol must implement a bus allocation method that guarantees unambiguous bus allocation even when there are simultaneous bus accesses from different nodes. The method of bitwise arbitration on the identifiers of the messages to be transmitted uniquely resolves any collision between a number of stations wanting to transmit, and it does this at the latest within 13 (standard format) or 33 (extended format) bit periods for any bus access period. Unlike the message-wise arbitration employed by the CSMA/CD method, this non-destructive method of conflict resolution ensures that no bus capacity is used without transmitting useful information. Even in situations where the bus is overloaded, the linkage of the bus access priority to the content of the message proves to be a beneficial system attribute compared with existing CSMA/CD or token protocols: in spite of the insufficient bus transport capacity, all outstanding transmission requests are processed in order of their importance to the overall system (as determined by the message priority). The available transmission capacity is utilized efficiently for the transmission of useful data since "gaps" in bus allocation are kept very small. The collapse of the whole transmission system due to overload, as can occur with the CSMA/CD protocol, is not possible with CAN. Thus, CAN permits the implementation of fast, traffic-dependent bus access that is non-destructive thanks to bitwise arbitration based on message priority; this scheme is known as Carrier Sense Multiple Access with Arbitration on Message Priority (CSMA/AMP).

As seen earlier, we distinguish between centralized and decentralized bus access control. The concept of centralized bus access control has the disadvantage that the strategy for failure management is difficult and costly to implement, and that the takeover of the central node by a redundant node can be very time-consuming. For these reasons, and to circumvent the problem of the reliability of the master node (and thus of the whole communication system), the CAN protocol implements decentralized bus control. All major communication mechanisms, including bus access control, are implemented several times in the system, because this is the only way to fulfill the high availability requirements of the communication system.

In summary it can be said that CAN implements a traffic-dependent bus allocation system that permits, by means of a non-destructive bus access with decentralized bus access control, a high useful data rate at the lowest possible bus data rate in terms of the bus busy rate for all nodes. The efficiency of the bus arbitration procedure is increased by the fact that the bus is utilized only by those nodes with pending transmission requests. These requests are handled in the order of the importance of the messages for the system as a whole. This proves especially advantageous in overload situations. Since bus access is prioritized on the basis of the messages, it is possible to guarantee low individual latency times in real-time systems.

Figure 59: The CAN standard data frame format.

Message frame formats: The CAN protocol supports two message frame formats, the only essential difference being the length of the identifier (ID): in the standard format the ID is 11 bits long, in the extended format 29 bits. The message frame for transmitting messages on the bus comprises seven main fields (see Figure 59). A message in the standard format begins with the start bit ("start of frame"). This is followed by the "arbitration field", which contains the identifier and the RTR (remote transmission request) bit, which indicates whether it is a data frame or a request frame without any data bytes (remote frame). The "control field" contains the IDE (identifier extension) bit, which indicates either standard or extended format, a bit reserved for future extensions, and, in the last 4 bits, a count of the data bytes in the data field. The "data field" ranges from 0 to 8 bytes in length and is followed by the "CRC field", which is used as a frame security check for detecting bit errors. The "ACK field" comprises the ACK slot (1 bit) and the ACK delimiter (1 recessive bit). The bit in the ACK slot is sent as a recessive bit and is overwritten as a dominant bit by those receivers which have received the data correctly at this time (positive acknowledgement). Correct messages are acknowledged by the receivers regardless of the result of the acceptance test. The end of the message is indicated by "end of frame". "Intermission" is the minimum number of bit periods separating consecutive messages. If there is no following bus access by any station, the bus remains idle ("bus idle").

The CAN protocol was extended by the introduction of a 29-bit identifier. This identifier is made up of the existing 11-bit identifier (base ID) and an 18-bit extension (ID extension). Thus the CAN protocol allows the use of two message formats: Standard CAN (Version 2.0A) and Extended CAN (Version 2.0B). As the two formats have to coexist on one bus, it is laid down which message has higher priority in the case of bus access collisions with differing formats and the same base identifier: the message in standard format always has priority over the message in extended format. CAN controllers which support the extended format can also send and receive messages in standard format. When CAN controllers which only cover the standard format (Version 2.0A) are used on a network, then only messages in standard format can be transmitted on the entire network.
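For software sitting on top of a CAN controller, a message is usually represented by its identifier, frame-type flags, data length code, and up to eight data bytes. The following C struct is one plausible application-level view under these assumptions; it is not the register layout of any specific controller, and the example message content is made up.

#include <stdbool.h>
#include <stdint.h>

/* Application-level view of a CAN message (illustrative sketch,
 * not a controller register map). */
typedef struct {
    uint32_t id;        /* 11-bit (standard) or 29-bit (extended) identifier */
    bool     extended;  /* IDE: extended (2.0B) frame format                 */
    bool     rtr;       /* remote transmission request (no data bytes)       */
    uint8_t  dlc;       /* data length code: 0..8                            */
    uint8_t  data[8];   /* payload                                           */
} can_msg_t;

/* Example: a broadcast with made-up standard identifier 0x0F3 and 2 data bytes. */
static const can_msg_t rpm_msg = {
    .id = 0x0F3, .extended = false, .rtr = false, .dlc = 2,
    .data = { 0x1A, 0x2B }
};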

Standard data frame fields and lengths (Figure 59):

  start of frame     1 bit
  arbitration field  12 bits (11-bit identifier + RTR bit)
  control field      6 bits (IDE bit + reserved bit r0 + 4-bit data length code)
  data field         0 to 64 bits (0 to 8 bytes)
  CRC field          16 bits (15-bit CRC + CRC delimiter)
  ACK field          2 bits (ACK slot + ACK delimiter)
  end of frame       7 bits
  inter-frame space  (intermission between consecutive messages)


Detecting and signaling errors: Unlike other bus systems, the CAN protocol does not use acknowledgement messages but instead signals any errors that occur. For error detection the CAN protocol implements three mechanisms at the message level:

• Cyclic Redundancy Check (CRC): The CRC safeguards the information in the frame by adding redundant check bits at the transmission end. At the receiver end these bits are re-computed and tested against the received bits. If they do not agree there has been a CRC error.

• Frame check: This mechanism verifies the structure of the transmitted frame by checking the bit fields against the fixed format and the frame size. Errors detected by frame checks are designated “format errors”.

• ACK errors: As mentioned above, frames received are acknowledged by all recipients through positive acknowledgement. If no acknowledgement is received by the transmitter of the message (ACK error) this may mean that there is a transmission error which has been detected only by the recipients, that the ACK field has been corrupted or that there are no receivers.

The CAN protocol also implements two mechanisms for error detection at the bit level.

• Monitoring: The ability of the transmitter to detect errors is based on the monitoring of bus signals: each node which transmits also observes the bus level and thus detects differences between the bit sent and the bit received. This permits reliable detection of all global errors and errors local to the transmitter.

• Bit stuffing: The coding of the individual bits is tested at bit level. The bit representation used by CAN is NRZ (non-return-to-zero) coding, which guarantees maximum efficiency in bit coding. The synchronization edges are generated by means of bit stuffing, i.e. after five consecutive equal bits the sender inserts into the bit stream a stuff bit with the complementary value, which is removed by the receivers. The code check is limited to checking adherence to the stuffing rule.
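A minimal C sketch of the stuffing rule, assuming bits are handled one at a time as small integers (a real controller does this in hardware on the serial bit stream); the function name and buffer representation are illustrative.

#include <stddef.h>

/* Insert a complementary stuff bit after every five consecutive equal
 * bits. 'in' and 'out' hold one bit (0 or 1) per element; returns the
 * stuffed length. 'out' must have room for n + n/5 + 1 elements. */
static size_t bit_stuff(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t o = 0, run = 0;
    unsigned char prev = 2;               /* no previous bit yet */
    for (size_t i = 0; i < n; i++) {
        out[o++] = in[i];
        run = (in[i] == prev) ? run + 1 : 1;
        prev = in[i];
        if (run == 5) {                   /* five consecutive equal bits seen */
            out[o++] = !in[i];            /* insert complementary stuff bit   */
            prev = !in[i];                /* stuff bit counts for later runs  */
            run = 1;
        }
    }
    return o;
}
/* Example: the input 1 1 1 1 1 0 ... becomes 1 1 1 1 1 [0] 0 ... on the bus. */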

If one or more errors are discovered by at least one node (any node) using the above mechanisms, the current transmission is aborted by sending an "error flag". This prevents other nodes from accepting the message and thus ensures the consistency of data throughout the network. After transmission of an erroneous message has been aborted, the sender automatically re-attempts transmission (automatic repeat request). There may again be competition for bus allocation. As a rule, retransmission will begin within 23 bit periods after error detection; in special cases the system recovery time is 31 bit periods. However effective and efficient the method described may be, in the event of a defective node it might lead to all messages (including correct ones) being aborted, thus blocking the bus system if no measures for self-monitoring were taken. The CAN protocol therefore provides a mechanism for distinguishing sporadic errors from permanent errors and for localizing node failures (fault confinement). This is done by statistical assessment of node error situations with the aim of recognizing a node's own defects and possibly entering an operating mode where the rest of the CAN network is not negatively affected. This may go as far as the node switching itself off to prevent messages erroneously recognized as incorrect from being aborted.

Time triggered CAN (TTCAN) has been developed to address the needs of real-time systems with hard time constraints (ISO 11898 part 4). Purely time triggered operation of a communication network means that activity is determined by the progression of globally synchronized time. Communication depends upon a pre-defined time schedule, i.e. defined at design time. This is achieved by:

• time globally synchronized via a global time master
• ability to switch off retransmission of messages if arbitration is lost


Figure 60: An example of a Basic Cycle in time triggered CAN communication. Any sequence of the windows between the reference messages is possible but must be defined at design time.

The main characteristic of TTCAN is that bus access is controlled via a Time Division Multiple Access (TDMA)-like method using a regularly repeating cycle of time called the Basic Cycle (see Figure 60). The Basic Cycle is divided into a fixed number of time windows (i.e. fixed at design time) which can be a mixture of any one of four types:

• Reference Message: This is sent by the time master control unit (global time master) and controls the timing of the Basic Cycle. The Reference Message signifies the start of the Basic Cycle. The global time can be sent in four data bytes leaving the remaining four data bytes available for general data transfer.

• Exclusive Window: This is a time slice long enough to accommodate the message and data to be transmitted. The Exclusive Window is reserved for one particular CAN message only.

• Arbitration Window: In an Arbitration Window a number of nodes may attempt to transmit a message. The nodes contending for bus access during the Arbitration Window do so by the usual non-destructive bitwise arbitration method, so the message with the lowest CAN identifier wins arbitration. In normal CAN systems, nodes losing arbitration attempt to retransmit the message; this is disabled in TTCAN, since a retransmission would upset the remainder of the Basic Cycle.

• Free Window: A Free Window is reserved for future expansion of the TTCAN system. Therefore further nodes can be added at a later date.

All nodes communicating on a bus which is to be time triggered must have the retransmission of messages that have lost arbitration disabled.
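Because a Basic Cycle is fixed at design time, software can capture it as a static table of window descriptors. The C sketch below is one plausible representation under that assumption; the window lengths and message identifiers are made up.

#include <stdint.h>

/* One window of a TTCAN Basic Cycle (illustrative representation;
 * lengths and identifiers are made up, and fixed at design time). */
typedef enum { WIN_REFERENCE, WIN_EXCLUSIVE, WIN_ARBITRATION, WIN_FREE } win_type_t;

typedef struct {
    win_type_t type;
    uint16_t   length_bits;   /* window length in bit times            */
    uint16_t   msg_id;        /* reserved identifier, if exclusive     */
} tt_window_t;

static const tt_window_t basic_cycle[] = {
    { WIN_REFERENCE,   130, 0x000 },  /* time master's reference message */
    { WIN_EXCLUSIVE,   130, 0x0F3 },  /* reserved for one CAN message    */
    { WIN_ARBITRATION, 130, 0     },  /* normal arbitration, no retry    */
    { WIN_FREE,        130, 0     },  /* kept free for future nodes      */
    { WIN_EXCLUSIVE,   130, 0x1A0 },
};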

CAN has enjoyed enormous success in both automotive and automation applications. Future automotive control systems will utilize data communication networks in safety-critical applications. Implementing TTCAN in software would place quite a high workload on the host controller, while a hardware implementation can easily guarantee the hard time constraints necessary in TTCAN. Whilst TTCAN is limited in bandwidth and does not provide for all of the needs of such applications, it is very useful as the enabling technology in first-generation systems. More recent Time Triggered Protocols (TTP) offer higher bandwidths.

1.5.6. Ethernet

Ethernet is very widely used as a local area network for general purpose computing. Because of its ubiquity and the low cost of Ethernet interfaces, it has seen significant use as a network for embedded computing. Ethernet is particularly useful when PCs are used as platforms, making it possible to use standard components, and when the network does not have to meet rigorous real-time requirements.

The physical organization of an Ethernet is very simple, as shown in Figure 61. The network is a bus with a single signal path; the Ethernet standard allows for several different implementations such as twisted pair, coaxial or fiber cable.


Figure 61: Ethernet organization.

Unlike the previously described I2C and CAN buses, nodes on the Ethernet are not synchronized; they can send their bits at any time. I2C and CAN rely on the fact that a collision can be detected and quashed within a single bit time thanks to synchronization. But since Ethernet nodes are not synchronized, if two nodes decide to transmit at the same time, the message will be ruined. The Ethernet arbitration scheme is known as Carrier Sense Multiple Access with Collision Detection (CSMA/CD). A node that has a message waits for the bus to become silent and then starts transmitting. It simultaneously listens, and if it hears another transmission that interferes with its own, it stops transmitting and waits to retransmit. The waiting time is random, but weighted by an exponential function of the number of times the message has been aborted. Figure 62 shows the exponential backoff function both before and after it is modulated by the random wait time. Since a message may be interfered with several times before it is successfully transmitted, the exponential backoff technique helps to ensure that the network does not become overloaded at high demand. The random factor in the wait time minimizes the chance that two messages will repeatedly interfere with each other. The maximum length of an Ethernet is determined by the nodes' ability to detect collisions. The worst case occurs when two nodes at opposite ends of the bus transmit simultaneously; for the collision to be detected by both nodes, each node's signal must be able to travel to the opposite end of the bus so that it can be heard by the other node. In practice, Ethernets can run up to several hundred meters.

Figure 62: Exponential backoff time of Ethernet bus.
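The backoff function of Figure 62 can be sketched in a few lines of C. The sketch below follows the truncated binary exponential backoff rule of classic CSMA/CD Ethernet (random wait of 0 to 2^min(n,10) - 1 slot times after the n-th collision, frame dropped after 16 attempts); the slot-time constant and function name are illustrative assumptions.

#include <stdlib.h>

#define SLOT_TIME_US 51   /* illustrative slot time, roughly 51.2 us at 10 Mbit/s */

/* Truncated binary exponential backoff: after the n-th collision wait
 * a random number of slot times in [0, 2^min(n,10) - 1]; give up after
 * 16 attempts. The random factor is the "dither" shown in Figure 62. */
static long backoff_us(int collisions)
{
    if (collisions > 16)
        return -1;                        /* excessive collisions: drop frame */
    int exp = collisions < 10 ? collisions : 10;
    long slots = rand() % (1L << exp);    /* random wait, exponentially weighted */
    return slots * SLOT_TIME_US;
}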

Figure 63 shows the basic format of an Ethernet packet. It provides addresses of both the destination and the source. It also provides for a variable-length data payload. The fact that it may take several attempts to successfully transmit a message and that the waiting includes a random factor makes Ethernet performance difficult to analyze. It is possible to perform data streaming and other real-time activities on Ethernets, particularly when the total network load is kept to a reasonable level, but care must be taken in designing such systems.

Figure 63: Ethernet packet format.

(Figure 62 plots the wait time against the number of transmission attempts: an exponential weighting factor modulated by random dithered wait times.)

(Figure 63 packet fields: preamble, start frame, destination address, source address, length, data, padding, CRC.)

1.5.7. Internet

The Internet Protocol (IP) is the fundamental protocol on the Internet. It provides connectionless, packet-based communication. Industrial automation has long been a good application area for Internet-based embedded systems. Information appliances that use the Internet are rapidly becoming another use of IP in embedded computing.

IP is not defined over a particular physical implementation; it is an inter-network standard. Internet packets are assumed to be carried by some other network, such as Ethernet. In general, an Internet packet will travel over several different networks from source to destination. IP allows data to flow seamlessly through these networks from one end user to another. The relationship between IP and individual networks is illustrated in Figure 64. IP works at the network layer of the OSI model. When node A wants to send data to node B, the application's data pass through several layers of the protocol stack to get to the Internet Protocol. IP creates packets for routing to the destination, which are then sent to the data link and physical layers. A node that transmits data among different types of networks is known as a router. The router's functionality must go up to the IP layer, but since it is not running applications, it does not need to go to higher levels of the OSI model. In general, a packet may go through several routers to get to its destination. At the destination, the IP layer provides data to the transport layer and ultimately to the receiving application. As data pass through several layers of the protocol stack, the IP packet data are encapsulated in packet formats appropriate to each layer.

Figure 64: Protocol utilization in Internet communication.

The basic format of an IP packet is shown in Figure 65. The header and data payload are both of variable length. The maximum total length of the header and data payload is 65535 bytes.

Figure 65: IP packet structure.
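The header fields of Figure 65 can be written down as a C struct. The sketch below follows the classic IPv4 header layout (RFC 791); since bit-field ordering is compiler-dependent, real network code extracts the packed sub-fields with shifts and masks, so this struct is for illustration only.

#include <stdint.h>

/* IPv4 header fields as in Figure 65 (conceptual sketch; the 4-bit and
 * 3-/13-bit sub-fields are shown packed into bytes/halfwords). */
typedef struct {
    uint8_t  version_ihl;      /* 4-bit version + 4-bit header length (words) */
    uint8_t  service_type;     /* type of service                             */
    uint16_t total_length;     /* header + data, at most 65535 bytes          */
    uint16_t identification;   /* fragment identification                     */
    uint16_t flags_fragment;   /* 3-bit flags + 13-bit fragment offset        */
    uint8_t  time_to_live;     /* decremented by each router                  */
    uint8_t  protocol;         /* payload protocol, e.g. 6 = TCP, 17 = UDP    */
    uint16_t header_checksum;  /* checksum over the header only               */
    uint32_t source_address;
    uint32_t destination_address;
    /* followed by options + padding, then the variable-length data payload */
} ipv4_header_t;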

An Internet address is a number (32 bits in early versions of IP, 128 bits in IPv6). The IP address is typically written in the format xxx.xx.xx.xx. The names by which users and applications typically refer to Internet nodes, such as www.microlab.ch, are translated into IP addresses via calls to a Domain Name Server (DNS), one of the higher-level services built on top of IP.

The fact that IP works at the network layer tells us that it does not guarantee that a packet is delivered to its destination. Furthermore, packets that do arrive may come out of order. This is referred to as best-effort routing. Since routes for data may change quickly, with subsequent packets being routed along very different paths with different delays, the real-time performance of IP can be hard to predict. When a small network is contained totally within the embedded system, performance can be evaluated through simulation or other methods because the possible inputs are limited. Since the performance of the Internet may depend on worldwide usage patterns, its real-time performance is inherently harder to predict.

The Internet also provides higher-level services built on top of IP. The Transmission Control Protocol (TCP) is one such example. It provides a connection-oriented service that ensures that data arrive in the appropriate order, and it uses an acknowledgement protocol to ensure that packets arrive. Because many higher-level services are built on top of TCP, the basic protocol is often referred to as TCP/IP. Figure 66 shows the relationship between IP and higher-level Internet services. Using IP as the foundation, TCP is used to provide the File Transfer Protocol (FTP) for batch file transfers, the Hypertext Transport Protocol (HTTP) for World Wide Web services, the Simple Mail Transfer Protocol (SMTP) for email, and Telnet for virtual terminals. A separate transport protocol, the User Datagram Protocol (UDP), is used as the basis for the network management services provided by the Simple Network Management Protocol (SNMP). Interested readers may consult the corresponding literature for more information about the different higher-level services of the Internet service stack.

Figure 66: The Internet service stack.
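As a small illustration of the service stack, the following C sketch opens a TCP connection with the BSD sockets API: the application hands its data to TCP, which runs over IP, which runs over the underlying network. The address (a documentation address) and the HTTP-like request text are placeholders, not part of the course material.

#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);        /* TCP endpoint */
    struct sockaddr_in srv = { 0 };
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(80);
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* placeholder address */

    if (fd >= 0 && connect(fd, (struct sockaddr *)&srv, sizeof srv) == 0) {
        const char req[] = "GET / HTTP/1.0\r\n\r\n";
        write(fd, req, strlen(req));                 /* application data in */
        char buf[512];
        ssize_t n = read(fd, buf, sizeof buf);       /* TCP delivers in order */
        if (n > 0) fwrite(buf, 1, (size_t)n, stdout);
    }
    close(fd);
    return 0;
}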

1.5.8. Network-Based Design

Designing a distributed embedded system around a network involves scheduling of computations in time and allocating them to processing elements. Scheduling and allocation of communication are important additional design tasks required for many distributed networks. Many embedded networks are designed for low cost and therefore do not provide excessively high communication speed. If we are not careful, the network can become the bottleneck in system design. In this section we concentrate on design tasks unique to network-based distributed embedded systems.

Communication Analysis

We know how to analyze the execution time of programs and systems of processes on single CPUs, but to analyze the performance of networks we must know how to determine the delay incurred by transmitting messages. Let us assume for the moment that messages are sent reliably; we do not have to retransmit a message. The message delay for a single message with no contention (as would be the case in a point-to-point connection) can be modeled as

tm = tx + tn + tr

where tx is the transmitter-side overhead, tn is the network transmission time, and tr is the receiver-side overhead. In I2C, tx and tr are negligible relative to tn. If messages can interfere with each other in the network, analyzing communication delay becomes difficult. In general, because we must wait for the network to become available and then transmit the message, we can write the message delay as

ty = td + tm


where td is the network availability delay incurred waiting for the network to become available. The main problem, therefore, is calculating td. That value depends on the type of arbitration used in the network.

• If the network uses fixed-priority arbitration, the network availability delay is unbounded for all but the highest-priority device. Since the highest-priority device always gets the network first, unless there is an application-specific limit on how long it will transmit before relinquishing the network, it can keep blocking the other devices indefinitely.

• If the network uses fair arbitration, the network availability delay is bounded. In the case of round-robin arbitration, if there are N devices, then the worst-case network availability delay is N(tx + tarb), where tarb is the delay incurred for arbitration. tarb is usually small compared to transmission time.
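Putting the two formulas together for a fair, round-robin network gives a simple worst-case bound. The C helper below just evaluates ty = N(tx + tarb) + tx + tn + tr; the function name and all parameter values in the example are made up.

/* Worst-case message delay on a round-robin network:
 *   tm = tx + tn + tr       (no contention)
 *   td = N * (tx + tarb)    (wait for N other devices)
 *   ty = td + tm
 * Units are arbitrary time units. */
static double worst_case_delay(int n_devices, double tx, double tn,
                               double tr, double tarb)
{
    double tm = tx + tn + tr;
    double td = n_devices * (tx + tarb);
    return td + tm;
}
/* Example: 4 devices, tx = tr = 0.1, tn = 4.0, tarb = 0.05
 * gives ty = 4 * 0.15 + 4.2 = 4.8 time units. */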

Even when round-robin arbitration is used to bound the network availability delay, the waiting time can be very long. It is worthwhile to examine the application to determine whether the message structure can be readjusted to reduce td as shown in the example below:

Example: Adjusting messages to improve network delay:

Assume we want to implement the task graph on the network shown in Figure 67. We allocate task P1 to processing element PE1, P2 to PE2, and P3 to PE3. P1, P2, and P3 each run for four time units. A complete transmission of either d1 or d2 also takes four time units. The task graph shows that P3 cannot start until it receives its data from both P1 and P2 over the bus network. The simplest implementation transmits all required data in one large message, which is four packets long in this case. Figure 67 shows a schedule based on that message structure. P3 does not start until time 11, when the transmission of the second message has been completed. The total schedule length is 16.

Figure 67: Example of task graph running on network. Network timing for simple implementation.

Let's redesign P3 so that it does not require all of both messages to begin. We modify the program so that it reads one packet of data each from d1 and d2 and starts computing on that. If it finishes what it can do on that data before the next packets from d1 and d2 arrive, it waits; otherwise, it picks up the packets and keeps computing. This organization allows us to take advantage of concurrency between processing element PE3 and the network, as shown by the schedule in Figure 68. Reorganizing the messages so that they can be sent concurrently with P3's execution reduces the schedule length from 16 to 13, even with P3 stopping to wait for more data from P1 and P2.


Figure 68: Example with reorganized scheduling.

Our process scheduling model assumed that we could interrupt processes at any point. But network communications are organized into packets, and in most networks we cannot interrupt a packet transmission to take over the network for a higher-priority packet. As a result, networks exhibit priority inversion. When a low-priority message is on the network, the network is effectively allocated to that low-priority message, allowing it to block higher-priority messages. This cannot cause deadlock, since each message has a bounded length, but it can slow down critical communications. The only solution is to analyze network behavior to determine whether priority inversion causes some messages to be delayed for too long.

Scheduling and allocation of computations and communication are clearly interrelated. If we change the allocation of computations, we change not only the scheduling of processes on those PEs but also potentially the schedules of the PEs with which they communicate. For example, if we move a computation to a slower PE, its results will be available later, which may mean rescheduling both the process that uses the value and the communication that sends the value to its destination.

System Performance Analysis

Unfortunately, analyzing the performance of distributed embedded systems is very difficult. We can understand some of the difficulties by starting with simple cases. Figure 69 shows a very simple task graph with two processes and one n-packet data communication. The worst-case execution times of the processes are tp1 and tp2, and the communication takes ntx time units. This case is simple because there is no interference from outside elements. We know that the two processes cannot execute simultaneously, since there is a data dependency between them and there are no other processes in the system to interfere. Similarly, nothing can interfere with the communication, and everything happens at a single rate. In this case the worst-case execution time is tp1 + ntx + tp2.

Figure 69: Delay through a simple task graph.

When we allow computations and communications to interfere with each other, performance analysis becomes much harder. Consider the example of Figure 70, where we have superimposed the task graph on the target architecture. P2 and P3 run on the same PE, which helps enable the following chain of events that can affect the whole system.

• The data dependency from P1 to P2 translates any uncertainty in the execution time of P1 into uncertainty about the start time of P2.

• The co-allocation of P2 and P3 on processing element PE2 means that variations in the ready time of P2 can affect the completion time of P3. Of course, variations in the execution time of P2 also affect P3.

• The data dependency from P3 to P4 translates variations in the completion time of P3 to the start time of P4.


Figure 70: A distributed system with multirate concurrency.

Therefore, even though P2 and P3 are in separate tasks, the fact that they are allocated to the same PE causes them to interact with each other in ways that affect completion time of every task in the system. Similarly, messages from the two tasks can interfere with each other on the bus, causing yet more variations in completion time.

Complex distributed embedded systems require CAD tools to accurately analyze performance. Algorithms can be used to efficiently determine upper and lower bounds on the start and completion times of processes. If you don't have tools to help you analyze performance, care is essential when hand-designing a system that has to meet hard real-time deadlines. When a system is trying to meet a hard deadline, it is important to make sure that only one task – the critical task – is active. In many cases, user interface activity and other nonessential tasks can be turned off temporarily. When there are several critical tasks that must run simultaneously, hand design requires allocating them so that they share nothing: no PEs and no communication links. While this is a conservative design strategy, it makes hand analysis feasible. CAD tools can help loosen some of these restrictions and allow more hardware-efficient designs.

Hardware Platform Design, Allocation, and Scheduling

Now that we know how to compute the delay for messages, we can develop strategies for designing the schedule and allocation of processes and communication. Designing the hardware platform is necessarily closely related to our choices in scheduling and allocating processes. We want to use only as much hardware as is necessary, but we cannot know how much hardware to use until we can construct a system schedule. Creating that schedule requires an allocation of processes to PEs, which in turn requires knowing the available hardware. When designing the platform, we have the following design choices to make:

• number of PEs required
• types of all PEs
• number of networks required
• types (and data rates) of the network

In making these choices, we need to construct allocations and schedules for the processes to evaluate the platform. In turn, allocation and scheduling are driven by system performance analysis.

Depending on the type of system we are designing, the following two strategies may be useful to help us quickly come up with an efficient system:

• For I/O-intensive systems we will start with the I/O devices and their associated processes.

• For computation-intensive systems we will start with the processes.

For systems that do a lot of I/O, we definitely need to support the I/O devices themselves and perhaps do some processing of the data locally before shipping the data over the network:

• Inventory the required I/O devices.
• Determine which processing has deadlines short enough that they cannot be met by any network within your price range. I/O devices that do not require local processing may be attached to the network with the simplest available interface.
• Determine which devices can share a processing element or network interface.
• Analyze communication times to determine whether critical communications may interfere with each other. Determine whether a complex network or multiple networks may be required to satisfy communication deadlines.
• Allocate the minimum required PE to go with the I/O device.
• Design the rest of the system using the procedure for computation-intensive systems.

For computation-intensive systems, we want to consider the processes, their deadlines, and their communication as follows:

• Start with the tasks with the shortest deadlines. The shorter a task's deadline, the more likely it is to require its own processing element or elements. If a high-priority task shares a PE with a lower-priority task, not only will a more expensive PE be required, but additional scheduling overhead will also be incurred.

• Analyze communication times to determine whether critical communications may interfere with each other.

• Allocate lower-priority tasks to share PEs where possible.

After we have designed a basic system that meets our performance goals, we can improve it to satisfy power consumption or other requirements. Once you have an initial allocation, use the system schedule as a guide for fine tuning. By reallocating processes you may be able to improve one or more attributes, such as hardware cost, slack time in the schedule, power consumption, and so on. In particular, load balancing is often a good idea. If you have some PEs that are more heavily loaded than others, it may be possible to move some processes to other PEs. Doing so can reduce the chance that the system fails to meet a deadline due to mistaken estimation of run time.


1.6. References

[1] AMBA Specification, revision 2.0, 1999, ARM Limited

[2] Computers as Components: Principles of Embedded Computing System Design, W. Wolf, Morgan Kaufmann Publishers, 2001

[3] Open Systems Interconnection Reference Model, Day and Zimmermann, 1983

[4] Distributed Operating Systems, Andrew Tanenbaum, Prentice Hall, 1995

[5] Using the 8XC751 Microcontroller as an I2C-Bus Master, Philips Semiconductor, Application Note AN422, 1993, http://www.semiconductors.philips.com/acrobat/applicationnotes/AN422.pdf

[6] Bi-directional Level Shifter for I2C-Bus and Other Systems, Philips Semiconductor, Application Note AN97055, 1997, http://www.semiconductors.philips.com/acrobat/applicationnotes/AN97055.pdf

[7] CAN Specification Version 2.0, Robert Bosch GmbH, Stuttgart, 1991

[8] CAN home page at Bosch: http://www.can.bosch.com