[ieee comput. soc 12th international workshop on rapid system protyping. rsp 2001 - monterey, ca,...

An Approach to Mapping the Timing Behavior of VLSI Circuits on Emulators

Pirouz Bazargan Sabet

LIP6 - University of Paris 6, 4 Place Jussieu - 75005 Paris - France

Pirouz.Bazargan-Sabet@lip6.fr

Laurent Vuillemin

LIP6 - University of Paris 6, 4 Place Jussieu - 75005 Paris - France

Laurent.Vuillemin@lip6.fr

Abstract

The time spent in simulation grows exponentially with the complexity of the circuit. Therefore, improving the simulation speed can significantly reduce the verification time. Several approaches can be used to speed up the simulation. In recent years, FPGAs have been used to build emulators. These systems are composed of several thousand FPGAs connected through a programmable network. Although this approach is very attractive in terms of speedup, not all the information included in a circuit description can be mapped onto the emulator. In this paper, we propose a method to reproduce the timing behavior of the circuit on an emulator.

1. Introduction

The impact of the growing complexity of digital integrated circuits on the time devoted to the verification task is a well-known problem. Simulation tools are certainly among the software most often used to certify the correctness of a design. Simulation is used not only to develop and debug the initial specifications, but also to verify the various intermediate descriptions obtained during the design as well as the final extracted layout. Thanks to efficient simulation algorithms, the computation time required to simulate a given configuration varies linearly with the complexity of the circuit. However, the number of patterns needed to reach a satisfactory level of reliability grows exponentially with the number of registers in the circuit. Thus, a great effort has been, and continues to be, made to speed up simulation and so limit the time spent in this particular verification task.

Optimized simulation algorithms, such as Event-Driven and Demand-Driven [1], have been developed to reduce the number of evaluations. Research then focused on distributed simulation, trying to take advantage of the memory and computational resources available in local computer networks. Innovative algorithms [2] have been proposed to partition the simulation workload across several workstations, to balance the load of the different processors and to find a suitable tradeoff between the communication overhead and the evaluation load [3]. Particular emphasis has been put on developing different synchronization strategies [4] and on their optimization [5]. Nevertheless, the speedup obtained by distributing the simulation load over several computers connected through a local network is not sufficient to compensate for the growth in circuit complexity.

At the same time, special-purpose hardware called hardware accelerators [6][7], designed to run the simulation algorithms efficiently, has been commercialized. These accelerators can reach a speedup of several orders of magnitude over classic sequential simulators. However, the cost of this kind of hardware remains very high relative to the speedup obtained.

More recently, the development of FPGAs has opened a new perspective. Composed of several thousand FPGAs connected through a programmable network, emulators offer the fastest verification speed [8]. The description is first synthesized and mapped onto the programmable circuits. Then, the resulting hardware prototype is stimulated by simulation patterns or placed in its real environment (in-circuit emulation). The prototype typically runs at a frequency 10 to 100 times lower than the clock rate of the real circuit. Even if the cost of emulators is comparable to that of hardware accelerators, the speedup obtained over the latter makes this solution much more attractive. In addition, the high emulation speed combined with the in-circuit emulation feature allows concurrent development of the circuit and its environment, including the software. From the designer's point of view, this factor plays an important role in reducing the time-to-market.

However, unlike hardware accelerators, timing information cannot be reproduced on emulators and the simulation remains purely functional. Even if timing verification can be covered by static timing analysis tools, the limitations inherent to the static verification and the designers' experience still leave an important place to the timed logic simulation.

1074-6005/01 $10.00 © 2001 IEEE

In this paper, we propose a method to reproduce the timing specifications of the circuit on the FPGAs during the synthesis step. In the next section, we review the different types of delays that may be specified in a description. Section 3 gives an overview of a solution proposed in a previous paper. Section 4 details the way the emulator may be used to reproduce the timing behavior. Section 5 presents some experimental results, and section 6 gives the conclusion and outlines future work.

2. The Delay Model

Timing specifications are used in a description to represent the propagation of information through the circuit. In timed logic simulation, the circuit is described as a set of processes connected through signals. Basically, each signal is assigned by a single process. Each time a signal assignment is evaluated by the simulator, a transaction is produced. A transaction comprises the signal's identifier, the value assigned to the signal and a delay.

A special-purpose memory, called the scheduler, is used to keep track of transactions. In the scheduler, transactions are ordered by their absolute date (the absolute date is the sum of the current simulation date and the transaction's delay).
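As an illustration only, the scheduler described above can be modeled in software as a priority queue keyed by absolute date. The following Python sketch is not taken from the paper: the class name `Scheduler` and the methods `post` and `pop_due` are hypothetical, and dates are assumed to be integer multiples of the simulator's resolution.

```python
import heapq


class Scheduler:
    """Hypothetical software model of a scheduler ordered by absolute date."""

    def __init__(self):
        self._heap = []    # entries: (absolute_date, sequence, signal, value)
        self._count = 0    # sequence number keeps insertion order for equal dates

    def post(self, now, signal, value, delay):
        """Store a transaction; its absolute date is the current date plus the delay."""
        date = now + delay
        heapq.heappush(self._heap, (date, self._count, signal, value))
        self._count += 1
        return date

    def pop_due(self, now):
        """Extract every transaction whose absolute date has been reached."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _date, _seq, signal, value = heapq.heappop(self._heap)
            due.append((signal, value))
        return due
```

A transaction posted at date 0 with a delay of 5 is returned by `pop_due` only once the current date reaches 5, whatever the order in which transactions were posted.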

The delay specified for each signal assignment may have various levels of precision. A delay may be either defined statically or calculated dynamically during the simulation. The calculation of an accurate dynamic delay has to take into account a great number of parameters. Among other factors, this calculation depends on the input that has caused the transition, the edge of the input (falling or rising), its slope, the capacitance connected to the signal, the fanout of the signal, the length and the topology of the interconnections and even the environment of the signal. For example, the delay of a gate may vary if, during the transition of its output, its neighboring signals are also making a transition (crosstalk effect).

Obviously, the model used to calculate the delay of each assignment has an important impact on the performance of the simulation. A dynamic model, even the simplest one, may slow down the simulation by several orders of magnitude compared to a static model. Moreover, including a dynamic delay calculation inside the description leads to a more complex model, which is harder to understand and to debug.

We believe that using a dynamic delay model is unrealistic in a timed logic simulation that targets an entire system containing millions of gates and that must be simulated for billions of cycles.

Another aspect is the type of the delay. In hardware description languages such as VHDL [9], a delay may represent a simple propagation delay or an inertial delay. Whenever an inertial transaction is sent to the scheduler, all the transactions concerning the same signal that have a different value and an earlier absolute time must be removed from the scheduler. Non inertial transactions are simply added to the transaction list of the signal. Thus, when mapping the timing behavior of a circuit onto an emulator, the type of the delay used in the signal assignment must be taken into account.
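The difference between the two delay types can be made concrete with a small sketch. The Python function below applies the rule exactly as stated in the paragraph above to the pending transactions of one signal; it is a simplified illustration, not the complete VHDL semantics, and the name `post_transaction` and the `(date, value)` tuple layout are assumptions made for this example.

```python
def post_transaction(pending, date, value, inertial):
    """Return the new pending list of one signal after posting a transaction.

    pending  : list of (absolute_date, value) tuples for this signal
    inertial : True for an inertial delay, False for a non inertial one
    """
    if inertial:
        # Remove transactions that are earlier AND carry a different value.
        pending = [(d, v) for (d, v) in pending
                   if not (d < date and v != value)]
    else:
        pending = list(pending)   # non inertial: keep everything
    pending.append((date, value))
    return sorted(pending, key=lambda t: t[0])
```

For instance, with a pending transaction `(5, 1)`, posting `(7, 0)` as an inertial transaction leaves only `(7, 0)`, so the short pulse never reaches the signal, whereas posting it as a non inertial transaction keeps both.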

3. Hardware Simulator

The most obvious way of reproducing the timing behavior of the circuit in the emulator is to map onto the FPGAs not the target circuit but the simulation algorithm.

Using this approach, the emulator is configured as a hardware accelerator able to execute the simulation algorithm. The flexibility of the programmable devices is used to parameterize the accelerator according to the circuit's characteristics.

This approach was tried in a previous work [10]. In a first implementation, a sequential simulator using an Event-Driven algorithm was built. The simulator was composed of three units: a Scheduler, a Transaction Management Unit (TMU) and an Elementary Simulator (ES) (Fig. 1).

Fig. 1: Hardware Simulator's Architecture

The ES is in charge of computing the new values of signals. It is mostly composed of read-only memories. The description is first synthesized using 4-input gates. The resulting netlist is then mapped onto a ROM. For each gate, the netlist ROM contains the addresses of the inputs, the truth table and the static delay of the gate. The fanout of each gate is extracted from the netlist and stored in the Forward dependency ROM. The current values of signals are stored in a RAM. A second RAM holds the gates that must be evaluated at each cycle. Each evaluation produces a transaction, which is sent to the TMU (Fig. 2).

Transactions that have a null delay specification are stored locally in the TMU and used in the next delta-cycle to update the signals. Transactions that have a non-null delay are sent to the scheduler and stored in a memory.

Fig. 2: Details of the Elementary Simulator

Basically, the scheduler is a memory addressed by the signal's identifier and the transaction's absolute date. Transactions received from the TMU are stored one by one in this memory. During the update phase, all the transactions related to the current time are extracted from the scheduler, loaded into a shift register and sent one by one through the TMU to the ES to update the current values (Fig. 3).

Fig. 3: The Scheduler
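The interplay of the three units can be sketched in software. The Python model below is a minimal illustration under stated assumptions, not the hardware of [10]: the function `simulate`, the dictionary-based netlist and the integer time base are hypothetical, gates are plain Python callables rather than 4-input truth tables, all delays are non inertial, and zero-delay feedback loops are assumed to be absent (they would keep the delta-cycle loop running forever).

```python
from collections import defaultdict


def simulate(gates, values, stimuli, end_time):
    """Event-driven simulation with update and execute phases.

    gates   : {output_signal: (function, [input_signals], static_delay)}
    values  : {signal: current_value}, updated in place
    stimuli : {absolute_date: [(signal, value), ...]}
    Returns the list of (date, signal, value) changes that occurred.
    """
    # Forward dependency table: which gates read each signal.
    fanout = defaultdict(list)
    for out, (_func, inputs, _delay) in gates.items():
        for sig in inputs:
            fanout[sig].append(out)

    # Scheduler: transactions grouped by absolute date.
    scheduler = defaultdict(list)
    for date, transactions in stimuli.items():
        scheduler[date].extend(transactions)

    trace = []
    for now in range(end_time + 1):
        pending = scheduler.pop(now, [])
        while pending:                       # one iteration per delta-cycle
            to_eval = []
            for sig, val in pending:         # update phase
                if values[sig] != val:
                    values[sig] = val
                    trace.append((now, sig, val))
                    for gate in fanout[sig]:
                        if gate not in to_eval:
                            to_eval.append(gate)
            pending = []
            for out in to_eval:              # execute phase
                func, inputs, delay = gates[out]
                val = func(*(values[s] for s in inputs))
                if delay == 0:
                    pending.append((out, val))   # used in the next delta-cycle
                else:
                    scheduler[now + delay].append((out, val))
    return trace
```

The sequential nature of both phases is visible here: transactions are applied one at a time and gates are evaluated one at a time.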

Experiments showed that, during the execute phase, the scheduler was mostly waiting for the transactions coming from the ES. During the update phase, even if all the transactions could be extracted at once from the scheduler, they had to be sent one by one to the TMU.

Therefore, a second version of the simulator was implemented to enhance the simulation's performance. This implementation was based on a synchronous distributed simulator. To fill the bandwidth of the scheduler, several ESs were connected to a scheduler. The scheduler itself was split into several synchronous local schedulers to increase the bandwidth for input transactions (Fig. 4).

Using this approach, transactions can be stored as they are produced by the ESs. In addition, the number of ESs connected to a given scheduler can be tuned to fill the bandwidth of the scheduler. During the update phase, the transactions related to the current time are extracted at the same time from the schedulers and loaded into shift registers. A separate shift register is used to feed each ES, and all the ESs can be updated in parallel.

However, updating signals inside a given ES remains sequential, and each ES can process only a single transaction at a time.

Fig. 4: Distributed Simulator

4. Timing Emulation

To resolve this problem, a solution is to build a specific scheduler dedicated to each single signal of the circuit. From the hardware point of view, since each scheduler is 1 bit wide, it seems unrealistic to use a memory component to implement it. Instead, a sequential circuit has to be designed to fulfil the scheduler's function. This sequential circuit has to be parameterized by the type of the delay (inertial or non inertial) and its value.

Fig. 5: A non inertial delay

A simple way to implement a non inertial delay is to use a FIFO (Fig. 5). The signal's value is produced by a gate that drives the signal. At each clock cycle, a new value, produced by the driver, is loaded into the FIFO and the old values are shifted. Thus, a clock cycle of the emulator represents a time step for the simulator. The FIFO's depth depends on the value of the delay attributed to the signal and on the resolution of the simulator. For example, a 10-stage FIFO is required to map a delay of 20 ps with a resolution of 2 ps.
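The behavior of this FIFO can be mimicked by a short cycle-level model. The Python sketch below is only an illustration of the principle: the class name `FifoDelay`, its `clock` method and the assumption that the delay is an exact multiple of the resolution are ours, not the paper's.

```python
from collections import deque


class FifoDelay:
    """Hypothetical model of a non inertial delay built from a shift register."""

    def __init__(self, delay, resolution, initial=0):
        if delay < resolution or delay % resolution != 0:
            raise ValueError("delay must be a positive multiple of the resolution")
        self.depth = delay // resolution          # e.g. 20 ps / 2 ps = 10 stages
        self.stages = deque([initial] * self.depth, maxlen=self.depth)

    def clock(self, driver_value):
        """One emulator clock cycle, i.e. one simulation time step."""
        out = self.stages[0]                # oldest value reaches the output
        self.stages.append(driver_value)    # maxlen drops the oldest stage
        return out
```

With `FifoDelay(20, 2)`, a value applied by the driver at a given cycle appears at the output ten cycles later.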

Since the depth of the FIFO depends on the resolution, a great number of stages could be necessary to implement a delay inside a high-resolution simulator. However, the driver does not produce a different value at each time step, and most of these stages are used to store the same value.

Another solution is to load a new value into the FIFO only when a new, different value is produced by the driver. Using this technique, the FIFO must retain not only the signal's value but also the date of the transaction. Figure 6 shows a simplified view of the solution.

Fig. 6: A non inertial delay

The different signal’s values are stored in the Value FIFO. The time interval between two subsequent values is stored in the Date FIFO.

Whenever a transition is detected at the output of the gate, a Push command is sent to the Value and the Date FIFOs. The new value is pushed into the Value FIFO. If the FIFO was empty, the delay of the gate is pushed into the Date FIFO and the Interval Counter is initialized to zero. Otherwise, the value of the Interval Counter is pushed into the Date FIFO. The Time Counter is considered as the first stage of the Date FIFO. At each cycle, the Time Counter is decremented. When the Time Counter reaches zero, the delay of the transaction has elapsed and a Pop command is sent to the FIFOs and to the output register. Receiving a Pop command shifts the values in the Value FIFO and the intervals in the Date FIFO. A new interval is then loaded into the Time Counter.
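The Push and Pop mechanism can be sketched as a cycle-level model. In the Python sketch below, the class `EventDelay` and its attribute names are hypothetical. Two points are assumptions that the description above leaves open: the Interval Counter is taken to be incremented at every cycle and reset after every Push, and within one cycle the Time Counter is processed before the event detector.

```python
from collections import deque


class EventDelay:
    """Hypothetical model of the event-based non inertial delay."""

    def __init__(self, delay_steps, initial=0):
        if delay_steps < 1:
            raise ValueError("delay must be at least one time step")
        self.delay = delay_steps
        self.values = deque()    # Value FIFO
        self.dates = deque()     # Date FIFO; dates[0] acts as the Time Counter
        self.interval = 0        # Interval Counter (cycles since the last Push)
        self.last_in = initial   # memory of the event detector
        self.out = initial       # output register

    def clock(self, driver_value):
        # Time Counter: count down and Pop when the delay has elapsed.
        if self.dates:
            self.dates[0] -= 1
            if self.dates[0] == 0:
                self.dates.popleft()
                self.out = self.values.popleft()
        self.interval += 1
        # Event detector: Push only when the driver value changes.
        if driver_value != self.last_in:
            self.values.append(driver_value)
            # Empty FIFO: push the gate delay; otherwise push the interval.
            self.dates.append(self.interval if self.dates else self.delay)
            self.interval = 0    # assumption: reset after every Push
            self.last_in = driver_value
        return self.out
```

Under these assumptions the output is the driver waveform shifted by `delay_steps` cycles, as with the simple FIFO, but the FIFOs only hold one entry per transition.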

In this approach, the depth of the FIFOs depends on three parameters. As in the previous solution, the depth depends on the delay’s value and on the resolution. The third parameter is the number of events that may be generated, during the simulation, on a given signal within an interval equal to its delay. This parameter may be calculated statically before the synthesis step.

A similar approach can be used to implement inertial delays. Again, two implementations can be considered.

The first implementation uses a single Value FIFO (Fig. 7).

Fig. 7: An inertial delay

At each clock cycle, a new value is loaded into the FIFO. However, unlike non inertial transactions, the insertion of an inertial transaction has to remove all the transactions that have an earlier date and a different value. In other words, if at some stage the value stored in the FIFO is different from the current value, then all the following stages must hold the same value. Thus, if the new value produced at the gate's output is the same as the current value and different from the value present in the last stage of the FIFO, then all the values in the FIFO must be set to the same value.

In this implementation, again, the depth of the FIFO depends on the resolution and on the delay's value. Moreover, it can be noticed that the FIFO contains at most one transition.
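One possible reading of this flush rule is sketched below in Python. The class `InertialFifo` is hypothetical, and the model rests on two interpretations that should be treated as assumptions: the "current value" is taken to be the content of the output register, and the "last stage" is taken to be the most recently loaded one. The oldest stage is moved to the output before the rule is checked, so that a pulse lasting exactly the delay is still propagated.

```python
from collections import deque


class InertialFifo:
    """Hypothetical model of an inertial delay built from a single Value FIFO."""

    def __init__(self, delay_steps, initial=0):
        if delay_steps < 1:
            raise ValueError("delay must be at least one time step")
        self.stages = deque([initial] * delay_steps)
        self.out = initial

    def clock(self, driver_value):
        # The oldest stage reaches the output register.
        self.out = self.stages.popleft()
        # Flush rule: the driver is back to the output value while a different
        # value is still travelling through the FIFO, so the pulse was shorter
        # than the delay and every remaining stage is overwritten.
        if (self.stages and driver_value == self.out
                and self.stages[-1] != driver_value):
            self.stages = deque([driver_value] * len(self.stages))
        self.stages.append(driver_value)
        return self.out
```

With a delay of three time steps, a pulse two steps wide never reaches the output, while a pulse three steps wide is reproduced three steps later.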

The second implementation is based on the observation that the FIFO contains at most one transition. Figure 8 gives a simplified view of this second approach. The basic idea is that the same value must be maintained at the output of an inertial gate at least for a duration equal to the delay of the gate.

An event detector is connected to the gate's output. Whenever the output makes a transition, the event detector initializes the Time Counter to the delay's value. Then, at each clock cycle, the counter is decremented. When the Time Counter reaches zero, a Load command is sent to the output register, which is in charge of maintaining the output value. The Load command puts the new value on the output.

Thus, the output of the gate must be stable at least for a period equal to the inertial delay in order to become visible at the output.
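The counter-based scheme can be modeled in a few lines. The Python class `InertialCounter` below is a hypothetical sketch; the order of operations inside one cycle (decrement and Load first, event detection second) is an assumption chosen so that a value stable for exactly the delay is loaded on the output.

```python
class InertialCounter:
    """Hypothetical model of an inertial delay built from a Time Counter."""

    def __init__(self, delay_steps, initial=0):
        if delay_steps < 1:
            raise ValueError("delay must be at least one time step")
        self.delay = delay_steps
        self.counter = 0         # Time Counter, 0 means idle
        self.last_in = initial   # memory of the event detector
        self.out = initial       # output register

    def clock(self, driver_value):
        # Time Counter: count down, Load the stable value when it reaches zero.
        if self.counter > 0:
            self.counter -= 1
            if self.counter == 0:
                self.out = self.last_in
        # Event detector: any transition restarts the Time Counter.
        if driver_value != self.last_in:
            self.counter = self.delay
            self.last_in = driver_value
        return self.out
```

The state reduces to one counter holding values up to the delay, so its width grows with the logarithm of the delay, whereas the FIFO of the first implementation needs one stage per time step.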

Fig. 8: An inertial delay

5. Results

The above techniques used to map the timing behavior of a circuit on an emulator have been implemented on a SimExpress from Mentor-Graphics.

In a first step, the different sequential circuits used to reproduce the inertial and non inertial delays have been compared.

Figure 9 gives the number of 4-input FPGA gates (BLPs) required by the two proposed schemes to implement the inertial delay. It clearly shows that the technique based on a Time Counter (Fig. 8) is nearly always more efficient than the FIFO (Fig. 7). The FIFO method uses fewer BLPs only for very small delays, and even in these cases the difference is not significant.

Fig. 9: Nbr of BLP vs. inertial delay (curves: logarithmic version, linear version)

Figure 10 compares the number of BLPs required to implement the non inertial delay. For the technique that relies on counters (Fig. 6), the number of BLPs depends on the delay and on the number of transactions that the signal can receive within a time window equal to its delay.

Here, the comparison is clearly in favor of the scheme using the simple FIFO (Fig. 5).

Fig. 10: Nbr of BLP vs. non inertial delay

In all cases, experiments showed that the resulting hardware can run at the maximum clock rate of the emulator (10 MHz). This is not a surprising result, given the small combinational part of the circuit.

Compared to the first implementation, using separate schedulers leads to a more efficient simulation considering that all the transactions can be stored into the local schedulers in one cycle. The evaluation of signals is also enhanced since the input of each gate is directly connected to the corresponding scheduler's output.

Another important factor is that the proposed technique preserves the in-circuit emulation features. Indeed, there is a direct relation between the emulator's clock period and the simulation's time step. We can then evaluate the performance in terms of speed-down compared to the real circuit rather than speedup compared to software simulation. If the time step is $p$ and the emulator frequency is $F_e$, then the speed-down is $1/(p F_e)$.
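As a worked example of this relation, the sketch below evaluates the speed-down for a time step of 2 ps and an emulator clock of 10 MHz. Both figures appear elsewhere in the text, but combining them here is only an illustration, and the function name `speed_down` is hypothetical.

```python
def speed_down(time_step_s, emulator_freq_hz):
    """Ratio between emulation time and the circuit time it represents.

    Each emulator cycle lasts 1 / emulator_freq_hz seconds and advances the
    simulated time by time_step_s seconds.
    """
    return 1.0 / (time_step_s * emulator_freq_hz)


ratio = speed_down(2e-12, 10e6)   # p = 2 ps, Fe = 10 MHz
print(f"speed-down: about {ratio:,.0f}x")
```

With these values, each 100 ns emulator cycle stands for 2 ps of circuit time, so the emulation is about 50,000 times slower than the real circuit.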

Obviously, the performance obtained must be weighed against the high cost of such a distributed scheduler in terms of FPGA gates. Typically, a timed emulation of a given circuit requires 100 to 500 times more BLPs than classic emulation.

6. Conclusion

The performance of simulation becomes critical as the complexity of VLSI circuits increases. Several techniques can be used to keep the time devoted to simulation within a reasonable limit. The highest speedup is obtained by using emulators. However, the timing behavior of the circuit is not preserved during the synthesis. In this paper, we have proposed a technique able to reproduce a static delay specified for each signal assignment. Two sequential hardware schemes have been proposed for each of the inertial and non inertial delays.

Although this method leads to a high-performance timed logic emulation that preserves the in-circuit features, the amount of hardware required may be considered excessive for a real-size design.

7. References

[1] S.P. Smith, M.R. Mercer and B. Brock - “BACKSIM: Demand Driven Simulation” - ACM/IEEE 24th DAC - pp. 181-187 - 1987

[2] C.J. Alpert and A.B. Khang - “Recent developments in Netlist Partitioning: A Survey” - Integration: The VLSI Journal - pp. 1-81 - 1995

[3] A.R. Newton and C. Kring - “A Cell-Replicating Approach to Mincut Based Circuit Partitioning” - International Conference on Computer Aided Design - pp. 2-5 - 1991

[4] K. Chandy and J. Mizra - “Asynchronous Distributed Simulation via a Sequence of Parallel Computations” - Communication of the ACM - Vol 24 - pp. 198-206 - 1981

[5] M.L. Bailey, J.V. Briner Jr. and R.D. Chamberlain - “Parallel Logic Simulation of VLSI Systems” - ACM Computing Surveys - Vol 26, No 3 - pp. 255-294 - Sep. 1994

[6] M. Abramovici, Y.H. Levendel and P.R. Menon - “A Logic Simulation Machine” - IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems - Vol CAD-2, No 2 - pp. 82-94 - 1983

[7] T. Sasaki, N. Koike, K. Ohmori and K. Tomita - “HAL: A Block Level Hardware Logic Simulator” - ACM/IEEE 20th DAC - pp. 150-156 - 1983

[8] J. Varghese, M. Butts and J. Batcheller - “An Efficient Logic Emulation System” - IEEE Transactions on VLSI Systems - Vol 1, No 2 - pp. 171-174 - Jun. 1993

[9] IEEE - “IEEE Standard VHDL Language Reference Manual” - IEEE 1076 - 1987

[10] L. Vuillemin and P. Bazargan Sabet - “Timed Simulation of VLSI Circuit Using A FPGA Net” - IASTED International Applied Informatics - pp. 218-222 - 2000
