



EFFICIENT PARALLEL CIRCUIT SIMULATION USING BOUNDED-CHAOTIC RELAXATION

D.J. Zukowski T.A. Johnson

IBM Thomas J. Watson Research Center

Yorktown Heights, N.Y.

Abstract

WR-V256, an experimental waveform-relaxation-based parallel circuit simulator for the Victor V256 distributed-memory parallel machine, was used to study performance trade-offs between Gauss-Seidel and bounded-chaotic relaxation algorithms. Several sub-circuit scheduling alternatives within the bounded-chaotic framework were also investigated. We have exercised our simulator on a suite of circuits ranging from 16,000 to over 93,000 FETs. Several of the circuits were extracted directly from a 16 Mbit DRAM memory design.

1. Introduction

Throughout the late 80's, much research was undertaken to determine how best to use a parallel processor to speed up circuit simulation. A general consensus has evolved that relaxation methods[1, 2], useful for digital MOS circuits, are the best choice for highly parallel machines. One of the central issues of parallel waveform-relaxation-based simulators for such circuits is the choice of relaxation algorithm. A Gauss-Seidel algorithm generally minimizes the number of waveform iterations but often limits parallel behavior for highly parallel machines. A Gauss-Jacobi algorithm is capable of exploiting parallelism but generally increases the number of iterations needed to achieve the same accuracy as Gauss-Seidel. For those circuits with sufficient inherent parallelism, the execution time for Gauss-Jacobi may actually be longer due to the additional iterations required by the algorithm.
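The distinction can be made concrete with a small sketch. The fragment below is illustrative only and is not WR-V256 code; the `solve` routine, the sub-circuit objects, and the waveform dictionary are assumed placeholders. It shows why one Gauss-Seidel sweep serializes sub-circuits along each dependency chain while one Gauss-Jacobi sweep exposes every sub-circuit at once.

```python
# Illustrative sketch (not WR-V256 code): one waveform-relaxation sweep under
# Gauss-Seidel vs. Gauss-Jacobi ordering.  solve(sub, inputs) is a hypothetical
# routine that re-solves one sub-circuit from its input waveforms and returns
# its output waveform; `subs` is assumed to be in levelized (Gauss-Seidel) order.

def gauss_seidel_sweep(subs, waveforms, solve):
    # Each sub-circuit immediately sees waveforms produced earlier in this same
    # sweep, which minimizes iterations but serializes dependent sub-circuits.
    for sub in subs:
        waveforms[sub.name] = solve(sub, {i: waveforms[i] for i in sub.inputs})
    return waveforms

def gauss_jacobi_sweep(subs, waveforms, solve):
    # Every sub-circuit reads only the previous sweep's waveforms, so all of
    # them could be solved in parallel, usually at the cost of more sweeps.
    previous = dict(waveforms)
    for sub in subs:
        waveforms[sub.name] = solve(sub, {i: previous[i] for i in sub.inputs})
    return waveforms
```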

Most of the research on parallel relaxation-based simulators since 1984 has centered on three areas of study: 1) relaxation methods (WR, WRN, ITA, etc.)[3-5], 2) relaxation algorithms (Gauss-Seidel, Gauss-Jacobi, etc.)[4-8], and 3) pipelining of data and iterations[6, 9]. Many of these earlier studies have been limited to small circuits with limited parallelism. Currently, with the advent of larger and more easily programmable machines, much of this research is being extended. This investigation is intended as one of those extensions, and is targeted towards algorithms for the full waveform-relaxation method.

In this paper, we use the parallel circuit simulator WR-V256, developed for the Victor V256 distributed-memory parallel processor [10], to investigate the impact of a compromise relaxation algorithm based on chaotic relaxation. Both Victor and WR-V256 are summarized in Section 2. WR-V256 uses a pure data-driven paradigm to handle communication of waveforms among sub-circuits. The scheduling of sub-circuits is determined in part by the arrival of updated waveforms. In addition, three sub-circuit scheduling alternatives have been implemented to enable chaotic relaxation, i.e. to allow a sub-circuit with old waveforms to run should no sub-circuit have all of its inputs updated.

Production circuits ranging in size from 16,000 to 93,000 FETs have been analyzed with the three variants of chaotic relaxation and with Gauss-Seidel relaxation. These circuits, presented in Section 3, were analyzed with up to 256 processing nodes and represent larger experiments than have generally been published before. The results of these experiments are given in Section 4. Section 5 summarizes the main conclusions of this investigation.

2. The WR-V256 Simulator

TOGGLE, an experimental serial waveform-relaxation circuit simulator[11], was ported to the Victor V256 distributed-memory MIMD machine. Some extensions to the TOGGLE port were then added to increase the robustness of parallel behavior for a larger class of circuits. These extensions provided the basis of the experiments. For full details of WR-V256, see [12].

Victor V256 interconnects 256 processor nodes in a 16x16 mesh and connects the top and bottom rows of the mesh with a row of 16 disks. A Victor processor node consists of one transputer with 4 Mbytes of DRAM. A Victor disk node consists of one transputer with 1 Mbyte of SRAM and a SCSI interface to either a 300 or a 600 Mbyte disk. (A transputer is an INMOS microprocessor that integrates serial communication links, i.e. channels, into its architecture and that has hardware support for processes.) Victor can be partitioned to allow up to 4 concurrent independent users. A total of 1 GByte of main memory and 10 GBytes of disk storage are available.

TOGGLE is highly optimized for its serial environment. An efficient Gauss-Seidel algorithm is used that minimizes the number of iterations to around 5 for typical circuits. In addition, it reduces the amount of work per iteration because it enables the analysis of sub-circuits that have converged to be skipped unless one or more of their inputs change. The Gauss-Seidel algorithm is enforced by leveling the sub-circuits and ordering them, by level, on a centralized dispatch queue. The leveling is determined by the interconnections among the sub-circuits.
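As an illustration of the leveling step, the sketch below assigns each sub-circuit a level from its fan-in. This is our own minimal reconstruction, not TOGGLE source: it assumes feedback has already been merged or cut so that the sub-circuit graph is acyclic, and the names `levelize` and `fanin` are invented for the example.

```python
from collections import deque

# Illustrative leveling sketch: a sub-circuit's level is one more than the
# deepest of its drivers, so solving level by level respects data dependencies.
# fanin[s] is the set of sub-circuits driving s; the graph is assumed acyclic.
def levelize(subs, fanin):
    fanout = {s: [] for s in subs}
    remaining = {s: len(fanin[s]) for s in subs}
    for s in subs:
        for d in fanin[s]:
            fanout[d].append(s)
    ready = deque(s for s in subs if remaining[s] == 0)   # primary-input sub-circuits
    level = {}
    while ready:
        s = ready.popleft()
        level[s] = 1 + max((level[d] for d in fanin[s]), default=0)
        for t in fanout[s]:
            remaining[t] -= 1
            if remaining[t] == 0:
                ready.append(t)
    return level
```

A serial dispatch queue can then simply be ordered by the returned level.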

Sub-circuits and their referenced circuit nodes are distributed statically across all Victor nodes to reduce communication and to approximate uniform load balance. Communication is reduced by use of a method similar to partitioning by element strings[13]. Sequential chains of sub-circuits are generated based on sub-circuit interconnections, i.e. all sub-circuits in a chain are connected, and a chain contains at most one sub-circuit per level. Therefore, only one sub-circuit per chain is capable of running at a time. Chains may have a maximum length equal to the number of levels, unless a chain becomes too large to fit in the memory of a single processor. When this occurs, chains are cut to enable distribution across more than one processor. Typically, multiple chains can be allocated to each processor. Chains are grouped together such that the FET count per group is reasonably uniform. One group is then allocated to each Victor node. Note that an equal number of FETs per group does not guarantee that all groups will require the same CPU time. This is a first-order approximation based on observed behavior, which indicates that model evaluation dominates solution time for small sub-circuits. For large sub-circuits the effectiveness of FET count decreases, since analysis time could be dominated by matrix solution. While dynamic allocation is possible, at this time any added performance does not appear to warrant the added complexity it introduces.
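A minimal sketch of the grouping step follows. It uses a standard greedy balancing heuristic (largest chain to the currently lightest group); the function name `group_chains` and the exact rule are illustrative assumptions, not a description of the WR-V256 implementation.

```python
import heapq

# Illustrative static-allocation sketch: chains (each with a FET count) are
# grouped so that the total FET count per group is roughly uniform, with one
# group per Victor node.  This is a generic greedy bin-balancing heuristic.
def group_chains(chains, num_nodes):
    """chains: list of (chain_id, fet_count); returns one list of chain ids per node."""
    groups = [[] for _ in range(num_nodes)]
    heap = [(0, i) for i in range(num_nodes)]        # (current FET load, group index)
    heapq.heapify(heap)
    for chain_id, fets in sorted(chains, key=lambda c: c[1], reverse=True):
        load, idx = heapq.heappop(heap)               # lightest group so far
        groups[idx].append(chain_id)
        heapq.heappush(heap, (load + fets, idx))
    return groups
```

Balancing FET count this way is only the first-order proxy described above; it ignores the matrix-solution cost that dominates for very large sub-circuits.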

One complication of implementing a Gauss-Seidel algorithm efficiently for a distributed-memory machine is that the centralized dispatch queue that enforces Gauss-Seidel ordering must be split into many independent queues. Maintaining a single queue would introduce a bottleneck, especially for highly parallel machines. Therefore, the sub-circuit dispatch queue has been distributed to reflect the static loading described above. Each Victor node may have sub-circuits from every level, and the data dependencies that were enforced by the queue in the serial TOGGLE are now lost. A data-driven approach is used to handle interactions among sub-circuits. As new waveforms are generated, they are sent to all of the Victor nodes that contain sub-circuits that reference them. A scheduling method chooses which sub-circuit on a Victor node to analyze next. There are four scheduling methods supported by WR-V256. One implements true Gauss-Seidel, and therefore gives numerical behavior identical to the serial version. The other three implement different chaotic relaxations.
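The skeleton below sketches how such a data-driven node loop might look. It is a simplified reconstruction under assumed primitives: `recv_waveforms`, `select_next`, `solve_and_broadcast`, and `converged` are hypothetical stand-ins for the message handling, the chosen scheduling method, the sub-circuit analysis, and the global convergence test.

```python
# Skeleton of a per-node, data-driven scheduling loop (a sketch under assumed
# primitives, not WR-V256 source).  Arriving waveforms mark which inputs are
# "known" for the current iteration; a scheduling method then picks the next
# local sub-circuit to analyze and its new outputs are broadcast to all nodes
# that reference them.
def node_scheduler(local_subs, recv_waveforms, select_next,
                   solve_and_broadcast, converged):
    pending = set(local_subs)                     # not yet solved this iteration
    known = {s: set() for s in local_subs}        # inputs updated this iteration
    while not converged():
        for wf in recv_waveforms():               # drain waveform messages
            for s in local_subs:
                if wf.name in s.inputs:
                    known[s].add(wf.name)
        sub = select_next(pending, known)         # one of the four methods
        if sub is not None:
            solve_and_broadcast(sub)
            pending.discard(sub)
        if not pending:                           # start the next iteration
            pending = set(local_subs)
            known = {s: set() for s in local_subs}
```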

For chaotic relaxation, inputs may be used from the current or any previous iteration. To detect convergence, many copies of waveforms may be needed. Thus, a full implementation is costly, especially in its use of memory. However, an implementation with more reasonable memory use is possible as long as some bound is placed on the number of previous iterations from which inputs may be drawn. We have implemented a bounded-chaotic algorithm with a bound of one previous iteration. In other words, it may act like a Gauss-Seidel algorithm, a Gauss-Jacobi algorithm, or something in between, depending on the order of sub-circuit solution. Its memory requirements are the same as those of Gauss-Seidel. For each of the chaotic scheduling methods described below, all sub-circuits that can be solved Gauss-Seidel are analyzed first. Only when a processor runs out of these sub-circuits is one of the remaining sub-circuits selected. Additionally, the inputs of a sub-circuit must have been updated by at least one iteration before the sub-circuit is again analyzed. Thus, when an input waveform of a sub-circuit is available before the sub-circuit is analyzed, that input must be available before the sub-circuit is analyzed for all remaining iterations. The implication of this restriction is that an analysis will become more "Gauss-Seidel-like" as iterations continue. This additional restriction is sufficient to guard against false convergence, and is reasonable due to the typically small number of iterations. Note that the same convergence criterion has been used throughout. Each method is described below; a sketch of the three selection rules follows the list.

1. From the list of sub-circuits remaining to be solved, select the one with the highest percentage of "known" inputs. That is, solve the sub-circuit with the highest percentage of its inputs defined for the current iteration. There will be at least one input for which data will have to be used from the previous iteration. Ideally, the "known" inputs will dominate the solution, and the analysis will not suffer from the small percentage of data used from a previous iteration.

2. Select the sub-circuit that would have been solved next if all of the input data had been available for this iteration. If two or more sub-circuits occur at the same Gauss-Seidel level, then select from these the sub-circuit with the highest percentage of "known" input waveforms. This approach essentially maintains the Gauss-Seidel ordering even though the strict data dependencies are compromised.

3. Select the sub-circuit that would be solved last if Gauss-Seidel ordering were maintained. This approach selects a sub-circuit that has the least effect on other sub-circuits, in that it minimizes the propagation of "bad" data.
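The sketch below restates the three selection rules in code. The data model is an assumption made for the example: each pending sub-circuit `s` carries a Gauss-Seidel level `s.level` and an input list `s.inputs`, and `known[s]` is the set of inputs already updated for the current iteration.

```python
# Sketch of the three chaotic selection rules.  Sub-circuits with all inputs
# known are assumed to have been dispatched Gauss-Seidel already, so every
# candidate here is missing at least one current-iteration input.

def frac_known(s, known):
    return len(known[s]) / len(s.inputs) if s.inputs else 1.0

def method1(pending, known):
    # Highest percentage of "known" inputs.
    return max(pending, key=lambda s: frac_known(s, known))

def method2(pending, known):
    # Earliest Gauss-Seidel level, ties broken by percentage of known inputs.
    return min(pending, key=lambda s: (s.level, -frac_known(s, known)))

def method3(pending, known):
    # Latest Gauss-Seidel level: least effect on downstream sub-circuits.
    return max(pending, key=lambda s: s.level)
```

Method 2 stays as close to Gauss-Seidel ordering as the available data allows, while method 3 confines the use of stale data to the sub-circuits with the fewest dependents.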

3. Test Cases

To test the effect of the relaxation algorithms and the finer-grained scheduling methods described earlier, we chose the five circuits shown in Table 1 and have run them on 64, 128, and/or 256 Victor nodes. The BLOK4, ECC, QUAD, and SPINE circuits are parts of a 16 Mbit DRAM. The ALU1K circuit was developed to test the original serial TOGGLE code, and was never part of a product. Of the test cases, only ALU1K is free of feedback. BLOK4 has only short feedback paths that are merged into single sub-circuits. The other three contain both merged and cut feedback paths.


In addition to the size data given in Table 1, we have used a circuit's parallel signature to better understand some of the observed behaviors. A parallel signature indicates the sustainable amount of parallelism inherent in a circuit for Gauss-Seidel relaxation. Parallel signatures, like those shown in Figures 1-4, are determined by the interconnection among sub-circuits. The parallel signature for QUAD (not shown) is similar in shape to that of ECC. The x-axis represents the maximum number of Gauss-Seidel levels in a circuit. The y-axis shows the average number of sub-circuits per Victor node that are assigned to each level. Rather than assign a sub-circuit to only one level, e.g. the earliest level in which it could run for Gauss-Seidel relaxation, the parallel signatures spread each sub-circuit across all of the levels in which it could run without causing other sub-circuits to wait. One drawback of the signature, in this form, is that it contains no information about the relative compute requirements of each sub-circuit. The information for the circuits given in Table 1 and shown in the figures provides the foundation to better understand the experimental results.
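One way to compute such a signature is sketched below. This is our reading of the construction, not code from the paper: it assumes the earliest and latest Gauss-Seidel levels of each sub-circuit (ASAP/ALAP, here the hypothetical `earliest` and `latest` maps) are already known, and it spreads each sub-circuit evenly across that window so the per-level totals are preserved.

```python
# Sketch of a parallel-signature computation.  earliest[s] is the first level
# at which sub-circuit s could run; latest[s] is the last level it could occupy
# without making any other sub-circuit wait.  Each sub-circuit contributes a
# fractional weight to every level in its window, and the per-level totals are
# averaged over the number of Victor nodes (the y-axis of Figures 1-4).
def parallel_signature(subs, earliest, latest, num_levels, num_nodes):
    per_level = [0.0] * (num_levels + 1)          # indexed by level, 1..num_levels
    for s in subs:
        window = range(earliest[s], latest[s] + 1)
        weight = 1.0 / len(window)                # preserves the total sub-circuit count
        for lvl in window:
            per_level[lvl] += weight
    return [x / num_nodes for x in per_level[1:]]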



Figure 1. ALU1K parallel signature.

Figure 2. ECC parallel signature.

Figure 3. SPINE parallel signature.

Figure 4. BLOK4 parallel signature.

4. Results

In general, one would expect circuits to run more efficiently Gauss-Seidel when they have parallel signatures that display a sustained number of sub-circuits per level per Victor node greater than 1.0. In such cases, all processors should have enough work that can be solved Gauss-Seidel for the entire analysis. The ALU1K circuit has such a parallel signature, and its performance, regardless of the scheduling method, is essentially like that of Gauss-Seidel relaxation. The ECC, QUAD, and SPINE circuits suggest different parallel behavior. Circuits ECC and QUAD both exhibit relatively high parallel behavior at the beginning of an analysis (early Gauss-Seidel levels), and then show significant sequential dependencies for the majority of the remaining levels. In addition, both test cases contain a large number of levels. Gauss-Seidel serialization forces many of the processors to be idle during the sequential period, which could be a large percentage of the total analysis time. Hence, the chaotic methods out-perform Gauss-Seidel for these circuits.

Table 2 presents the run times, in seconds, for all of the jobs, and also shows these times relative to the corresponding Gauss-Seidel times to better understand the Gauss-Seidel vs. chaotic behaviors. We show improvements over Gauss-Seidel for circuits with insufficient parallelism similar to those shown by [6], without resorting to a full Gauss-Jacobi implementation. Therefore, the bounded-chaotic algorithm appears to allow a similar amount of parallel execution to that of Gauss-Jacobi for a circuit/machine mix with little Gauss-Seidel parallelism. On the other hand, for those circuits that exhibit exceptional parallel Gauss-Seidel behavior, and that would therefore only suffer with Gauss-Jacobi analysis due to extra iterations, we show near Gauss-Seidel performance. Note that program behavior among the runs, even those of the same circuit and sub-circuit scheduling method, is different. Therefore, the traditional speedup numbers typically seen for parallel machines are not appropriate here. Apparent anomalies, like the "super-linear" speedup seen between the 64 and 128 processor QUAD runs, both with scheduling method 3, are due to the inherent randomness of the bounded-chaotic algorithm.

The parallel signature of SPINE is considerably different from those discussed previously. Unlike the others, which started with a high degree of parallelism, SPINE shows very little parallel behavior during the initial levels. At about the halfway point, more parallel behavior is suggested. This implies that sub-circuits are more likely to be solved Gauss-Seidel as the analysis progresses, especially for the 64 and 128 processor runs. However, the non-Gauss-Seidel relaxation at the beginning of the analysis could propagate incorrect waveforms that may need many additional iterations to correct. With 256 processors, the level of sustained parallelism is small.




Table 2. Test case performance data. Run time: raw (CPU seconds) and normalized to Gauss-Seidel.

Circuit (nodes)   Gauss-Seidel   Method 1        Method 2        Method 3
                                 time    rel.    time    rel.    time    rel.
SPINE (256)           1111       1078    0.97     898    0.81    1063    0.96
SPINE (128)           1168       1895    1.62    1468    1.26     912    0.78
SPINE (64)            1591       1519    0.95    2039    1.28    1395    0.88
BLOK4 (256)            815       1018    1.25    1035    1.27     990    1.21
BLOK4 (128)           1603       1824    1.14    1790    1.12    1754    1.09

The BLOK4 circuit appears to contradict the use of parallel signatures to predict parallel performance. BLOK4 is a pathological circuit that exhibits very uneven load balance. The sub-circuits in this problem range in size from 1 FET and 1 node to 1148 FETs and 157 nodes (of which there are 139). Also, even though the BLOK4 circuit has the largest FET count, it contains the fewest sub-circuits. Any increase in parallelism is small when compared to the more than threefold increase in the number of iterations when chaotic relaxation is used. Some additional partitioning/load-balancing work must be done for such a complex circuit.

5. Summary

In summary, the bounded-chaotic algorithm presented above offers a reasonable compromise for a relaxation algorithm used on parallel machines. It offers one way to attain both the numerical efficiency of Gauss-Seidel for larger jobs and the Gauss-Jacobi parallelism for smaller ones. At this point, there is no clear choice of sub-circuit scheduling method among the three that were investigated for the bounded-chaotic algorithm. Rather, the best schedule for a circuit appears to depend on many factors, including circuit topology and feedback. Further research like that begun in [8] is needed to increase the robustness of results for more complex circuits, especially those with a high degree of coupling among sub-circuits. In addition to the increased parallelism due to the loosening of Gauss-Seidel serialization, better overall efficiency is possible due to the better use of available memory. When more FETs are able to be analyzed, generally there is more that can be done in parallel regardless of the chosen relaxation algorithm, and hence parallel efficiency improves[12]. With both an ability to run more FETs and an algorithm that adjusts to limited parallelism, the bounded-chaotic algorithm is a powerful alternative for parallel circuit simulation.

References

[1] J. White and A.L. Sangiovanni-Vincentelli, Relaxation Techniques for the Simulation of VLSI Circuits, Kluwer Academic Publishers, 1986.

[2] A. Ruehli, Circuit Analysis, Simulation and Design, North-Holland, 1987.

[3] J.T. Deutsch and A.R. Newton, "MSPLICE: A Multiprocessor-based Circuit Simulator," Int. Conf. on Parallel Processing, pp. 207-214, May 1984.

[4] A. Sangiovanni-Vincentelli, Parallel Processing and Applications, Elsevier Science Publishers B.V. (North-Holland), 1988.

[5] R. Saleh, K. Gallivan, M. Chang, I. Hajj, D. Smart, and T. Trick, "Parallel Circuit Simulation on Supercomputers," Proceedings of the IEEE, vol. 77, no. 12, pp. 1915-1931, Dec 1989.

[6] D.W. Smart and T.N. Trick, "Waveform Relaxation on Parallel Processors," Int. Journ. of Circuit Theory and Appl., vol. 16, pp. 447-456, 1988.

[7] Y-C. Wen, K. Gallivan, and R. Saleh, "Parallel Event-Driven Waveform Relaxation," IEEE Int. Conf. on Computer Design: VLSI in Computers, pp. 101-104, Oct 1991.

[8] L. Peterson and S. Mattisson, "Circuit Simulation on a Hypercube," IEEE Int. Symp. on Circuits and Systems, pp. 1119-1122, 1988.

[9] P. Odent, L. Claesen, and H. De Man, "Acceleration of Relaxation-Based Circuit Simulation Using a Multiprocessor System," IEEE Trans. on Computer-Aided Design, vol. 9, no. 10, pp. 1063-1072, 1990.

[10] D.G. Shea, W.W. Wilcke, R.C. Booth, D.H. Brown, Z.D. Christidis, M.E. Giampapa, G.R. Irwin, T.T. Murakami, V.K. Naik, F.T. Tong, P.R. Varker, and D.J. Zukowski, "The IBM Victor V256 Multiprocessor," to appear in: IBM Journal of Research and Development, Nov 1991.

[11] T.J. LeBlanc, T.J. Cockerill, P.J. Ledak, H.Y. Hsieh, and A.E. Ruehli, "Recent Advances in Waveform Relaxation Based Circuit Simulation," IEEE Int. Conf. on Computer Design: VLSI in Computers, pp. 594-596, Oct 1985.

[12] T.A. Johnson and D.J. Zukowski, "Waveform-Relaxation-Based Circuit Simulation on the Victor V256 Parallel Processor," to appear in: IBM Journal of Research and Development, Nov 1991.

[13] Y.H. Levendel, P.R. Menon, and S.H. Patel, "Special Purpose Computer for Logic Simulation Using Distributed Processing," Bell System Technical Journal, vol. 61, no. 10, pp. 2873-2909, Dec 1982.
