

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Deterministic Random Walk: A New Preconditioner for Power Grid Analysis

Jia Wang, Member, IEEE, Xuanxing Xiong, Member, IEEE, and Xingwu Zheng

Abstract— Iterative linear equation solvers rely on high-quality preconditioners to achieve fast convergence. For sparse symmetric systems arising from large power grid analysis problems, however, preconditioners generated by traditional incomplete Cholesky factorization are usually of low quality, resulting in slow convergence. On the other hand, preconditioners generated by random walks are quite effective in reducing the number of iterations, though they require a considerable amount of time to compute in a stochastic manner. We propose in this paper a new preconditioning technique for power grid analysis, named deterministic random walk, that combines the advantages of the above two approaches. Our proposed algorithm computes the preconditioners in a deterministic manner to reduce computation time, while achieving quality similar to stochastic random walk preconditioning by modifying fill-ins to compensate for dropped entries. We have proved that with such a compensation scheme our algorithm will always succeed, which otherwise cannot be guaranteed by traditional incomplete factorizations. We demonstrate that by incorporating our proposed preconditioner, a conjugate gradient solver is able to outperform a state-of-the-art algebraic multigrid preconditioned solver for dc analysis, and is very efficient for transient simulation on public IBM power grid benchmarks.

Index Terms— Power grid analysis, preconditioned conjugate gradient, preconditioner, random walks.

I. INTRODUCTION

POWER supply noise is a critical issue that must be addressed in modern chip designs since it may affect important design metrics including timing and power consumption [5], [10], [18], [35]. Modeling devices as current sources and power distribution networks with parasitics as RLC grids, power grid analysis techniques [10], [28] compute power supply noises under given current excitations by solving one or more systems of linear equations [16], depending on whether the steady-state dc analysis or the time-domain transient simulation should be performed. The noises are verified against certain bounds that help to decompose power grid verification from timing verification [33], and the designers could take further actions if the outcomes are not satisfactory.

The key problem of power grid analysis can be formulated as the following sparse system of linear equations, where the

Manuscript received July 30, 2013; revised February 28, 2014 and August 29, 2014; accepted October 8, 2014. This work was supported by the Illinois Institute of Technology, Chicago, IL, USA.

J. Wang and X. Zheng are with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616 USA (e-mail: [email protected]; [email protected]).

X. Xiong was with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616 USA. He is now with the Design Group, Synopsys, Inc., Mountain View, CA 94538 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2014.2365409

noise vector x should be obtained given the system matrix A and the excitation vector b:

Ax = b. (1)

In this paper, we consider dc analysis and transient simulation when there is no mutual inductance, and thus assume A to be a sparse nonsingular symmetric diagonally dominant matrix with nonpositive off-diagonal elements [6]. Note that this also implies that A is symmetric positive definite (s.p.d.).

Despite extensive studies, it remains extremely challenging [29] to solve the above key problem to match the increasing complexity of power grids: a single system may contain more than one million unknown variables in recent public benchmarks [24], [25], [29], and capacitive couplings between VDD and GND grids and inductances in wires further complicate transient simulation. At such large problem sizes, direct solvers like Cholesky factorization [7], [14] are inefficient, and are prohibitive in memory usage without introducing a hierarchy [45]. The state-of-the-art research on power grid analysis can be separated into two groups. The efforts of the first group lead to efficient algorithms that can obtain approximated solutions by exploiting structures of the power grid, e.g., hierarchical analysis [45], multigrid [11], [20], [28], [36], [49], [50], random walks [23], [31], Poisson solver [34], frequency domain methods [17], [46], and model order reduction techniques [21], [22], [39]. The second group [6], [8], [12], [32], [37], [41], [42], [47], [48] utilizes iterative solvers, especially preconditioned conjugate gradient (PCG) as most systems are s.p.d., to achieve better solution accuracy.
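As a concrete, hypothetical instance of the kind of system in (1), the sketch below builds a small symmetric, diagonally dominant matrix with nonpositive off-diagonals, loosely modeled on a 5-node resistor chain with grounding conductances, and solves Ax = b with a plain (unpreconditioned) conjugate gradient loop. The matrix, its size, and the excitation are illustrative assumptions, not taken from the benchmarks discussed in this paper.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Plain conjugate gradient for an s.p.d. matrix A (no preconditioner)."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical 5-node chain: unit conductances between neighbors plus
# grounding conductances chosen so every diagonal entry is 3, making A
# symmetric, diagonally dominant, with nonpositive off-diagonals.
n = 5
A = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)             # unit current injected at every node
x = cg(A, b)
print(np.linalg.norm(A @ x - b))  # ~0
```

For an n-dimensional s.p.d. system, CG converges in at most n iterations in exact arithmetic; the preconditioners discussed next aim to cut the iteration count far below that for large grids.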

It is well known that the quality of the preconditioners has a huge impact on the performance of the iterative solvers. However, for large problems arising from power grid analysis, obtaining a good preconditioner remains very challenging. Among the traditional preconditioning techniques, including incomplete Cholesky (IC) [27], support graph [3], approximated inverse [2], and algebraic multigrid (AMG) methods [30], recently published results for the aforementioned benchmarks [25], [29] show that only an AMG–PCG solver [42] has achieved significant improvement, while other preconditioning techniques [8], [41], [47] are less effective, requiring a large number of iterations toward convergence.

On the other hand, the stochastic preconditioning technique [32] based on random walks is very effective in reducing the number of PCG iterations. As Qian and Sapatnekar [32] pointed out, the superior quality comes from the fact that the columns of the factor are computed independently so that errors will not accumulate as in traditional incomplete Cholesky (IC)/incomplete LU (ILU) preconditioners.

1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


However, since computing the columns independently requires performing Monte Carlo simulations on the power grid, even when the walks are extensively reused [32], it takes much more time than other techniques [19], [27], [47] that compute the factors in a deterministic manner. Stochastic random walk preconditioning is, therefore, not suitable when the system is only solved a few times, e.g., for dc analysis. Nevertheless, since the preconditioner itself is an approximated root-free Cholesky (LDLT) factorization of the system matrix, each PCG iteration is simpler than with AMG preconditioning [42], making it desirable to study new factorization-based preconditioners.

In this paper, we propose a new preconditioner, named deterministic random walk (DRW), that overcomes the above difficulty while achieving quality similar to stochastic random walk preconditioning. Our key observation is that if the columns of the factor are computed sequentially, a random walk can be decomposed into a few random walks whose probabilities are either easily known or already computed. This enables us to avoid most, if not all, Monte Carlo simulations, resulting in a deterministic algorithm that computes the approximated factor efficiently. As our algorithm will usually locate more entries in the factor than stochastic random walk, we propose to keep only the largest few in absolute value per column to maintain its sparsity, and to compensate for the dropped ones by decreasing the remaining ones. Such a compensation scheme distinguishes our preconditioner from various traditional IC/ILU preconditioners [15], [26], [27], where the off-diagonal elements of the factor are only allowed to be increased or dropped (increased to 0) to guarantee the correctness of the algorithm for the system matrix being a symmetric M-matrix. As a comparison, by leveraging random walks, we are able to prove the correctness of our algorithm when more flexible dropping and compensation schemes are adopted. In addition, our proposed preconditioner differs from hierarchical random walks [23], [31] in the sense that they explicitly approximate the possibly dense Schur complement after partial factorization by Monte Carlo simulations, while we implicitly factor the Schur complement without actually generating it first.

We demonstrate that a PCG solver using our proposed DRW preconditioner can efficiently explore the tradeoffs between the preconditioner size, the time to compute the preconditioner, and the time to solve a problem iteratively. Using a preconditioner size similar to that of the system matrix A, we reduce the time to compute the preconditioner without dramatically degrading its quality and obtain an efficient solver for dc analysis that is able to outperform the state-of-the-art AMG–PCG solver PowerRush [42]. Moreover, in comparison to stochastic random walk preconditioning, using a preconditioner of similar size, we reduce the time to compute it by 6.8× and the solver is 2.4× faster for IBM public power grid transient simulation benchmarks [24]. Finally, when more fill-ins are allowed for transient simulation, our iterative solver is able to complete all the six benchmarks within an hour, while a direct solver would need at least 20 min while requiring 5× the fill-ins.

The rest of this paper is organized as follows. Incomplete factorization and stochastic random walk preconditioning are

Fig. 1. (a) Matrix and its extended matrix graph. (b) Exact LDLT factorization and the probabilities/expectations of the random walks.

reviewed in Section II. Our proposed DRW preconditioner is presented in Section III. Implementation details are discussed in Section IV. After experimental results are reported in Section V, Section VI concludes this paper.

II. PRELIMINARIES

A. Matrices and Graphs

For an n × 1 vector x, we denote its ith element by x_i for 1 ≤ i ≤ n, and a vector obtained from its ith to jth elements by x_{i:j}. For an n × n matrix X, we denote its element on the row i and the column j by x_{i,j} for 1 ≤ i ≤ n and 1 ≤ j ≤ n. The submatrix of X spanning the rows i1 to i2 and the columns j1 to j2 is denoted by X_{i1:i2, j1:j2}. We opt to omit :i2 or :j2 if i1 = i2 or j1 = j2, respectively.

We define D(X) to be the diagonal of X, i.e., diag(x_{1,1}, x_{2,2}, ..., x_{n,n}). We say that X is column-wise diagonally dominant if ∀k, x_{k,k} ≥ Σ_{i≠k} |x_{i,k}|, and that X is unit lower triangular if ∀k, x_{k,k} = 1 and X_{1:k−1,k} = 0.

We define the extended matrix graph of X to be a weighted directed graph G*_X = (V*_X, E*_X, w*_X). The vertex set V*_X consists of the vertices 1, 2, ..., n corresponding to the rows and columns of X¹ and an additional vertex n+1. For any x_{i,j} ≠ 0 where i ≠ j, an edge (j, i) is introduced to E*_X with the weight w*_X(j, i) = −x_{i,j}. For any k such that the column sum x_{k,k} + Σ_{i≠k} x_{i,k} is not zero, an edge (k, n+1) is introduced with the weight w*_X(k, n+1) = x_{k,k} + Σ_{i≠k} x_{i,k}. Similarly, an edge (n+1, k) is introduced with the weight w*_X(n+1, k) = x_{k,k} + Σ_{j≠k} x_{k,j} if the weight is not zero. An example system matrix A and its extended matrix graph are shown in Fig. 1(a).
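The construction above can be sketched directly from the definition. The helper and the 3 × 3 matrix below are illustrative assumptions (the Fig. 1 example matrix is not reproduced numerically in this text); the weight of (k, n+1) is the column sum x_{k,k} + Σ_{i≠k} x_{i,k}, and that of (n+1, k) is the corresponding row sum.

```python
import numpy as np

def extended_matrix_graph(X):
    """Edges of the extended matrix graph G*_X as a dict (u, v) -> weight.

    Vertices 1..n correspond to the rows/columns of X; vertex n+1 is the
    additional vertex.  Per the definition in the text, edge (j, i)
    carries -x_{i,j}, while (k, n+1) and (n+1, k) carry the column and
    row sums of X, introduced only when nonzero.
    """
    n = X.shape[0]
    E = {}
    for i in range(n):
        for j in range(n):
            if i != j and X[i, j] != 0:
                E[(j + 1, i + 1)] = -X[i, j]   # weight -x_{i,j}
    for k in range(n):
        col = X[:, k].sum()                    # x_{k,k} + sum_{i!=k} x_{i,k}
        if col != 0:
            E[(k + 1, n + 1)] = col
        row = X[k, :].sum()
        if row != 0:
            E[(n + 1, k + 1)] = row
    return E

# Hypothetical 3x3 system matrix (not the Fig. 1 example).
A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 3.0, -1.0],
              [0.0, -1.0, 2.0]])
E = extended_matrix_graph(A)
print(E[(1, 2)], E[(1, 4)])  # 1.0 1.0
```

For a matrix with nonpositive off-diagonals, all edge weights produced this way are nonnegative, which is what allows them to be read as transition probabilities after the normalization of Section II-C.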

B. Incomplete LDLT Factorization

We review incomplete Cholesky factorization in its root-free form, i.e., incomplete LDLT factorization, because of its close relationship to the random walk-based preconditioning techniques studied in [32] and in this paper.

Let n be the dimension of the system matrix A. Since A is s.p.d., it is well known [14] that there exists a unit lower triangular matrix L̂ and a diagonal matrix D̂ with all diagonal elements being positive such that A = L̂D̂L̂ᵀ. Since A is also an M-matrix,² its incomplete LDLT factorization does

¹ Our definition of G*_X implicitly depends on the matrix ordering of X.

² An M-matrix is a matrix where the elements in its inverse are all nonnegative.


exist [27] and can be computed iteratively for k = 1, 2, ..., n as follows:

d_{k,k} = a_{k,k} − Σ_{j=1}^{k−1} l_{k,j} d_{j,j} l_{k,j}    (2)

l_{i,k} = 0 or (1/d_{k,k}) (a_{i,k} − Σ_{j=1}^{k−1} l_{k,j} d_{j,j} l_{i,j}),  ∀k < i ≤ n.    (3)

The choices in (3) are determined by the dropping scheme; see [19]. According to [26], the relationship between the exact and the incomplete LDLT factorization is

d_{k,k} ≥ d̂_{k,k} > 0,  0 ≥ l_{i,k} ≥ l̂_{i,k},  0 ≥ d_{k,k} l_{i,k} ≥ d̂_{k,k} l̂_{i,k}.    (4)

Notably, the first inequality ensures the correctness of the algorithm. However, the above inequalities also indicate that the error due to dropped entries will accumulate and result in a systematic difference between the exact and the incomplete factorization, which will cause incomplete Cholesky preconditioners to perform unfavorably in a PCG solver. While one may come up with an intuitive idea to correct such differences by decreasing both d_{k,k} and l_{i,k} during factorization for compensation, we need to point out that attempting to do so may break the guarantee that the computation can proceed (d_{k,k} ≠ 0) and the guarantee that the obtained factorization will remain s.p.d. (d_{k,k} > 0), as required by any PCG solver.
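For reference, the recurrences (2) and (3) can be sketched as follows; with the identity choice in (3) (no dropping) the loop reproduces the exact factorization. The example matrix is an assumption for illustration.

```python
import numpy as np

def ldlt(A, drop=lambda i, k, val: val):
    """Root-free Cholesky via recurrences (2) and (3).

    With the identity `drop` function this is the exact LDL^T
    factorization; a dropping scheme would return 0 for some (i, k).
    """
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        d[k] = A[k, k] - sum(L[k, j] ** 2 * d[j] for j in range(k))       # (2)
        for i in range(k + 1, n):
            val = (A[i, k] - sum(L[k, j] * d[j] * L[i, j]
                                 for j in range(k))) / d[k]
            L[i, k] = drop(i, k, val)                                     # (3)
    return L, d

# Assumed example: a small symmetric diagonally dominant M-matrix.
A = np.array([[3.0, -1.0, -1.0, 0.0],
              [-1.0, 3.0, 0.0, -1.0],
              [-1.0, 0.0, 3.0, -1.0],
              [0.0, -1.0, -1.0, 3.0]])
L, d = ldlt(A)
print(np.allclose(L @ np.diag(d) @ L.T, A))  # True
```

Passing a `drop` function that zeroes small entries turns the same loop into an incomplete factorization, and the accumulation effect described by (4) can then be observed by comparing the two outputs.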

C. Stochastic Random Walk Preconditioning

Consider the matrix T = A D(A)^{−1}, i.e., the matrix obtained from A by dividing its column k by a_{k,k} for every k. A random walk game can be defined in the extended matrix graph G*_T treating the edge weights as the transition probabilities. For 1 ≤ k < i ≤ n+1, let k ≤k⇝ i be the set of the random walks from k to i whose internal vertices, if there are any, are all no more than k, and denote their probability by Pr[k ≤k⇝ i]. Consider all such random walks from a particular k, i.e., ⋃_{i=k+1}^{n+1} k ≤k⇝ i. Since A is symmetric nonsingular, n+1 is reachable from k. It implies that Σ_{i=k+1}^{n+1} Pr[k ≤k⇝ i] = 1 and that the expectation E_k of the number of times that one such random walk passes k is finite.

The above definition of random walks is adopted from [32] with the modification that the vertices are numbered reversely: the vertices 1, 2, ..., n as defined above are the same set of vertices defined as n, ..., 2, 1 in [32]. While the conclusion of [32] states that the outcomes of their random walks are the entries in the LDLT factorization of the matrix obtained by reversing A's columns and rows, this modification allows us to rewrite the conclusion as that the outcomes of the above random walks are the entries in the LDLT factorization of A. To be more specific, we have

d̂_{k,k} = a_{k,k}/E_k,  ∀1 ≤ k ≤ n    (5)

l̂_{i,k} = −Pr[k ≤k⇝ i],  ∀1 ≤ k < i ≤ n.    (6)

The factorization and the probabilities and the expectations of the system shown in Fig. 1(a) are listed in Fig. 1(b). Note that (6) can be extended to all 1 ≤ k < i ≤ n+1 if we define l̂_{n+1,k} = −Σ_{i=1}^{n} l̂_{i,k}.

Equations (5) and (6) motivate one to obtain an approximated LDLT factorization of A via Monte Carlo simulations. When L and D are compared with the outcome of incomplete LDLT factorization as described in (4), an immediate advantage is that there is no systematic difference, since estimations of the probabilities and expectations could be either larger or smaller than their exact values, and there is no error accumulation.

However, the fundamental difficulty of stochastic random walk preconditioning is that the length of a random walk cannot be bounded, and thus the algorithm only terminates in a probabilistic sense even when one bounds the number of Monte Carlo simulations. For example, for the system shown in Fig. 1(a), a random walk starting from 4 could pass the edges 1 → 2 and 2 → 1 an unbounded number of times before finally reaching 5. In the extreme case when there are traps in the system [31], e.g., when the weights of the edges 1 → 2 and 2 → 1 are much larger compared with others, it will take excessive time to compute the preconditioner. To overcome this difficulty, one may partition the system [45] and use hierarchical random walks [23], [31] to generate a macromodel of the global grid before applying stochastic random walk preconditioning. However, since the macromodel essentially is an approximation of the Schur complement after partial factorization, computing and storing it explicitly could be very costly. As a comparison, our proposed preconditioner utilizes the key observation that if the columns of the factor are computed sequentially, a random walk can be decomposed into a few random walks whose probabilities are either easily known or already computed, and thus circumvents the difficulty of walking the trap directly to approximate the probability or of generating a macromodel that could be dense. Details follow.
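To make the stochastic estimation of (5) and (6) concrete, the sketch below simulates walks in G*_T for an assumed 4-node matrix (the same assumed example as above, not the Fig. 1 system): each walk starts from a vertex, moves with the transition probabilities of T = A D(A)^{−1}, and stops at the first vertex with a larger index. For this particular matrix, the exact factorization gives d̂_{2,2} = 8/3 and l̂_{3,2} = −1/8, so the estimates should land near those values.

```python
import random
import numpy as np

# Assumed 4-node example; index n below stands for the vertex n+1.
A = np.array([[3.0, -1.0, -1.0, 0.0],
              [-1.0, 3.0, 0.0, -1.0],
              [-1.0, 0.0, 3.0, -1.0],
              [0.0, -1.0, -1.0, 3.0]])
n = 4

def step(v, rng):
    """One transition from vertex v in G*_T, T = A D(A)^{-1}."""
    w = [-A[i, v] / A[v, v] if i != v else 0.0 for i in range(n)]
    w.append(1.0 - sum(w))          # weight of the edge to vertex n+1
    return rng.choices(range(n + 1), weights=w)[0]

def walk(k, rng):
    """Walk from k until the first vertex i > k; count visits to k."""
    visits, v = 1, k
    while True:
        v = step(v, rng)
        if v > k:
            return v, visits
        if v == k:
            visits += 1

rng = random.Random(7)
k, trials = 1, 100000           # 0-indexed: column 2 of the factor
hits, total_visits = {}, 0
for _ in range(trials):
    i, c = walk(k, rng)
    hits[i] = hits.get(i, 0) + 1
    total_visits += c

# (5): d_{2,2} = a_{2,2}/E_2 ;  (6): l_{3,2} = -Pr[2 <=2~> 3]
E2 = total_visits / trials
print(A[k, k] / E2, -hits.get(2, 0) / trials)  # ~ 8/3 and ~ -1/8
```

Note that `walk` has no a priori bound on its running time: this is exactly the termination-in-probability issue, and the trap scenario described above makes the expected walk length blow up.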

III. PRECONDITIONING WITH DETERMINISTIC RANDOM WALKS

According to (4)–(6), both incomplete LDLT factorization and stochastic random walks preserve the following structures of the exact D̂ and L̂ in their preconditioners D and L.

P1: The off-diagonal elements of L are nonpositive and L is column-wise diagonally dominant.

P2: An off-diagonal element of L is zero if the corresponding element of L̂ is zero.

P3: The diagonal elements of D are positive.

As we are interested in designing a new preconditioning technique to exploit the advantages of both approaches, it is desirable for our preconditioner to meet all the above conditions from P1 to P3.
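For the assumed example matrix used in these sketches, P1 and P3 can be checked numerically on the exact factor (P2 is a statement about the sparsity pattern and is trivially met by the exact factor itself):

```python
import numpy as np

# Assumed example matrix (symmetric, diagonally dominant, M-matrix).
A = np.array([[3.0, -1.0, -1.0, 0.0],
              [-1.0, 3.0, 0.0, -1.0],
              [-1.0, 0.0, 3.0, -1.0],
              [0.0, -1.0, -1.0, 3.0]])
n = A.shape[0]

# Exact LDL^T via numpy: A = C C^T, then D = diag(C)^2 and L = C/diag(C).
C = np.linalg.cholesky(A)
d = np.diag(C) ** 2
L = C / np.diag(C)                 # scale each column to a unit diagonal

off = L - np.eye(n)
print(np.all(off <= 1e-12))                             # P1: nonpositive off-diagonals
print(np.all(np.abs(off).sum(axis=0) <= 1.0 + 1e-12))   # P1: column-wise dominance
print(np.all(d > 0))                                    # P3: positive diagonal of D
```

All three checks print True for this matrix; the theory cited in Section II guarantees the same for any symmetric M-matrix.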

A. Structure of Random Walks

Our first step toward a better preconditioner than stochastic random walks is to circumvent the difficulty in bounding the lengths of the random walks. We introduce a new group of random walks defined as follows. For 1 ≤ k < i ≤ n+1, let k <k⇝ i be the set of the random walks from k to i whose internal vertices, if there are any, are all less than k, and let k <k⇝ k be the set of the random walks from k to k whose internal vertices are all less than k.³ As we will show later, the probabilities and the expectations in (5) and (6) can be easily calculated from the probabilities of this group of random walks. Moreover, if we assume the probabilities are computed iteratively for k = 1, 2, ..., n, then any random walk in k <k⇝ i can be decomposed into at most k random walks whose probabilities are already known, totally eliminating any computation that may need unbounded time.

To compose a longer random walk from shorter ones, let W and U be two sets of random walks and assume that all walks in W stop at and all walks in U start from the same vertex. We define W + U to be the set of the random walks obtained by concatenating every pair of the random walks, one from W and the other from U.

Consider a random walk r ∈ k ≤k⇝ i for 1 ≤ k < i ≤ n+1. If r passes k more than once, i.e., r ∉ k <k⇝ i, then we can uniquely decompose r into two walks r1 and r2 such that r1 ∈ k <k⇝ k and r2 ∈ k ≤k⇝ i by identifying the first time that r returns to k. For example, the random walk r = 3 → 1 → 3 → 5 in Fig. 1(a) is decomposed as r1 = 3 → 1 → 3 and r2 = 3 → 5. Therefore

k ≤k⇝ i = k <k⇝ i ∪ (k <k⇝ k + k ≤k⇝ i),  ∀1 ≤ k < i ≤ n+1.    (7)

Such decomposition leads to the following two lemmas.

Lemma 1: ∀1 ≤ k < i ≤ n+1, Pr[k ≤k⇝ i] = Pr[k <k⇝ i]/(1 − Pr[k <k⇝ k]).

Proof: Taking the probabilities of the random walks at both sides of (7), we have

Pr[k ≤k⇝ i] = Pr[k <k⇝ i] + Pr[k <k⇝ k] Pr[k ≤k⇝ i]

which can be rearranged as

Pr[k ≤k⇝ i](1 − Pr[k <k⇝ k]) = Pr[k <k⇝ i].

Therefore, it is sufficient to prove 1 − Pr[k <k⇝ k] > 0. Since A is symmetric nonsingular, n+1 is reachable from k. This implies that, first, any random walk from k either returns to k or reaches a vertex i > k eventually, i.e., 1 − Pr[k <k⇝ k] = Σ_{i=k+1}^{n+1} Pr[k <k⇝ i]; and second, Pr[k <k⇝ n+1] > 0. As all the probabilities should be nonnegative, we have

1 − Pr[k <k⇝ k] = Σ_{i=k+1}^{n+1} Pr[k <k⇝ i] ≥ Pr[k <k⇝ n+1] > 0

and thus the lemma holds.

Lemma 2: ∀1 ≤ k ≤ n, E_k = 1/(1 − Pr[k <k⇝ k]).

Proof: Taking a union at both sides of (7) for i = k+1, k+2, ..., n+1, we have

⋃_{i=k+1}^{n+1} k ≤k⇝ i = (⋃_{i=k+1}^{n+1} k <k⇝ i) ∪ (k <k⇝ k + ⋃_{i=k+1}^{n+1} k ≤k⇝ i).    (8)

Recall that E_k is the expectation of the number of times that a random walk belonging to the set at the LHS of (8) passes k. Thus, we can calculate E_k based on the decomposition at the RHS of (8).

For the set ⋃_{i=k+1}^{n+1} k <k⇝ i, each random walk will pass k exactly once, i.e., the conditional expectation is 1. On the other hand, for the set k <k⇝ k + ⋃_{i=k+1}^{n+1} k ≤k⇝ i, the conditional expectation is 1 + E_k. Therefore, (8) implies that

E_k = 1 · Σ_{i=k+1}^{n+1} Pr[k <k⇝ i] + (1 + E_k) Pr[k <k⇝ k].

The lemma thus holds since Σ_{i=k+1}^{n+1} Pr[k <k⇝ i] = 1 − Pr[k <k⇝ k] > 0, as shown in the proof of Lemma 1.

Lemmas 1 and 2, together with (5) and (6), show that one can factor A by computing Pr[k <k⇝ i] for all 1 ≤ k ≤ i ≤ n, though further decompositions are still necessary to eliminate Monte Carlo simulations.

³ Since there is no self-loop in the extended matrix graph, a random walk in k <k⇝ k must contain at least one internal vertex.

For a random walk in ⋃_{i=k}^{n+1} k <k⇝ i and a positive integer w that is no more than the number of the edges in the random walk, we define its w-step prefix to be the path consisting of its first w edges. It is clear by such definition that a random walk could be its own prefix but cannot be a prefix of other random walks. While many random walks can share the same prefix, the probability for a random walk to have a particular prefix equals the product of the transition probabilities along the edges of the prefix. Consider a set of prefixes N_k such that any random walk in ⋃_{i=k}^{n+1} k <k⇝ i has exactly one prefix belonging to N_k. Obviously, such a set N_k exists; for example, N_k could be the set of all the one-step prefixes, i.e., all the one-step walks starting from k. Let p be an (n+1) × 1 vector where p_j is the probability for the random walks to have a prefix in N_k ending at the vertex j for 1 ≤ j ≤ n+1. For example, for the above choice of N_3 as all the one-step prefixes {3 → 1, 3 → 5} in Fig. 1(a), we have p = (1/2, 0, 0, 0, 1/2).

For k ≤ i ≤ n+1, a random walk r in k <k⇝ i that is not itself a prefix in N_k can be uniquely decomposed into a prefix r1 ∈ N_k and a random walk r2, where r1 ends at and r2 starts from some vertex j < k ≤ i. For example, for the aforementioned choice of N_3, the random walk r = 3 → 1 → 2 → 1 → 2 → 1 → 4 in Fig. 1(a) is decomposed into r1 = 3 → 1 and r2 = 1 → 2 → 1 → 2 → 1 → 4. While the probability for the random walks to have a prefix in N_k ending at j can be obtained from p as p_j, the difficulty here is to compute the probability of all such r2's. Our idea is to further decompose r2 into the random walks whose probabilities are known from previous iterations.

Since r2 starts from j and stops at i > j, we may consider the first time that r2 reaches a vertex j2 > j. If j2 ≠ i, then it must be an internal vertex on r2. Therefore, j2 < k ≤ i and we may consider the first time that r2 reaches a vertex j3 > j2. Note that r2 must reach j2 before j3 since any internal vertices on r2 between j and j2 must be less than j and thus be less than j2. Clearly, we may continue such reasoning until r2 reaches its last vertex i. Formally, there exist certain 1 ≤ m < k and j = j1 < j2 < ··· < jm < k, such that we can decompose r2 into m walks, one each from j1 ≤j1⇝ j2, j2 ≤j2⇝ j3, ..., j_{m−1} ≤j_{m−1}⇝ j_m, and j_m ≤j_m⇝ i. For example, for k = 3 and i = 4, the random walk r2 = 1 → 2 → 1 → 2 → 1 → 4


in Fig. 1(a) is decomposed into m = 2 random walks 1 → 2 and 2 → 1 → 2 → 1 → 4 with j1 = 1 and j2 = 2. This decomposition leads to the following lemma.

Lemma 3: ∀1 ≤ k ≤ n and k ≤ i ≤ n+1

Pr[k <k⇝ i] = p_i − L̂_{i,1:k−1} L̂_{1:k−1,1:k−1}^{−1} p_{1:k−1}

where we define l̂_{n+1,k} = −Σ_{i=1}^{n} l̂_{i,k}.

Proof: For 1 ≤ j < k, let j <k⇝ i be the set of the random walks from j to i whose internal vertices, if there are any, are all less than k. Based on the decomposition of r into r1 and r2 for r ∉ N_k, we have

Pr[k <k⇝ i] = p_i + Σ_{j=1}^{k−1} p_j Pr[j <k⇝ i].    (9)

Based on the decomposition of $r_2$ into $m$ random walks and (6), we have
$$\begin{aligned}
\sum_{j=1}^{k-1} p_j \Pr[j \xrightarrow{<k} i]
&= \sum_{j_1=1}^{k-1} p_{j_1} \sum_{m=1}^{k-j_1} \sum_{j_1 < j_2 < \cdots < j_m < k} \left( \prod_{t=1}^{m-1} \Pr[j_t \xrightarrow{\le j_t} j_{t+1}] \right) \Pr[j_m \xrightarrow{\le j_m} i] \\
&= \sum_{j_1=1}^{k-1} p_{j_1} \sum_{m=1}^{k-j_1} \sum_{j_1 < j_2 < \cdots < j_m < k} \left( \prod_{t=1}^{m-1} \bigl(-\hat{l}_{j_{t+1},j_t}\bigr) \right) \bigl(-\hat{l}_{i,j_m}\bigr) \\
&= \sum_{m=1}^{k-1} \sum_{1 \le j_1 < j_2 < \cdots < j_m < k} \bigl(-\hat{l}_{i,j_m}\bigr) \left( \prod_{t=1}^{m-1} \bigl(-\hat{l}_{j_{t+1},j_t}\bigr) \right) p_{j_1} \\
&= \sum_{m=1}^{k-1} \bigl(-\hat{L}_{i,1:k-1}\bigr) \bigl(I - \hat{L}_{1:k-1,1:k-1}\bigr)^{m-1} p_{1:k-1} \\
&= -\hat{L}_{i,1:k-1} \left( \sum_{m=1}^{k-1} \bigl(I - \hat{L}_{1:k-1,1:k-1}\bigr)^{m-1} \right) p_{1:k-1}.
\end{aligned}$$

Since $I - \hat{L}_{1:k-1,1:k-1}$ is a $(k-1) \times (k-1)$ lower triangular matrix whose diagonal elements are all 0, we have
$$\sum_{m=1}^{k-1} \bigl(I - \hat{L}_{1:k-1,1:k-1}\bigr)^{m-1} = \hat{L}_{1:k-1,1:k-1}^{-1}$$
and thus the lemma holds according to (9).

As an example of Lemmas 1–3, consider the aforementioned

choice of $N_3$ being the set of all the one-step walks starting from the vertex 3 in Fig. 1(a). Then
$$\begin{aligned}
\Pr[3 \xrightarrow{<3} 4] &= p_4 + p_1 \Pr[1 \xrightarrow{<3} 4] + p_2 \Pr[2 \xrightarrow{<3} 4] \\
&= p_4 + p_1 \bigl( \Pr[1 \xrightarrow{\le 1} 4] + \Pr[1 \xrightarrow{\le 1} 2] \Pr[2 \xrightarrow{\le 2} 4] \bigr) + p_2 \Pr[2 \xrightarrow{\le 2} 4] \\
&= p_4 + \bigl( \Pr[1 \xrightarrow{\le 1} 4] \;\; \Pr[2 \xrightarrow{\le 2} 4] \bigr) \left( I + \begin{pmatrix} 0 & 0 \\ \Pr[1 \xrightarrow{\le 1} 2] & 0 \end{pmatrix} \right) \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \\
&= p_4 + \bigl( \Pr[1 \xrightarrow{\le 1} 4] \;\; \Pr[2 \xrightarrow{\le 2} 4] \bigr) \begin{pmatrix} 1 & 0 \\ -\Pr[1 \xrightarrow{\le 1} 2] & 1 \end{pmatrix}^{-1} \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \\
&= 0 + \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -\tfrac{1}{3} & 1 \end{pmatrix}^{-1} \begin{pmatrix} \tfrac{1}{2} \\ 0 \end{pmatrix} = \frac{1}{4}.
\end{aligned}$$

Fig. 2. DRW algorithm.

Similarly
$$\begin{aligned}
\Pr[3 \xrightarrow{<3} 3] &= p_3 + \bigl( \Pr[1 \xrightarrow{\le 1} 3] \;\; \Pr[2 \xrightarrow{\le 2} 3] \bigr) \begin{pmatrix} 1 & 0 \\ -\Pr[1 \xrightarrow{\le 1} 2] & 1 \end{pmatrix}^{-1} \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} \\
&= 0 + \begin{pmatrix} \tfrac{1}{3} & \tfrac{1}{2} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ -\tfrac{1}{3} & 1 \end{pmatrix}^{-1} \begin{pmatrix} \tfrac{1}{2} \\ 0 \end{pmatrix} = \frac{1}{4}.
\end{aligned}$$

Therefore, we can validate that $\Pr[3 \xrightarrow{\le 3} 4] = \Pr[3 \xrightarrow{<3} 4] / (1 - \Pr[3 \xrightarrow{<3} 3]) = 1/3$ and $E_3 = 1/(1 - \Pr[3 \xrightarrow{<3} 3]) = 4/3$.
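The two computations above can be checked mechanically. The sketch below (Python, exact rational arithmetic; the probabilities $1/3$, $1/2$, and $p_1 = 1/2$ are the values quoted from Fig. 1(a) in the text and are taken here as given) verifies $\Pr[3 \xrightarrow{<3} 4] = 1/4$ and the Neumann-series identity used in the proof of Lemma 3.

```python
from fractions import Fraction as F

# Values quoted in the worked example (treated as given):
p1, p2, p4 = F(1, 2), F(0), F(0)            # one-step prefix probabilities of N_3
P14, P24, P12 = F(1, 3), F(1, 2), F(1, 3)   # Pr[1 ->(<=1) 4], Pr[2 ->(<=2) 4], Pr[1 ->(<=1) 2]

# Lhat_{1:2,1:2} = [[1, 0], [-P12, 1]]; for a 2x2 unit lower triangular
# matrix, inversion just flips the sign of the subdiagonal entry.
Linv = [[F(1), F(0)], [P12, F(1)]]

# Neumann-series check: sum_{m=1}^{2} (I - Lhat)^{m-1} = I + (I - Lhat).
I2 = [[F(1), F(0)], [F(0), F(1)]]
ImL = [[F(0), F(0)], [P12, F(0)]]           # I - Lhat, strictly lower triangular
neumann = [[I2[r][c] + ImL[r][c] for c in range(2)] for r in range(2)]

# Lemma 3 applied to i = 4 with k = 3:
q1 = Linv[0][0] * p1 + Linv[0][1] * p2
q2 = Linv[1][0] * p1 + Linv[1][1] * p2
Pr_3lt3_4 = p4 + P14 * q1 + P24 * q2

# Validation step from the text, using Pr[3 ->(<3) 3] = 1/4:
Pr_3lt3_3 = F(1, 4)
Pr_3le3_4 = Pr_3lt3_4 / (1 - Pr_3lt3_3)
E3 = 1 / (1 - Pr_3lt3_3)
```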

Clearly, if one chooses $N_k$ such that the vector $p$ can be computed without performing Monte Carlo simulations, then Lemmas 1–3 eliminate the need to simulate not only the random walks with an unbounded number of steps but all of them. This motivates us to call our proposed technique deterministic random walk.

B. Proposed Preconditioner

Based on (5) and (6), and Lemmas 1–3, we design the DRW algorithm that computes an approximated LDLT factorization of A. As shown in Fig. 2, the DRW algorithm depends on the choice of the prefix set $N_k$ and a function $f$ that computes the column $k$ from approximations of $\Pr[k \xrightarrow{<k} i]$ using some dropping and compensation scheme. Note that we do not explicitly initialize L and D in the algorithm since one only needs to store the nonzero off-diagonal elements in the lower triangle of L and the diagonal elements of D, all of which will be computed in the algorithm.

The For loop at line 3 can be reorganized to simplify both the presentation and the implementation, and to facilitate sparse matrix operations. Define $L_{n+1,1:k-1} = -\mathbf{1}^T L_{1:n,1:k-1}$. Let $q$ be an $(n+1) \times 1$ vector with $q_{1:k-1} = L_{1:k-1,1:k-1}^{-1} p_{1:k-1}$, $q_{k:n}$ computed as in line 4, and $q_{n+1} = p_{n+1} - L_{n+1,1:k-1} q_{1:k-1}$. We have
$$q = \begin{pmatrix} L_{1:k-1,1:k-1} & 0 \\ L_{k:n+1,1:k-1} & I \end{pmatrix}^{-1} p. \tag{10}$$

In other words, the For loop at line 3 can be replaced by a partial forward substitution on L up to the first k − 1 columns.
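A minimal dense sketch of this partial forward substitution (Python, exact rationals, 0-based indices; the small L and p below are invented illustrative data, not from the paper). It applies (10) column by column; note how the total probability mass $\sum_{i \ge k} q_i$ stays 1 and no entry decreases.

```python
from fractions import Fraction as F

def partial_forward_substitution(L, p, k):
    """Compute q per (10): apply the first k-1 columns of the unit lower
    triangular L to p. The last row of L holds -1^T L_{1:n,1:k-1} so that
    probability mass is conserved. 0-based: columns 0..k-2 are used."""
    q = list(p)
    for j in range(k - 1):
        for i in range(j + 1, len(q)):
            q[i] -= q[j] * L[i][j]   # off-diagonals are nonpositive, so q grows
    return q

# Illustrative 3x3 case (n = 2, last row is the n+1 "ground" row):
L = [[F(1), F(0), F(0)],
     [F(-1, 2), F(1), F(0)],
     [F(-1, 2), F(-1), F(1)]]
p = [F(1, 2), F(1, 2), F(0)]         # nonnegative, sums to 1
k = 2
q = partial_forward_substitution(L, p, k)
```

For this data only column 1 is touched and $q = (1/2,\ 3/4,\ 1/4)$, with $q_2 + q_3 = 1$ and $q_i \ge p_i$, matching the loop invariants established below.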


Since it is reasonable to assume p to be sparse and L is also sparse, such computation can be done very efficiently without visiting all the k − 1 columns, leveraging the algorithm proposed in [13]. To be more specific, one should first perform a depth-first search in the extended matrix graph of $L_{1:k-1,1:k-1}$, starting from the vertices corresponding to the nonzero elements in $p_{1:k-1}$ and stopping at k, to obtain a topological ordering of the reachable vertices, and then follow such ordering to use the columns of L.
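The reachability step can be sketched as follows (Python; `L_cols` maps a column to the row indices of its nonzero entries and is an illustrative stand-in for the real sparse storage):

```python
def reach_topological(L_cols, p_nonzeros, k):
    """Depth-first search in the matrix graph of L_{1:k-1,1:k-1} (edge j -> i
    whenever l_{i,j} != 0), started from the nonzero positions of p_{1:k-1}
    and stopped at k. Returns the reachable columns in topological order, so
    a sparse forward substitution can use exactly those columns, in the
    spirit of Gilbert and Peierls [13]."""
    order, visited = [], set()

    def dfs(j):
        visited.add(j)
        for i in L_cols.get(j, ()):
            if i < k and i not in visited:
                dfs(i)
        order.append(j)              # postorder; reversed, it is topological

    for j in p_nonzeros:
        if j < k and j not in visited:
            dfs(j)
    order.reverse()
    return order

# Toy pattern: column 1 feeds rows 2 and 4, column 2 feeds row 3, etc.
cols = {1: {2, 4}, 2: {3}, 3: {4}}
order = reach_topological(cols, {1}, k=4)
```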

On the other hand, the choice of f is essential for the preconditioners D and L to meet the conditions P1–P3. Before we present the requirements on f, we establish the following three sets of invariants for the For loop at line 1.

Lemma 4: At line 3, if the first k − 1 columns of L satisfy P1, then $\sum_{i=k}^{n+1} q_i = 1$, and $q_i \ge p_i \ge 0$, $\forall 1 \le i \le n+1$.

Proof: Consider the partial forward substitution in (10) that computes q from p using the first k − 1 columns of L. We may implement this computation as follows: first, q is initialized as p; then, in the jth iteration for j = 1, 2, . . . , k − 1, we increase $q_i$ for i = j + 1, j + 2, . . . , n + 1 by $-q_j l_{i,j}$.

Since $L_{n+1,1:k-1} = -\mathbf{1}^T L_{1:n,1:k-1}$ by definition, it can be proved by induction that at the end of the $j$th iteration, we have $\sum_{i=j+1}^{n+1} q_i = \sum_{i=1}^{n+1} p_i = 1$. Therefore, at the end of the $(k-1)$th iteration, we have $\sum_{i=k}^{n+1} q_i = 1$.

On the other hand, the condition P1 ensures that the $l_{i,j}$ are nonpositive for all $j < i \le n+1$. Therefore, since all the elements of p are nonnegative (as they are probabilities), by induction, no element of q decreases during the computation. Thus $q_i \ge p_i \ge 0$ for all $1 \le i \le n+1$.

Intuitively, Lemma 4 indicates that the condition P1 can be satisfied by requiring f to return a vector with nonnegative elements that add up to no more than 1.

Lemma 5: At line 3, in addition to P1, if the first k − 1 columns of L further satisfy P2, then $q_i \ne 0$ only if $\hat{l}_{i,k} \ne 0$ for $k < i \le n$.

Proof: Consider $i$ and $j$ satisfying $1 \le j \le k-1$ and $j < i \le n$ such that $l_{i,j} \ne 0$. The condition P2 implies that $\hat{l}_{i,j} \ne 0$. According to (6), we must have that the set $j \xrightarrow{\le j} i$ is not empty.

On the other hand, based on the definition of $N_k$, $p_i \ne 0$ implies that the set $k \xrightarrow{<k} i$ is not empty for $1 \le i \le n$.

Therefore, similar to the proof of Lemma 4, it can be proved by induction that at the end of the $j$th iteration of the partial forward substitution for $j = 1, 2, \ldots, k-1$, $q_i \ne 0$ implies that the set $k \xrightarrow{<k} i$ is not empty for $j < i \le n$. Thus, at the end of the computation, $q_i \ne 0$ implies that the set $k \xrightarrow{<k} i$ is not empty for $k < i \le n$, and thus $\hat{l}_{i,k} \ne 0$ according to (6).

Note that it is possible to have $q_{n+1} \ne 0$ with an empty set $k \xrightarrow{<k} n+1$ for certain $k$ due to the choice of $f$, so the above lemma will not necessarily hold for $i = n+1$. On the other hand, Lemma 5 indicates that the condition P2 can be satisfied by requiring $f$ not to introduce additional nonzero elements.

To establish the last set of invariants, we define the height $h_j$ of a vertex $j$ to be the shortest distance from $j$ to $n+1$ in $G^*_T$, with each edge contributing a distance of 1. This definition is clearly consistent since $n+1$ is reachable from any other vertex. Line 5 implies that the condition P3 is equivalent to $q_k < 1$. According to Lemma 4, it is sufficient to require $q_s > 0$ for some $k+1 \le s \le n+1$.

Lemma 6: Assume that at line 3, the first k − 1 columns of L satisfy P1. If for any $1 \le j < k$ there exists $j < i \le n+1$ such that $l_{i,j} \ne 0$ and $h_j > h_i$, then there exists $k+1 \le s \le n+1$ such that $q_s > 0$ and $h_k > h_s$.

Proof: According to the definition of the height, we have $h_k > 0$. Moreover, there exists a shortest path from $k$ to $n+1$ since $n+1$ is reachable from $k$. The definition of $N_k$ thus implies that there exists some $j_1$ along the shortest path satisfying $p_{j_1} > 0$ and $h_k > h_{j_1}$. If $j_1 > k$, we have proved the lemma with $s = j_1$, since $q_{j_1} \ge p_{j_1}$ according to Lemma 4 and thus $q_{j_1} > 0$. Hence, we only consider the case $j_1 < k$ as follows.

Since $j_1 < k$, our assumptions in the lemma imply that there exists some $j_2$ satisfying $l_{j_2,j_1} \ne 0$ and $h_{j_1} > h_{j_2}$. Similar to the case for $j_1$, we can either have $j_2 > k$ or $j_2 < k$. If $j_2 < k$, we can rely on the same reasoning to obtain $j_3$ and so on, with the heights of the newly found vertices decreasing monotonically. As the height of a vertex must be a nonnegative integer, this process must terminate eventually with a sequence of vertices, denoted by $j_1, j_2, \ldots, j_m$ for certain $m$, such that first, $h_{j_1} > h_{j_2} > \cdots > h_{j_m}$; second, $\prod_{t=1}^{m-1} l_{j_{t+1},j_t} \ne 0$; and third, $j_m > k$.

Now let $s = j_m$; clearly $k+1 \le s \le n+1$ and $h_k > h_s$. According to the proof of Lemma 4, we have $q_s \ge p_{j_1} \prod_{t=1}^{m-1} (-l_{j_{t+1},j_t})$. As both $p_{j_1}$ and $\prod_{t=1}^{m-1} (-l_{j_{t+1},j_t})$ are positive, we must have $q_s > 0$.

In summary, the following theorem states the correctness of the DRW algorithm and defines a set of sufficient requirements that $f$ should satisfy to guarantee the conditions P1–P3.

Theorem 1: The DRW algorithm will always terminate. The conditions P1–P3 will be satisfied if f returns a vector satisfying the following requirements.

1) The elements are all nonnegative and their sum is no more than 1.

2) An element is nonzero only if the corresponding $q_i$ is nonzero.

3) There is at least one nonzero element whose corresponding $q_s$ satisfies $h_s < h_k$.

Proof: It is obvious that the DRW algorithm terminates as long as f terminates.

The first requirement and Lemma 4 imply P1. The second requirement and Lemma 5 imply P2. Finally, the third requirement and Lemmas 4 and 6 imply P3.

Our DRW algorithm shares some similarities with incomplete LDLT factorization, e.g., both compute a column of L using the columns that are already computed. We emphasize that the two algorithms are quite different, even if the function f in the DRW algorithm is chosen to be the same as the dropping scheme in (3). Obviously, both (2) and (3) use not only $L_{1:n,1:k-1}$ but also $D_{1:k-1}$, whereas our DRW algorithm depends only on $L_{1:n,1:k-1}$. This difference makes our algorithm insensitive to errors in the D matrix. On the other hand, (3) indicates that the column k of L is computed from the columns of L corresponding to the nonzero elements


in $L_{k,1:k-1}$. Although such columns are the same as the columns corresponding to the aforementioned set of reachable vertices in our algorithm when the factorization is exact, they differ when some elements are dropped from L. We observed that our algorithm will usually locate more columns than incomplete LDLT factorization, columns that would appear in an exact factorization, resulting in a better approximation at the cost of more computation time. We believe that such cost is the price one has to pay to achieve better quality via a more flexible dropping and compensation scheme.

IV. IMPLEMENTATION DETAILS

There are many ways to implement the various components of the DRW algorithm while still satisfying the requirements in Theorem 1. We present in this section our choices when implementing the DRW algorithm for power grid analysis, together with a system simplification (SS) approach that may reduce the solver running time.

A. Matrix Ordering

Although Theorem 1 does not impose any restriction on how the vertices should be ordered, we have observed that the matrix ordering does impact both the time to compute the preconditioner and its quality (and thus the time to solve the problem iteratively). Our implementation may choose either the reverse Cuthill–McKee (RCM) ordering [9], to reduce the envelope of the matrices, or the approximate minimum degree (AMD) ordering [1], to reduce the number of fill-ins.

In general, we prefer AMD over RCM since we found in our experiments that it leads to DRW preconditioners of better quality in all cases. For systems with large bandwidth due to couplings between the vertices, e.g., for transient simulation, the AMD ordering also results in less time to compute the preconditioner, since fewer fill-ins mean fewer computations, despite the extra time to obtain the ordering in comparison with RCM. For systems with narrow bandwidth and a large number of levels, e.g., for dc analysis, the RCM ordering can be a better choice since it takes less time to compute the preconditioner and the preconditioner quality is comparable with that obtained using AMD.

B. Set Nk

We choose $N_k$ to be the set of all the one-step prefixes, i.e., the one-step random walks from k. The p on line 2 of the DRW algorithm is computed directly from $A_{1:n,k}$, i.e., $p_j = -a_{j,k}/a_{k,k}$ for $j \ne k$ and $p_k = 0$. Note that we store the whole matrix A to keep this computation simple, instead of storing only half of A by exploiting its symmetry.
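For this choice of $N_k$, computing p is a single column scan. A sketch (Python, exact rationals; the small symmetric diagonally dominant A below is invented for illustration):

```python
from fractions import Fraction as F

def one_step_prefix_probabilities(A, k):
    """p_j = -a_{j,k} / a_{k,k} for j != k and p_k = 0 (0-based index k).
    When row k is strictly diagonally dominant, the slack 1 - sum(p) is the
    probability of stepping to the ground vertex n+1."""
    akk = A[k][k]
    return [F(0) if j == k else -A[j][k] / akk for j in range(len(A))]

A = [[F(4), F(-1), F(-2)],
     [F(-1), F(3), F(-1)],
     [F(-2), F(-1), F(4)]]
p = one_step_prefix_probabilities(A, 0)
```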

Other choices of $N_k$ may improve the quality of the preconditioner at the cost of more running time to compute or to approximate p. In the extreme case, $N_k$ could include all the random walks themselves in $\bigcup_{i=k}^{n+1} k \xrightarrow{<k} i$. In such a case, $p_j = 0$ for all $j > k$, and $p_j$ for $j \le k$ may be approximated by Monte Carlo simulations, leading to a variant of the stochastic random walk preconditioner in [32] utilizing Lemmas 1 and 2. Practically, we have also experimented with the choice of $N_k$ that consists of all the one-step prefixes ending at a vertex larger than k, and all the two-step prefixes. The exact p for this case can be computed via a breadth-first search on the extended matrix graph of A, though the improvements in the quality of the preconditioner and the overall running time are rather marginal. It remains an open problem whether there exist certain choices of $N_k$ that could improve the performance of deterministic random walk preconditioning considerably, possibly for specific power grid structures.

C. Dropping and Compensation Scheme

We specify the dropping and compensation scheme in f at two levels: globally and per column.

Globally, three parameters should be provided: a filling factor γ specifying the desired ratio of the number of nonzero off-diagonal elements in L to that in A; a keep tolerance δ+ such that any element of L whose absolute value is larger is always kept; and a drop tolerance δ− such that any element of L whose absolute value is smaller is always dropped. We do not actually tune these parameters in our experiments. The parameter γ is used to constrain the size, and thus the memory usage, of the preconditioner. The parameter δ+ is always set to 0.05 as a safeguard in case the probabilities of the random walks are distributed evenly. The parameter δ− is always set to 0.00005. Our experimental results are not sensitive to the values of δ+ and δ− unless the former is too small or the latter is too large, e.g., 0.001.

Column-wise, when the column k of L should be computed using f, we first compute the overall budget of the nonzero off-diagonal elements in the remaining columns of L, i.e., $L_{1:n,k:n}$, from the number of the nonzero off-diagonal elements in A and $L_{1:n,1:k-1}$ and the global filling factor γ. Then, we compute Γ, the per-column budget of the nonzero off-diagonal elements for the column k, as the overall budget divided by n − k + 1. In other words, we assume that the remaining nonzero off-diagonal elements in L should be distributed evenly across the remaining columns. If the computed Γ is less than 2, we increase it to 2. For the function f, we first compute $q_{n+1}$ as $1 - \sum_{i=k+1}^{n} q_i$ and then find the largest $q_s$ from $q_{k+1}, q_{k+2}, \ldots, q_{n+1}$ such that $h_k > h_s$, which is guaranteed by Lemma 6 to be nonzero. We replace all but the largest Γ elements that are at least δ− in $q_{k+1}, q_{k+2}, \ldots, q_n$ by 0. If any element being replaced is larger than δ+, or $q_s$ is replaced, we replace it back. We found in our experiments that when we choose $q_s$ as the aforementioned largest one, in most cases it is one of the largest Γ elements and no additional nonzero element is introduced into the returned vector. Let the resulting elements be $q'_{k+1}, q'_{k+2}, \ldots, q'_n$. The function f then returns a compensated vector as follows:

$$f(q_k, q_{k+1}, \ldots, q_n) = \frac{\sum_{i=k+1}^{n} q_i}{\sum_{i=k+1}^{n} q'_i} \left( \frac{q'_{k+1}}{1 - q_k},\ \frac{q'_{k+2}}{1 - q_k},\ \ldots,\ \frac{q'_n}{1 - q_k} \right)$$

i.e., the dropped elements are added to the remaining ones proportionally. Note that if all of $q_{k+1}, q_{k+2}, \ldots, q_n$ are 0, f simply returns 0.
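A simplified sketch of f (Python, exact rationals). The height-based selection of $q_s$, which guarantees a nonzero survivor, is omitted here for brevity, and the budget Γ, the tolerances, and the q vector are illustrative assumptions, not values from the paper:

```python
from fractions import Fraction as F

def f_drop_and_compensate(q, k, budget, delta_keep=F(1, 20), delta_drop=F(1, 20000)):
    """Keep the `budget` largest entries of q_{k+1:} that reach delta_drop
    (plus any entry above delta_keep), zero the rest, rescale the survivors
    so their sum is unchanged, and divide by 1 - q_k. The height-based
    guarantee of a nonzero survivor (q_s) is omitted in this sketch."""
    tail = q[k + 1:]
    total = sum(tail)
    if total == 0:
        return [F(0)] * len(tail)
    ranked = sorted(range(len(tail)), key=lambda i: tail[i], reverse=True)
    keep = {i for i in ranked[:budget] if tail[i] >= delta_drop}
    keep |= {i for i in range(len(tail)) if tail[i] > delta_keep}
    kept = [tail[i] if i in keep else F(0) for i in range(len(tail))]
    scale = total / sum(kept)                  # proportional compensation
    return [scale * v / (1 - q[k]) for v in kept]

q = [F(1, 4), F(3, 8), F(1, 4), F(1, 32), F(1, 32)]   # q_k followed by q_{k+1..}
out = f_drop_and_compensate(q, 0, budget=2)
```

With these numbers the two smallest entries are dropped, yet the returned vector still sums to $\sum_{i>k} q_i / (1 - q_k)$, which is the property the compensation scheme is designed to preserve.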


Overall, our dropping scheme is an extension of [19], while the compensation scheme, though intuitive, cannot be applied there directly.

D. System Simplification via Partial Exact Factorization

For the public IBM power grid benchmarks [24], [25], especially those for transient simulation, there are rows and columns with no more than 3 nonzero off-diagonal elements in the system matrix A. As A can be reordered such that many of these columns are factored exactly without introducing additional fill-ins, we may exclude them from the iterative solver to reduce the running time via a preprocessing step that computes a Schur complement in A by partial exact factorization. A postprocessing step then recovers the original solution without affecting its accuracy. Overall, our SS approach can be treated as a combination of the direct and iterative solvers, where the direct solver handles the part of the system that can be easily factored, thereby reducing the running time the iterative solver would otherwise spend on that easy part of the system.

For $1 \le m < n$, let $S_m$ be the Schur complement of $A_{1:m,1:m}$ in A:
$$S_m = A_{m+1:n,m+1:n} - A_{m+1:n,1:m} A_{1:m,1:m}^{-1} A_{1:m,m+1:n}.$$

We can rewrite $Ax = b$ as
$$\begin{bmatrix} \hat{L}_{1:m,1:m} & 0 \\ \hat{L}_{m+1:n,1:m} & I \end{bmatrix} \begin{bmatrix} \hat{D}_{1:m,1:m} & 0 \\ 0 & S_m \end{bmatrix} \begin{bmatrix} \hat{L}_{1:m,1:m}^T & \hat{L}_{m+1:n,1:m}^T \\ 0 & I \end{bmatrix} x = b. \tag{11}$$

In other words, one can solve the original system by solving the Schur complement $S_m$ and performing a pair of partial forward/back substitutions using the first m columns of $\hat{L}$. According to [4], $S_m$ is a nonsingular symmetric diagonally dominant matrix with nonpositive off-diagonal elements. Therefore, we can still apply the DRW algorithm to $S_m$ to obtain a preconditioner and solve it iteratively. The concerns with this approach, however, are that first, storing $\hat{L}_{1:n,1:m}$ and $S_m$ explicitly could introduce tremendous storage overhead over A, and second, the reduction in the time to solve the problem iteratively could be marginal, so that the extra work required is not worthwhile.

To address the first concern, we restrict m such that none of the first m columns of $\hat{L}$ has more than 3 off-diagonal nonzeros. Note that for any column of $\hat{L}$ with no more than 2 off-diagonal nonzeros, no fill-in is introduced into the Schur complement, and for any column with 3 off-diagonal nonzeros, there are at most 3 fill-ins. This implies that the number of off-diagonal nonzeros in $S_m$ will be no more than that of A. Therefore, if we use the same γ to control the size of the preconditioner of $S_m$ as that of A, as mentioned in Section IV-C, then in the worst case we need extra storage only for the 3m nonzeros from $\hat{L}_{1:n,1:m}$. On the other hand, if we further consider the iterative solver that solves the Schur complement instead of the original system, the overall storage requirement may actually be less, since the vectors participating in the computation will consist of only n − m, instead of n, elements.

TABLE I

RESULT VALIDATION FOR DC ANALYSIS

To address the second concern, we adopt an iterative greedy heuristic to reorder A to maximize m, and to compute $S_m$, $\hat{D}_{1:m,1:m}$, and $\hat{L}_{1:n,1:m}$ simultaneously during the reordering. Initially, we set m = 0 and $S_m$ = A. In each iteration, we choose an arbitrary column of $S_m$ that has the minimum number of off-diagonal nonzeros. We terminate the process if this number is larger than 3 or if m = n. Otherwise, we reorder $S_m$ (and thus A) such that the chosen column is the next to be eliminated. Note that since eliminating a column with 3 off-diagonal nonzeros may introduce fill-ins into other columns with 3 off-diagonal nonzeros, this heuristic does not necessarily lead to the maximum m across all possible reorderings of A, though our experimental results show that a considerable reduction can still be achieved for the aforementioned transient simulation benchmarks.
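The symbolic part of this heuristic can be sketched on the adjacency structure alone (Python; eliminating a vertex connects its neighbors into a clique, mirroring the fill-in of one elimination step; the numeric factorization and the tie-breaking of the real implementation are omitted):

```python
def greedy_partial_elimination(adj):
    """Repeatedly eliminate a vertex with at most 3 neighbors (off-diagonal
    nonzeros), cliquing its neighbors to model fill-in, and return m, the
    number of vertices eliminated before every remaining vertex has more
    than 3 neighbors. `adj` maps a vertex to the set of its neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}   # private copy
    m = 0
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))
        if len(adj[v]) > 3:
            break
        for u in adj[v]:
            adj[u].update(adj[v])     # fill-ins: neighbors become a clique
            adj[u].discard(u)
            adj[u].discard(v)
        del adj[v]
        m += 1
    return m

path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}   # a 5-vertex path
clique = {v: set(range(5)) - {v} for v in range(5)}        # K5, all degrees 4
```

A path grid is eliminated entirely (m = 5), while a dense 5-clique is left untouched (m = 0), illustrating why the approach pays off mainly for the loosely connected rows of the transient benchmarks.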

V. EXPERIMENTS

A. Experimental Setup and Result Validation

We implemented our DRW algorithm and a PCG solver in C++ to solve power grid analysis problems. For comparison, we implemented in our solver the incomplete LDLT factorization preconditioner as described in Section II-B using the data structure proposed in [19]. The same dropping scheme as our DRW algorithm is applied, without compensation. Moreover, we integrated into our solver the binary library for stochastic random walk [32] to generate the preconditioners. Note that the library uses an internal matrix ordering we cannot modify, and that we did not use the library to actually solve the equations since we found its solver to be always slower than ours. In addition, the AMD package [1] was used to generate the AMD ordering and the CHOLMOD package [7] was employed to provide comparisons between the iterative and direct solvers. All programs were built with GCC version 4.3 and we ran all the experiments on a 64-bit Linux workstation with a 2.4-GHz Intel Q6600 processor and 8-GB memory. Note that only one core was used for all the experiments.

We first validated our implementation using the public IBM power grid benchmarks [24], [25], [29] against their published golden solutions. The results of the largest six circuits for dc analysis are reported in Table I, and a comparison with the state-of-the-art AMG–PCG solver PowerRush [42] is provided. Similar to [42], we preprocessed the SPICE netlists to treat resistances less than 10−6 Ω as shorts. To reduce the time to compute the DRW preconditioner, the RCM ordering was used and γ was set to 1. The PCG iterations were terminated


when the l2 norm of the residue was reduced to 10−3 of its original value. In Table I, we first show the dimension of A (n) and the number of nonzero elements in A (nnzA). Then, for each solver, we show the total solving time in seconds (tsol) and the maximum/average errors with respect to the golden solution in μV (Err.). For our solver, the column tsol includes the time to reorder A, to compute the preconditioner, to perform the PCG iterations, and to compute the maximum/average errors. For PowerRush, the two columns were obtained from [42, Table 1], and we compensated for the different processors used for the experiments by scaling down the total solving time by a factor of 2.13/2.40. Although different processor microarchitectures may also affect the solving times, we believe that the comparison is reasonably fair, so we can conclude that our PCG solver with DRW preconditioning generates results with better accuracy in less time than PowerRush. For our DRW preconditioner, we further show the number of nonzero elements in L (nnzL), which can be used to estimate the memory usage. Since storing both L and D requires 12 nnzL bytes of memory for 4-byte integer indices and 8-byte floating point values, the maximum memory usage for our DRW preconditioner was less than 50 MB for all the dc analysis benchmarks.

The results of the six circuits for transient simulation are reported in Table II. Each SPICE netlist was first processed to generate two system matrices: one for the dc operating point computation and the other for each discretized time step via the trapezoidal rule. We only report the latter in the table as it dominated the memory usage and the running time. The SS technique mentioned in Section IV was applied to simplify both system matrices. Our DRW preconditioners were then computed after the AMD ordering for two settings of γ. The numbers of nonzero elements in the partial exact factorization and the preconditioner are collectively reported under the columns nnzL. The maximum memory usage for our DRW preconditioner was thus less than 150 MB for all the transient simulation benchmarks. Since the trapezoidal rule was applied for discretization, we used the solution from two time steps back as the initial guess for the current time step and terminated the PCG iterations when the l2 norm of the residue was reduced to 10−3 of its original value. The columns Err. confirm this choice, as the maximum/average errors were close to those obtained by the direct solvers [40], [43]. The total running time (ttot) includes everything from reading the SPICE netlist to writing the simulation outputs. It can be seen that our DRW preconditioner is able to explore the tradeoff between memory usage and solution time: the latter can be reduced by 33% with less than 50% memory overhead for the preconditioner.

Overall, Table II shows that our solver is able to complete all six circuits for transient simulation within 1 h with a reasonable amount of memory overhead. It is known that direct solvers are the most efficient for transient simulation when there is enough memory, and all the winners [40], [43], [44] of the recent TAU Power Grid Simulation Contest [24] were based on direct solvers. The results shown later in Table IV imply a similar running time and memory overhead as reported in the above publications: on our workstation, all the transient

TABLE II

RESULT VALIDATION FOR TRANSIENT SIMULATION

simulations would require a total running time of 20 min using a solver based on CHOLMOD [7], which would be 3× faster than our solver but need more than 5× the fill-ins. On the other hand, in comparison with the solvers based on stochastic random walk [32] and incomplete LDLT factorization, Table IV implies that they would be at least 2× slower than our solver with the same number of fill-ins.

B. Detailed Comparison

We then provide a detailed comparison of our DRW preconditioner and the other two preconditioners, i.e., stochastic random walk [32] and incomplete LDLT factorization, on the two sets of benchmarks, respectively. For all the benchmarks, the PCG solver solved the same 100 random vectors for all the preconditioners and was terminated when the l2 norm of the residue was reduced to 10−6 of its original value, or after a maximum of 200 iterations.

The results for the dc analysis benchmarks are shown in Table III. For both our DRW preconditioner and incomplete LDLT factorization, we used the same RCM ordering and set γ to 1.7 to match the total number of nonzero elements in L (nnzL) across all the benchmarks. In comparison with our DRW preconditioner, it can be seen that incomplete LDLT factorization spent very little time computing the preconditioner (tpre) but required about 1.7× as many PCG iterations (#it.) and about 1.7× the time to solve the problems in the PCG solver (tpcg). The stochastic random walk was 3.7× slower in computing the preconditioner but required 20% fewer PCG iterations. However, it was 2.2× slower in actually solving the problems, due to its internal ordering, which may incur considerably more overhead than the RCM ordering for sparse matrix–vector multiplications. Note that when the SS technique mentioned in Section IV and the AMD ordering were applied to the dc analysis benchmarks, the results for our DRW preconditioner were slightly better, though those for the other two were slightly worse.

The results for the transient simulation benchmarks are shown in Table IV. The system matrix was first obtained via the trapezoidal rule. We then applied the SS technique since we found it to be effective for all three preconditioners. For both our DRW preconditioner and incomplete LDLT factorization, we used the same AMD ordering [1] since we found that it led to less time to compute the preconditioner and better convergence. Moreover, we set γ to 1.7 for both


TABLE III

DETAILED COMPARISON FOR DC ANALYSIS BENCHMARKS

TABLE IV

DETAILED COMPARISON FOR TRANSIENT SIMULATION BENCHMARKS

of them to match the total number of nonzero elements in the partial exact factorization and the preconditioner (nnzL). It can be seen that the performance of incomplete LDLT factorization degraded on this set of benchmarks: in comparison with our DRW preconditioner, it used 40% of the time to compute the preconditioner but required about 3× as many PCG iterations and about 3× the time to solve the problems, with certain runs unable to converge within 200 iterations. The stochastic random walk was 6.8× slower in computing the preconditioner and 2.4× slower in solving the problems, though requiring almost the same number of PCG iterations.

We further report the results of the direct solver CHOLMOD [7] in both tables. Clearly, although the time to solve the problem by a pair of forward/back substitutions (tfbs) was much smaller than that of our iterative solver using the DRW preconditioner, there were 8× more fill-ins in the exact factors, and the time to compute them (tfac) was greater than that to compute our DRW preconditioners, making our technique a better choice if the exact factors cannot be held in memory or if the system is only solved a few times.

VI. CONCLUSION

In this paper, we presented the deterministic random walk preconditioning technique for power grid analysis via PCG solvers. Compared with stochastic random walk preconditioning, our proposed algorithm computes the preconditioners in a deterministic manner to reduce computation time. Compared with incomplete LDLT factorization, our proposed algorithm leverages a compensation scheme to improve preconditioner quality while maintaining correctness. Solvers built on top of our proposed preconditioner were able to outperform other state-of-the-art iterative solvers for dc analysis and transient simulation.

REFERENCES

[1] P. Amestoy, T. A. Davis, and I. S. Duff, “Algorithm 837: AMD, an approximate minimum degree ordering algorithm,” ACM Trans. Math. Softw., vol. 30, no. 3, pp. 381–388, Sep. 2004.

[2] M. Benzi, C. D. Meyer, and M. Tuma, “A sparse approximate inverse preconditioner for the conjugate gradient method,” SIAM J. Sci. Comput., vol. 17, no. 5, pp. 1135–1149, 1996.

[3] M. Bern, J. R. Gilbert, B. Hendrickson, N. Nguyen, and S. Toledo, “Support-graph preconditioners,” SIAM J. Matrix Anal. Appl., vol. 27, no. 4, pp. 930–951, 2006.

[4] D. Carlson and T. L. Markham, “Schur complements of diagonally dominant matrices,” Czechoslovak Math. J., vol. 29, no. 2, pp. 246–251, 1979.

[5] H. H. Chen and D. D. Ling, “Power supply noise analysis methodology for deep-submicron VLSI chip design,” in Proc. 34th Design Autom. Conf., Jun. 1997, pp. 638–643.

[6] T.-H. Chen and C. C. Chen, “Efficient large-scale power grid analysis based on preconditioned Krylov-subspace iterative methods,” in Proc. Design Autom. Conf., 2001, pp. 559–562.

[7] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, “Algorithm 887: CHOLMOD, supernodal sparse Cholesky factorization and update/downdate,” ACM Trans. Math. Softw., vol. 35, no. 3, pp. 22:1–22:14, Oct. 2008.

[8] C.-H. Chou, N.-Y. Tsai, H. Yu, C.-R. Lee, Y. Shi, and S.-C. Chang, “On the preconditioner of conjugate gradient method—A power grid simulation perspective,” in Proc. Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 494–497.

[9] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in Proc. ACM 24th Nat. Conf., 1969, pp. 157–172.

[10] A. Dharchoudhury, R. Panda, D. Blaauw, R. Vaidyanathan, B. Tutuianu, and D. Bearden, “Design and analysis of power distribution networks in PowerPC microprocessors,” in Proc. Design Autom. Conf., Jun. 1998, pp. 738–743.

[11] Z. Feng and P. Li, “Multigrid on GPU: Tackling power grid analysis on parallel SIMT platforms,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2008, pp. 647–654.

[12] Z. Feng, X. Zhao, and Z. Zeng, “Robust parallel preconditioned power grid simulation on GPU with adaptive runtime performance modeling and optimization,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 4, pp. 562–573, Apr. 2011.

[13] J. R. Gilbert and T. Peierls, “Sparse partial pivoting in time proportional to arithmetic operations,” SIAM J. Sci. Statist. Comput., vol. 9, no. 5, pp. 862–874, Sep. 1988.


[14] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD, USA: The Johns Hopkins Univ. Press, 1996.

[15] I. Gustafsson, “A class of first order factorization methods,” BIT Numer. Math., vol. 18, no. 2, pp. 142–156, Jun. 1978.

[16] C.-W. Ho, A. E. Ruehli, and P. A. Brennan, “The modified nodal approach to network analysis,” IEEE Trans. Circuits Syst., vol. 22, no. 6, pp. 504–509, Jun. 1975.

[17] X. Hu, W. Zhao, P. Du, A. Shayan, and C.-K. Cheng, “An adaptive parallel flow for power distribution network simulation using discrete Fourier transform,” in Proc. 15th Asian South Pacific Design Autom. Conf., Jan. 2010, pp. 125–130.

[18] Y.-M. Jiang and K.-T. Cheng, “Analysis of performance impact caused by power supply noise in deep submicron devices,” in Proc. 36th Design Autom. Conf., 1999, pp. 760–765.

[19] M. T. Jones and P. E. Plassmann, “An improved incomplete Cholesky factorization,” ACM Trans. Math. Softw., vol. 21, no. 1, pp. 5–17, Mar. 1995.

[20] J. N. Kozhaya, S. R. Nassif, and F. N. Najm, “A multigrid-like technique for power grid analysis,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 10, pp. 1148–1160, Oct. 2002.

[21] Y.-M. Lee, Y. Cao, T.-H. Chen, J. M. Wang, and C.-P. Chen, “HiPRIME: Hierarchical and passivity preserved interconnect macromodeling engine for RLKC power delivery,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 6, pp. 797–806, Jun. 2005.

[22] D. Li, S.-D. Tan, and B. McGaughy, “ETBR: Extended truncated balanced realization method for on-chip power grid network analysis,” in Proc. Design Autom. Test Eur., Mar. 2008, pp. 432–437.

[23] P. Li, “Power grid simulation via efficient sampling-based sensitivity analysis and hierarchical symbolic relaxation,” in Proc. 42nd Design Autom. Conf., Jun. 2005, pp. 664–669.

[24] Z. Li, R. Balasubramanian, F. Liu, and S. Nassif, “2012 TAU power grid simulation contest: Benchmark suite and results,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2012, pp. 643–646.

[25] Z. Li, F. Liu, R. Balasubramanian, and S. Nassif, “TAU 2011 power grid analysis contest,” in Proc. ACM Int. Workshop Timing Issues Specification Synth. Digital Syst., Nov. 2011, pp. 478–481.

[26] T. A. Manteuffel, “An incomplete factorization technique for positive definite linear systems,” Math. Comput., vol. 34, no. 150, pp. 473–497, Apr. 1980.

[27] J. A. Meijerink and H. A. van der Vorst, “An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix,” Math. Comput., vol. 31, no. 137, pp. 148–162, Jan. 1977.

[28] S. R. Nassif and J. N. Kozhaya, “Fast power grid simulation,” in Proc. Design Autom. Conf., 2000, pp. 156–161.

[29] S. R. Nassif, “Power grid analysis benchmarks,” in Proc. Asian South Pacific Design Autom. Conf., 2008, pp. 376–381.

[30] Y. Notay, “Aggregation-based algebraic multilevel preconditioning,” SIAM J. Matrix Anal. Appl., vol. 27, no. 4, pp. 998–1018, 2006.

[31] H. Qian, S. R. Nassif, and S. S. Sapatnekar, “Power grid analysis using random walks,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 8, pp. 1204–1224, Aug. 2005.

[32] H. Qian and S. S. Sapatnekar, “Stochastic preconditioning for diagonally dominant matrices,” SIAM J. Sci. Comput., vol. 30, no. 3, pp. 1178–1204, Mar. 2008.

[33] S. S. Sapatnekar, “Overcoming variations in nanometer-scale technologies,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 1, no. 1, pp. 5–18, Mar. 2011.

[34] J. Shi et al., “GPU friendly fast Poisson solver for structured power grid network analysis,” in Proc. 46th ACM/IEEE Design Autom. Conf., Jul. 2009, pp. 178–183.

[35] G. Steele, D. Overhauser, S. Rochel, and S. Z. Hussain, “Full-chip verification methods for DSM power distribution systems,” in Proc. Design Autom. Conf., Jun. 1998, pp. 744–749.

[36] H. Su, E. Acar, and S. R. Nassif, “Power grid reduction based on algebraic multigrid principles,” in Proc. Design Autom. Conf., 2003, pp. 109–112.

[37] K. Sun, Q. Zhou, K. Mohanram, and D. C. Sorensen, “Parallel domain decomposition for simulation of large-scale power grids,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2007, pp. 54–59.

[38] J. Wang, “Deterministic random walk preconditioning for power grid analysis,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2012, pp. 392–398.

[39] J. M. Wang and T. V. Nguyen, “Extended Krylov subspace method for reduced order analysis of linear circuits with multiple sources,” in Proc. Design Autom. Conf., 2000, pp. 247–252.

[40] X. Xiong and J. Wang, “Parallel forward and back substitution for efficient power grid simulation,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2012, pp. 660–663.

[41] J. Yang, Y. Cai, Q. Zhou, and J. Shi, “Fast Poisson solver preconditioned method for robust power grid analysis,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 531–536.

[42] J. Yang, Z. Li, Y. Cai, and Q. Zhou, “PowerRush: A linear simulator for power grid,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 482–487.

[43] J. Yang, Z. Li, Y. Cai, and Q. Zhou, “PowerRush: Efficient transient simulation for power grid analysis,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2012, pp. 653–659.

[44] T. Yu and M. D. F. Wong, “PGT_SOLVER: An efficient solver for power grid transient analysis,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2012, pp. 647–652.

[45] M. Zhao, R. V. Panda, S. S. Sapatnekar, and D. Blaauw, “Hierarchical analysis of power distribution networks,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 21, no. 2, pp. 159–168, Feb. 2002.

[46] S. Zhao, K. Roy, and C.-K. Koh, “Frequency domain analysis of switching noise on power supply network,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2000, pp. 487–492.

[47] X. Zhao, J. Wang, Z. Feng, and S. Hu, “Power grid analysis with hierarchical support graphs,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2011, pp. 543–547.

[48] Y. Zhong and M. D. F. Wong, “Fast algorithms for IR drop analysis in large power grid,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design, Nov. 2005, pp. 351–357.

[49] Z. Zhu, B. Yao, and C.-K. Cheng, “Power network analysis using an adaptive algebraic multigrid approach,” in Proc. Design Autom. Conf., Jun. 2003, pp. 105–108.

[50] C. Zhuo, J. Hu, M. Zhao, and K. Chen, “Power grid analysis and optimization using algebraic multigrid,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 4, pp. 738–751, Apr. 2008.

Jia Wang (M’08) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 2002, and the Ph.D. degree in computer engineering from Northwestern University, Evanston, IL, USA, in 2008.

He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA. His current research interests include computer-aided design of very large scale integrated circuits and algorithm design.

Xuanxing Xiong (S’11–M’13) received the B.E. degree in electronic science and technology (microelectronics) from the University of Electronic Science and Technology of China, Chengdu, China, in 2005, and the M.S. and Ph.D. degrees in electrical engineering from the Illinois Institute of Technology, Chicago, IL, USA, in 2010 and 2013, respectively.

He is currently a Senior Research and Development Engineer with the Design Group, Synopsys, Inc., Mountain View, CA, USA. His current research interests include algorithm designs for electronic design automation and high-performance computing.

Xingwu Zheng received the bachelor’s degree in electrical engineering from Sichuan University, Chengdu, China, in 2005, and the master’s degree in software engineering from Beihang University, Beijing, China, in 2010. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA.

His current research interests include very large scale integrated computer-aided design, high-performance computing, and parallel computing.