

DISTRIBUTED OPTIMIZATION IN AN ENERGY-CONSTRAINED NETWORK

Alireza Razavi and Zhi-Quan Luo

Department of Electrical and Computer Engineering, University of Minnesota

Minneapolis, MN 55455, U.S.A. Email: {razaOO07, luozq}@umn.edu

ABSTRACT

We consider a distributed optimization problem whereby two nodes S_1, S_2 wish to jointly minimize a common convex quadratic cost function f(x_1, x_2), subject to separate local constraints on x_1 and x_2, respectively. Suppose that node S_1 has control of variable x_1 only and node S_2 has control of variable x_2 only. The two nodes locally update their respective variables and periodically exchange their values over a noisy channel. Previous studies of this problem have mainly focused on the convergence issue and the analysis of convergence rate. In this work, we focus on the communication energy and study its impact on convergence. In particular, we consider a class of distributed stochastic gradient type algorithms implemented using certain linear analog messaging schemes. We study the minimum amount of communication energy required for the two nodes to compute an ε-minimizer of f(x_1, x_2) in the mean square sense. Our analysis shows that the communication energy must grow at least at the rate of Ω(ε⁻¹). We also derive specific designs which attain this minimum energy bound, and provide simulation results that confirm our theoretical analysis. Extension to the multiple node case is described.

Index Terms- Distributed optimization, Sensor networks, Energy constraint, Stochastic gradient method, Convergence

1. INTRODUCTION

Consider a network of n nodes which collaborate to minimize a cost function f(x_1, x_2, . . ., x_n), subject to separate constraints on x_i, where x_i is a local (vector) variable controlled by node S_i. Each node can perform local computation and exchange messages with a set of predefined neighbors through orthogonal noisy channels. Moreover, we assume f(x_1, x_2, . . ., x_n) has a certain "local structure" in the sense that its partial derivative with respect to x_i only depends on the local variables at node S_i and its neighbors.

A distributed optimization problem of this kind arises naturally in sensor network applications. For example, in the sensor localization problem, we are given the locations of anchor nodes and distance measurements between certain neighbor nodes in the network. The goal is to estimate the locations of all sensors in the network by distributed minimization of a cost function f(x_1, x_2, . . ., x_n) defined by the L1 norm of the distance errors [2]. In this context, x_i is the location of sensor S_i and is to be estimated by S_i. The location x_i may be required to satisfy some local constraints representing a priori information (e.g. range) available at sensor S_i. To minimize f(x_1, . . ., x_n), sensor S_i periodically updates its local variable,
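The "local structure" described above can be made concrete with a small sketch. The cost, function names, and network below are our own illustrative assumptions (not the paper's algorithm): an L1 distance-error cost whose subgradient with respect to node i's position touches only node i and its neighbors.

```python
import numpy as np

def local_gradient(i, positions, neighbors, dist_meas):
    """Subgradient of sum_{(i,j)} | ||x_i - x_j|| - d_ij | w.r.t. node i's position.

    Depends only on x_i and the positions of i's neighbors -- the 'local structure'.
    """
    g = np.zeros_like(positions[i])
    for j in neighbors[i]:
        diff = positions[i] - positions[j]
        r = np.linalg.norm(diff)
        if r > 0:
            # subgradient of |r - d_ij| times the unit direction
            g += np.sign(r - dist_meas[(min(i, j), max(i, j))]) * diff / r
    return g

# Two anchors (nodes 0, 1) and one free node (node 2) with two range measurements.
positions = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0]), 2: np.array([1.0, 1.5])}
neighbors = {2: [0, 1]}
dist_meas = {(0, 2): 1.8, (1, 2): 1.8}
print(local_gradient(2, positions, neighbors, dist_meas))
```

By symmetry of this toy placement, the horizontal components cancel and only a vertical correction remains.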

This work is supported by the USDOD Army, grant number W911NF-05-1-0567.

x_i, and exchanges information with neighbor nodes through orthogonal noisy channels. A special feature of this problem is the fact that the nodes are battery operated and hence energy-constrained. Note that the energy of each node is consumed by various operations including local computation and inter-sensor communication, with the latter being the dominant part. This motivates us to study the minimum amount of communication energy required for distributed optimization.

Energy consumption has not been a consideration of algorithm design in classical distributed optimization [1]. Even recent studies of distributed optimization in the context of sensor networks [6, 3] have mainly focused on convergence issues such as convergence criteria and convergence rate. To the best of our knowledge, the most relevant work to this paper is [5], which studied the minimum number of bits that must be exchanged between two nodes in order to find an ε-minimizer of f. However, unlike our current work, the communication channel is assumed distortion-less in [5], and there was no effort to characterize minimum energy consumption.

The main contributions of this paper are twofold. First, we establish an asymptotic lower bound for the communication energy required to obtain an ε-minimizer of f. Second, we provide specific designs which attain this minimum energy bound. We start with a two node case which is later generalized to the multiple node case. The considered cost function is quadratic and convex. The generalization of this work to a general cost function (e.g. f(x_1, x_2, . . ., x_n) in the stated sensor localization problem) is a subject of ongoing research.

We adopt the following notations: all matrices and vectors will be denoted by upper case characters and bold lower case characters, respectively. For any variable x_i, we use x_i^j(t) to indicate the value of x_i at node S_j and iteration t. An asymptotic lower bound and an asymptotically tight bound will be denoted by Ω and Θ, respectively.

2. ALGORITHM FRAMEWORK

We consider a distributed optimization problem with n nodes, S_i, i ∈ {1, . . ., n}, jointly minimizing a convex quadratic function f(x) = (1/2) x^T A x + b^T x + c, x = [x_1, . . ., x_n]^T, where A ∈ R^{n×n}, A ≻ 0, and b ∈ R^n, c ∈ R. Node S_i has control of the scalar variable x_i and knows A_i and b_i, which are the ith row of A and the ith entry of b, for i ∈ {1, . . ., n}, respectively. Nodes communicate through orthogonal time-invariant noisy channels using analog messaging. The communication channel between nodes S_i and S_j is corrupted by additive noise, n^{j,i}(t), with zero mean and variance σ². In this model, the received signal at node S_i from S_j, r^{j,i}(t), is

r^{j,i}(t) = y^{j}(t) + n^{j,i}(t).   (1)

1-4244-0728-1/07/$20.00 ©2007 IEEE

III - 189    ICASSP 2007


where y^{i}(t) is the signal transmitted by node S_i. Therefore, the communication energy at iteration t is proportional to E[|y^{i}(t)|²], where E is the expectation operator. This paper aims to derive the communication energy required to obtain an ε-minimizer of f(x) in the mean square sense. A point x is an ε-minimizer of f(x) in the mean square sense if E[||x − x*||²] ≤ ε, where x* is the optimum point. In the sequel, we start with a two node case and will extend the results to the multiple node case in Section 3.1.

A distributed algorithm consists of two parts: a communication scheme and a local computation scheme at each node.
A. Communication scheme: After each local update, node S_i, i ∈ {1, 2}, should relay its information to the other node. One way is to directly send the updated value of its local variable, x_i^i(t+1), resulting in a communication power of E[|x_i^i(t+1)|²], which converges to a constant value. An alternative way is to send the incremental value, x_i^i(t+1) − x_i^i(t), in which case the communication power, E[|x_i^i(t+1) − x_i^i(t)|²], would vanish if x_i^i converges. In general, we can consider a linear analog messaging scheme where the transmitted signal, y^{i}(t), is given as

y^{i}(t) = x_i^i(t) − γ(t) x_i^i(t−1),   i = 1, 2.   (2)

In this equation, γ(t) is a positive coefficient to be chosen (γ(1) = 0). Therefore, the total communication energy of S_i is

E_com(T) = Σ_{t=1}^{T} E[ |x_i^i(t) − γ(t) x_i^i(t−1)|² ].   (3)
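The qualitative behavior of the three messaging choices (direct value, increment, and the linear scheme with γ(t) → 1) can be compared numerically. The trajectory x(t) → 5 and every name below are our own illustrative assumptions, not the paper's data:

```python
import numpy as np

# A hypothetical converging local trajectory x(t) -> 5.
T = 2000
x = 5.0 + 2.0 * 0.99 ** np.arange(T)

direct = x[1:] ** 2                           # send x(t+1): energy tends to (x*)^2
incremental = (x[1:] - x[:-1]) ** 2           # send x(t+1) - x(t): energy vanishes
gamma = np.exp(-np.arange(1, T) ** (-1 / 3.0))  # linear scheme, gamma(t) -> 1
linear = (x[1:] - gamma * x[:-1]) ** 2        # equation (2): energy shrinks as gamma -> 1

# Per-iteration transmit energies near the end of the run:
print(direct[-1], incremental[-1], linear[-1])
```

This illustrates the trade-off noted above: the direct scheme pays a constant price per iteration, the incremental scheme is cheapest, and the linear scheme interpolates, with its cost governed by how fast γ(t) approaches 1.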

Equation (3) shows that in order to reduce the communication energy, γ(t) should converge to 1. Note that the choice of γ(t) must also ensure the convergence of the distributed algorithm.
B. Local computation scheme: Optimization in the presence of noise can be performed based on a stochastic gradient type algorithm, the so-called Robbins-Monro algorithm [4]. One iteration of this algorithm is given as

x(t+1) = x(t) − a(t) g(x(t)),   t = 1, 2, . . .,

where a(t) is a vanishing positive step size satisfying Σ_{t=1}^{∞} a(t) = ∞ and Σ_{t=1}^{∞} (a(t))² < ∞, and g(x) is a noisy version of the gradient vector of f(x). We consider a distributed implementation of the stochastic gradient type algorithm whereby S_i, i ∈ {1, 2}, tracks the other node's variable according to
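A minimal Robbins-Monro iteration on a scalar quadratic illustrates these step-size conditions; a(t) = 1/(t+1) satisfies both. The objective and noise model below are our own toy choices, not the paper's two-node scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimize f(x) = 0.5*(x - 3)^2 given only noisy gradients g(x) = (x - 3) + noise.
x = 10.0
for t in range(1, 20001):
    a_t = 1.0 / (t + 1)                  # sum a(t) diverges, sum a(t)^2 converges
    g = (x - 3.0) + rng.normal(scale=1.0)
    x -= a_t * g

print(x)  # approaches the minimizer x* = 3
```

The divergent step-size sum lets the iterate travel arbitrarily far from its starting point, while the square-summable condition forces the injected noise to average out.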

x̂_j^i(t+1) = γ(t) x̂_j^i(t) + r^{j,i}(t).   (4)

Here r^{j,i}(t) is the received message from node S_j at iteration t, as defined by (1)-(2). Node S_i uses this noisy copy to estimate g_i(x), the partial derivative of f(x) with respect to its local variable x_i, and updates x_i as

at node S_1: x_1^1(t+1) = x_1^1(t) − (1/(t+1)) g_1(x_1^1(t), x̂_2^1(t+1)),
at node S_2: x_2^2(t+1) = x_2^2(t) − (1/(t+1)) g_2(x̂_1^2(t+1), x_2^2(t)).   (5)

Here, we have chosen a(t) = 1/(t+1). In the next section, we derive a convergence condition on γ(t) which will be used to bound the total communication energy.

3. A CONVERGENCE CONDITION AND REQUIRED COMMUNICATION ENERGY

Theorem 1

(a) The distributed algorithm described by (2), (4), (5) converges to the global minimum of f(x) in the mean square sense when γ(k) satisfies

lim_{t→∞} Σ_{i=0}^{t} (α(i, t))² = 0,   (6)

where α(j, t) := Σ_{i=j}^{t} ((i+1)/(t+1))^{λ_min} (1/(i+1)) Π_{k=j+1}^{i} γ(k), λ_min is the smallest eigenvalue of the matrix A, and Π_{k=i}^{j} γ(k) := 1 for i > j.

(b) To obtain an ε-minimizer of f(x) in the mean square sense, the required communication energy must be at least Ω(ε⁻¹).

Proof (a) We start with the iterative update in equation (5) and obtain a tight bound for the mean square error in terms of the initial values of the local variables and the channel noise variance. Then, we show that this bound converges to zero under condition (6). We will need the following lemma, which is stated here without proof.

Lemma 1 For any positive, decreasing, integrable function h(x), there holds

∫_{i}^{t+1} h(x) dx ≤ Σ_{j=i}^{t} h(j) ≤ h(i) + ∫_{i}^{t} h(x) dx.

Define the error at the (t+1)th iteration as e(t+1) := x(t+1) − x*, where x(t+1) := [x_1^1(t+1), x_2^2(t+1)]^T is the vector of local variables at the respective nodes at the (t+1)th iteration. Using equations (1)-(2) and (4)-(5), we can show

e(t+1) = (I − (1/(t+1)) A) e(t) + (1/(t+1)) v(t).

In this equation, the accumulated channel noise, v(t), is given as

v(t) = − [ a_{1,2} Σ_{i=1}^{t} ( Π_{k=i+1}^{t} γ(k) ) n^{2,1}(i) ;
           a_{1,2} Σ_{i=1}^{t} ( Π_{k=i+1}^{t} γ(k) ) n^{1,2}(i) ],

where a_{i,j} is the (i, j)th entry of matrix A, and we have used the fact that matrix A is symmetric. Consider the eigenvalue decomposition of matrix A:

A = P [λ_1, 0; 0, λ_2] P^T =: P Λ P^T,

where λ_i, i ∈ {1, 2}, are the eigenvalues of matrix A. For ē(t) := P^T e(t) and v̄(t) := P^T v(t), the lth component of the error at the (t+1)th iteration is

ē_l(t+1) = Π_{i=1}^{t} (1 − λ_l/(i+1)) ē_l(1) + Σ_{i=1}^{t} (1/(i+1)) Π_{j=i+1}^{t} (1 − λ_l/(j+1)) v̄_l(i).   (7)

For h(x) = −log(1 − λ_l/x) and 2λ_l < i ≤ t, it follows from Lemma 1 that

Π_{j=i+1}^{t} (1 − λ_l/(j+1)) ≤ ((i+1)/(t+1))^{λ_l}.   (8)

The mean square error at the (t+1)th iteration can then be written as

E[ ||e(t+1)||² ] ≤ Σ_{l=1}^{2} (1/(t+1))^{2λ_l} ē_l(1)² + 2 (a_{1,2})² σ² Σ_{j=1}^{t} (α(j, t))²,   (9)

where σ² is the variance of the channel noise. In this derivation, the products in (7) are bounded using equation (8), and the factor of two follows from the fact that v̄_1(i) and v̄_2(i), the components of vector v̄(i), are uncorrelated random variables with identical distribution. Equation (9) completes the proof of part (a).
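As an empirical sanity check on part (a), the scheme of equations (2), (4) and (5) can be simulated directly. The problem data A and b, the noise level, and the choice γ(t) = exp(−t^{−1/3}) below are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed two-node problem data (illustrative, not the paper's example).
A = np.array([[2.0, 1.0], [1.0, 2.0]])    # symmetric positive definite
b = np.array([1.0, -1.0])
x_star = -np.linalg.solve(A, b)           # minimizer of 0.5 x'Ax + b'x

sigma = 1.0                               # channel noise std
T = 5000
gamma = lambda t: np.exp(-t ** (-1 / 3.0)) if t > 1 else 0.0  # gamma(1) = 0

x1, x2 = 10.0, -10.0                      # local variables at S1, S2
h2, h1 = 0.0, 0.0                         # S1's tracked copy of x2, S2's copy of x1
p1, p2 = 0.0, 0.0                         # previously transmitted local values
e0 = (x1 - x_star[0]) ** 2 + (x2 - x_star[1]) ** 2
for t in range(1, T + 1):
    g_t = gamma(t)
    # Equation (2): linear analog messages, corrupted by channel noise as in (1).
    r12 = (x1 - g_t * p1) + rng.normal(scale=sigma)   # received at S2 from S1
    r21 = (x2 - g_t * p2) + rng.normal(scale=sigma)   # received at S1 from S2
    p1, p2 = x1, x2
    # Equation (4): track the other node's variable.
    h2 = g_t * h2 + r21
    h1 = g_t * h1 + r12
    # Equation (5): stochastic gradient step with a(t) = 1/(t+1).
    a_t = 1.0 / (t + 1)
    x1 -= a_t * (A[0, 0] * x1 + A[0, 1] * h2 + b[0])
    x2 -= a_t * (A[1, 0] * h1 + A[1, 1] * x2 + b[1])

e_T = (x1 - x_star[0]) ** 2 + (x2 - x_star[1]) ** 2
print(e0, e_T)  # squared error shrinks substantially despite channel noise
```

Each node only ever sees its own variable and a noisy tracked copy of the other's, matching the communication model above.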

Proof (b) Similar to equation (7), the communication energy can be written in terms of the initial values of the local variables, the accumulated channel noise, and the optimum point, where the latter is the dominant part. Therefore, the total communication energy for nodes S_1 and S_2 is

E_com(T) = Σ_{t=1}^{T} E[ ||x(t) − γ(t) x(t−1)||² ]
         ≥ Σ_{t=1}^{T} Σ_{l=1}^{2} (b̄_l)² ( c_l(t) − γ(t) c_l(t−1) )²,

where c_l(t) := Σ_{i=1}^{t−2} (1/(i+1)) Π_{j=i+1}^{t−2} (1 − λ_l/(j+1)) is the deterministic part of the lth rotated coordinate driven by b̄_l, and b̄ = [b̄_1, b̄_2]^T := P^T b. In this equation, the inner summation can be bounded using equation (8):

c_l(t) ≥ C_2 (t−1)^{−λ_l} Σ_{i=1}^{t−2} (i+1)^{λ_l − 1} = Θ(1),

where C_2 is a positive constant. Therefore, we obtain

E_com(T) ≥ Σ_{t=1}^{T} Σ_{l=1}^{2} (B_l)² (1 − γ(t) − D_l/t)²,   (10)

where D_l and B_l, l ∈ {1, 2}, are positive constants. In part (a), we proved that under condition (6), we have

∀ ε > 0, ∃ t_ε : E[ ||e(t_ε + 1)||² ] ≤ ε.   (11)

To prove part (b), it is enough to show that, for some constant C_3,

ε E_com(t_ε) ≥ E[ ||e(t_ε + 1)||² ] E_com(t_ε) ≥ C_3 > 0.

Using the Cauchy-Schwarz inequality and equations (9) and (11), we obtain

Σ_{i=1}^{t_ε} (1/(i+1)) α(i, t_ε) ≤ ( Σ_{i=1}^{t_ε} (1/(i+1))² )^{1/2} ( Σ_{i=1}^{t_ε} (α(i, t_ε))² )^{1/2} ≤ C_4 √ε,   (12)

where C_4 is a positive constant. It follows from equations (9) and (10) that

E_com(t_ε) E[ ||e(t_ε + 1)||² ]
  (a)≥ Σ_{l=1}^{2} (B_l)² ( Σ_{i=1}^{t_ε} (1 − γ(i) − D_l/(i+1)) α(i, t_ε) )²
  (b)≥ Σ_{l=1}^{2} (B_l)² ( Σ_{i=1}^{t_ε} (1 − γ(i)) α(i, t_ε) − D_l C_4 √ε )²
  (c)≥ Σ_{l=1}^{2} (B_l)² ( Θ(1) − D_l C_4 √ε )²
  (d)≥ C_3.

In this equation, step (a) follows from the Cauchy-Schwarz inequality. Step (b) is due to equation (12) and the definition of α(i, t_ε). Step (c) is the result of changing the order of summation, and step (d) holds for small enough ε.

3.1. Extension to the Multiple Node Case

Under some additional assumptions, Theorem 1 holds in the multiple node case. Assume that there exists a communication link between nodes S_i and S_j whenever a_{i,j} ≠ 0. Furthermore, assume that the node operations are synchronized. Then, the same proof of Section 3 applies here with minor changes in part (a), where the accumulated channel noise becomes

[v(t)]_l = − Σ_{m=1, m≠l}^{n} a_{l,m} Σ_{i=1}^{t} ( Π_{k=i+1}^{t} γ(k) ) n^{m,l}(i),   l ∈ {1, . . ., n}.

Moreover, the same derivation of equation (9) applies and the mean square error can be written as

E[ ||e(t+1)||² ] ≤ Σ_{l=1}^{n} (1/(t+1))^{2λ_l} ē_l(1)² + Σ_{l=1}^{n} Σ_{k=1, k≠l}^{n} (a_{k,l})² σ² Σ_{j=1}^{t} (α(j, t))².

4. OPTIMUM ENERGY DESIGNS

In this section, we prove that γ*(t) := exp(−t^{−q}), 0 < q < 0.5, attains the minimum energy bound. For brevity, we omit the details of the derivations here. It follows from the definition of α(i, t) and Lemma 1 that

α(j, t) = Σ_{i=j}^{t} ((i+1)/(t+1))^{λ_min} (1/(i+1)) exp( − Σ_{k=j+1}^{i} k^{−q} )
        ≤ Σ_{i=j}^{t} ((i+1)/(t+1))^{λ_min} (1/(i+1)) exp( − ∫_{j+1}^{i+1} x^{−q} dx ),

where the third step in the resulting chain of bounds is due to Hölder's inequality. Since q is less than 0.5, the convergence condition in Theorem 1(a) is satisfied. Moreover, the mean square error decreases asymptotically at the rate of t^{2q−1} (equation (9)). Similar to equations (7) and (9), we can show that the upper bound on the communication energy grows at the rate of t^{1−2q} for γ*(t). This result, along with Theorem 1(b), proves that the communication energy required to compute an ε-minimizer of f(x) in the mean square sense grows at the rate of Θ(ε⁻¹).
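The claimed growth rates for γ*(t) = exp(−t^{−q}) can be checked numerically. The snippet below is our own verification, not from the paper: since 1 − exp(−u) ≈ u for small u, the per-iteration energy factor (1 − γ*(t))² behaves like t^{−2q}, so the cumulative energy grows like T^{1−2q}:

```python
import numpy as np

q = 1 / 3.0
t = np.arange(1, 100001, dtype=float)
factor = (1.0 - np.exp(-t ** (-q))) ** 2   # ~ t^(-2q), since 1 - exp(-u) ~ u
cum_energy = np.cumsum(factor)             # ~ T^(1-2q) up to constants

# Empirical growth exponent over the last decade on a log-log scale:
T1, T2 = 10000, 100000
slope = np.log(cum_energy[T2 - 1] / cum_energy[T1 - 1]) / np.log(T2 / T1)
print(slope)  # tends to 1 - 2q = 1/3 for large T (approached slowly from above)
```

Combined with the mean square error rate t^{2q−1}, this is exactly the Θ(ε⁻¹) energy-accuracy trade-off: energy ~ t^{1−2q} = 1/t^{2q−1} ~ 1/ε.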


[Figure: mean square error versus iteration for γ(t) = exp(−t^{−1/3}) and γ(t) = exp(−t^{−1/10}).]

Fig. 1. Mean square error in x_1

[Figure: average communication energy versus iteration for γ(t) = exp(−t^{−1/3}) and γ(t) = exp(−t^{−1/10}).]

Fig. 2. The average of the communication energy

5. SIMULATION RESULTS

To illustrate the concept, a simple two-dimensional example is considered. The cost function is a quadratic convex function with a minimum point at x* = [−10, 10]^T; the channel noise is additive white Gaussian noise with variance one. We consider γ(t) = exp(−t^{−q}) for q ∈ {1/3, 1/10} in the simulations. Figure 1 shows the mean square error in x_1 averaged over 20 runs. This figure confirms that the mean square error decreases at the rate of t^{2q−1} once the accumulated channel noise becomes the dominant source of error. Therefore, a higher q results in a lower convergence rate as well as a slower increase of communication energy (Figure 2). Furthermore, the mean square error decreases linearly with the communication energy (Figure 3). In other words, the communication energy required to compute an ε-minimizer of f(x) in the mean square sense grows at the rate of Θ(ε⁻¹). These results agree with the theoretical analysis presented in previous sections.

[Figure: mean square error in x_1 versus the average communication energy, on a log-log scale.]

Fig. 3. Mean square error in x_1 versus the average of the communication energy

6. CONCLUSION AND FUTURE WORK

We studied the problem of distributed optimization of a convex quadratic cost function in an energy-constrained network. We considered a class of distributed stochastic gradient type algorithms implemented using certain linear analog messaging schemes. It is shown that the communication energy to obtain an ε-minimizer of a cost function in the mean square sense must grow at least at the rate of Ω(ε⁻¹). We derived specific designs which attain the minimum energy bound, and confirmed our theoretical analysis by numerical simulations. The generalization of this work to a general non-quadratic cost function is possible and is a subject of ongoing research.

Acknowledgement: The authors gratefully acknowledge the assistance of Mikalai Kisialiou in the preparation of this manuscript.

7. REFERENCES

[1] D. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods. Englewood Cliffs, NJ: Prentice-Hall, 1989.

[2] M. W. Carter, H. H. Jin, M. A. Saunders, and Y. Ye, "SpaseLoc: An adaptive subproblem algorithm for scalable wireless sensor network localization," SIAM Journal on Optimization, accepted March 2006.

[3] A. Olshevsky and J. N. Tsitsiklis, "Convergence speed in distributed consensus and averaging," in 45th IEEE Conference on Decision and Control, San Diego, CA, USA, December 2006.

[4] H. Robbins and S. Monro, "A stochastic approximation method," Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407, 1951.

[5] J. N. Tsitsiklis and Z. Luo, "Communication complexity of convex optimization," Journal of Complexity, vol. 3, no. 3, pp. 231-243, 1987.

[6] L. Xiao and S. Boyd, "Optimal scaling of a gradient method for distributed resource allocation," to appear in Journal of Optimization Theory and Applications, vol. 129, no. 3, pp. 469-488, 2006.
