


arXiv:1309.6683v2 [cs.SI] 27 Sep 2013

Dynamic Structural Equation Models for Social Network Topology Inference†

Brian Baingana, Student Member, IEEE, Gonzalo Mateos, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE∗

Abstract—Many real-world processes evolve in cascades over complex networks, whose topologies are often unobservable and change over time. However, the so-termed adoption times when blogs mention popular news items, individuals in a community catch an infectious disease, or consumers adopt a trendy electronics product are typically known, and are implicitly dependent on the underlying network. To infer the network topology, a dynamic structural equation model is adopted to capture the relationship between observed adoption times and the unknown edge weights. Assuming a slowly time-varying topology and leveraging the sparse connectivity inherent to social networks, edge weights are estimated by minimizing a sparsity-regularized exponentially-weighted least-squares criterion. To this end, solvers with complementary strengths are developed by leveraging (pseudo) real-time sparsity-promoting proximal gradient iterations, the improved convergence rate of accelerated variants, or the reduced computational complexity of stochastic gradient descent. Numerical tests with both synthetic and real data demonstrate the effectiveness of the novel algorithms in unveiling sparse dynamically-evolving topologies, while accounting for external influences in the adoption times. Key events in the recent succession of political leadership in North Korea explain connectivity changes observed in the associated network inferred from global cascades of online media.

Index Terms—Structural equation model, dynamic network, social network, contagion, sparsity.

I. INTRODUCTION

Networks arising in natural and man-made settings provide the backbone for the propagation of contagions such as the spread of popular news stories, the adoption of buying trends among consumers, and the spread of infectious diseases [11], [36]. For example, a terrorist attack may be reported within minutes on mainstream news websites. An information cascade emerges because these websites' readership typically includes bloggers who write about the attack as well, influencing their own readers in turn to do the same. Although the times when "nodes" get infected are often observable, the underlying network topologies over which cascades propagate are typically unknown and dynamic. Knowledge of the topology plays a crucial role for several reasons, e.g., when social media advertisers select a small set of initiators so that an online campaign can go viral, or when healthcare initiatives wish to

† Work in this paper was supported by the NSF ECCS Grants No. 1202135 and No. 1343248, and the NSF AST Grant No. 1247885. Parts of the paper will appear in the Proc. of the 5th Intl. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Saint Martin, December 15-18, 2013.

∗ The authors are with the Dept. of ECE and the Digital Technology Center, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612)626-7781/625-4583; Emails: {baing011,mate0058,georgios}@umn.edu

infer hidden needle-sharing networks of injecting drug users. As a general principle, network structural information can be used to predict the behavior of complex systems [16], such as the evolution and spread of information pathways in online media underlying e.g., major social movements and uprisings due to political conflicts [35].

Inference of networks using temporal traces of infection events has recently become an active area of research. According to the taxonomy in [16, Ch. 7], this can be viewed as a problem involving inference of association networks. Two other broad classes of network topology identification problems entail (individual) link prediction, or tomographic inference. Several prior approaches postulate probabilistic models and rely on maximum likelihood estimation (MLE) to infer edge weights as pairwise transmission rates between nodes [34], [27]. However, these methods assume that the network does not change over time. A dynamic algorithm has recently been proposed to infer time-varying diffusion networks by solving an MLE problem via stochastic gradient descent iterations [35]. Although successful experiments on large-scale web data reliably uncover information pathways, the estimator in [35] does not explicitly account for the edge sparsity prevalent in social and information networks. Moreover, most prior approaches attribute node infection events only to the network topology, and do not account for the influence of external sources such as a ground crew for a mainstream media website.

The propagation of a contagion is tantamount to causal effects or interactions being exerted among entities such as news portals and blogs, consumers, or people susceptible to being infected with a contagious disease. Acknowledging this viewpoint, structural equation models (SEMs) provide a general statistical modeling technique to estimate causal relationships among traits; see e.g., [15], [32]. These directional effects are often not revealed by standard linear models that leverage symmetric associations between random variables, such as those represented by covariances or correlations [26], [12], [17], [2]. SEMs are attractive because of their simplicity and ability to capture edge directionalities. They have been widely adopted in many fields, such as economics, psychometrics [28], social sciences [13], and genetics [6], [22]. In particular, SEMs have recently been proposed for static gene regulatory network inference from gene expression data; see e.g., [6], [23] and references therein. However, SEMs have not been utilized to track the dynamics of causal effects among interacting nodes, or to infer the topology of time-varying directed networks.

In this context, the present paper proposes a dynamic SEM to account for directed networks over which contagions propagate, and describes how node infection times depend on


both topological (causal) and external influences. Topological influences are modeled in Section II as linear combinations of infection times of other nodes in the network, whose weights correspond to entries in the time-varying asymmetric adjacency matrix. Accounting for external influences is well motivated by drawing upon examples from online media, where established news websites depend more on on-site reporting than blog references. External influence data is also useful for model identifiability, since it has been shown necessary to resolve directional ambiguities [4]. Supposing the network varies slowly with time, parameters in the proposed dynamic SEM are estimated adaptively by minimizing a sparsity-promoting exponentially-weighted least-squares (LS) criterion (Section III-A). To account for the inherently sparse connectivity of social networks, an $\ell_1$-norm regularization term that promotes sparsity on the entries of the network adjacency matrix is incorporated in the cost function; see also [1], [2], [7], [18].

A novel algorithm to jointly track the network's adjacency matrix and the weights capturing the level of external influences is developed in Section III-B, which minimizes the resulting non-differentiable cost function via a proximal-gradient (PG) solver; see e.g., [5], [10], [31]. The resulting dynamic iterative shrinkage-thresholding algorithm (ISTA) is provably convergent, and offers parallel, closed-form, and sparsity-promoting updates per iteration. Proximal-splitting algorithms such as ISTA have been successfully adopted for various signal processing tasks [9], and for parallel optimization [8]. Further algorithmic improvements are outlined in Section IV. These include enhancing the algorithms' rate of convergence through Nesterov's acceleration techniques [5], [29], [30] (Section IV-A), and also adapting them for real-time operation (Section IV-B). When minimal computational complexity is at a premium, a stochastic gradient descent (SGD) algorithm is developed in Section IV-C, which adaptively minimizes an instantaneous (noisy) approximation of the ensemble LS cost. Throughout, insightful and useful extensions to the proposed algorithms that are not fully developed due to space limitations are highlighted as remarks.

Numerical tests on synthetic network data demonstrate the superior error performance of the developed algorithms, and highlight their merits when compared to the sparsity-agnostic approach in [35] (Section V-A). Experiments in Section V-B involve real temporal traces of popular global events that propagated on news websites and blogs in 2011 [21]. Interestingly, topologies inferred from cascades associated with the meme "Kim Jong-un" exhibit an abrupt increase in the number of edges following the appointment of the new North Korean ruler.

Notation. Bold uppercase (lowercase) letters will denote matrices (column vectors), while operators $(\cdot)^\top$, $\lambda_{\max}(\cdot)$, and $\mathrm{diag}(\cdot)$ will stand for matrix transposition, maximum eigenvalue, and diagonal matrix, respectively. The $N \times N$ identity matrix will be represented by $\mathbf{I}_N$, while $\mathbf{0}_N$ will denote the $N \times 1$ vector of all zeros, and $\mathbf{0}_{N \times P} := \mathbf{0}_N \mathbf{0}_P^\top$. The $\ell_p$ and Frobenius norms will be denoted by $\|\cdot\|_p$ and $\|\cdot\|_F$, respectively.

Fig. 1. Dynamic network observed across several time intervals. Note that few edges are added/removed in the transition from $t = 1$ to $t = 2$ (slowly time-varying network), and edges are depicted as undirected here for convenience.

II. NETWORK MODEL AND PROBLEM STATEMENT

Consider a dynamic network with $N$ nodes observed over time intervals $t = 1, \ldots, T$, whose abstraction is a graph with topology described by an unknown, time-varying, and weighted adjacency matrix $\mathbf{A}^t \in \mathbb{R}^{N \times N}$. Entry $(i,j)$ of $\mathbf{A}^t$ (henceforth denoted by $a_{ij}^t$) is nonzero only if a directed edge connects nodes $i$ and $j$ (pointing from $j$ to $i$) during the time interval $t$, as illustrated by the 8-node network in Fig. 1. As a result, one in general has $a_{ij}^t \neq a_{ji}^t$, i.e., matrix $\mathbf{A}^t$ is generally non-symmetric, which is suitable to model directed networks. For instance, if $i$ denotes a news blog maintained by a journalism student, whereas $j$ represents the web portal of a mainstream newspaper, then it is likely that $a_{ij}^t \gg a_{ji}^t \approx 0$ for those $t$ where $a_{ij}^t \neq 0$. Probably, the aforementioned directionality would have been reversed during Nov.-Dec. 2010 if $i$ instead represents the Wikileaks blog. Note that the model tacitly assumes that the network topology remains fixed during any given time interval $t$, but can change across time intervals.

Suppose $C$ contagions propagate over the network, and the difference between the infection time of node $i$ by contagion $c$ and the earliest observation time is denoted by $y_{ic}^t$. In online media, $y_{ic}^t$ can be obtained by recording the time when website $i$ mentions news item $c$. For uninfected nodes at slot $t$, $y_{ic}^t$ is set to an arbitrarily large number. Assume that the susceptibility $x_{ic}$ of node $i$ to external (non-topological) infection by contagion $c$ is known and time-invariant over the observation interval. In the web context, $x_{ic}$ can be set to the search engine rank of website $i$ with respect to (w.r.t.) keywords associated with $c$.

The infection time of node $i$ during interval $t$ is modeled according to the following dynamic structural equation model (SEM)

$$y_{ic}^t = \sum_{j \neq i} a_{ij}^t y_{jc}^t + b_{ii}^t x_{ic} + e_{ic}^t \qquad (1)$$

where $b_{ii}^t$ captures the time-varying level of influence of external sources, and $e_{ic}^t$ accounts for measurement errors and unmodeled dynamics. It follows from (1) that if $a_{ij}^t \neq 0$, then $y_{ic}^t$ is affected by the value of $y_{jc}^t$. Rewriting (1) for the entire network leads to the vector model

$$\mathbf{y}_c^t = \mathbf{A}^t \mathbf{y}_c^t + \mathbf{B}^t \mathbf{x}_c + \mathbf{e}_c^t \qquad (2)$$

where the $N \times 1$ vector $\mathbf{y}_c^t := [y_{1c}^t, \ldots, y_{Nc}^t]^\top$ collects the node infection times by contagion $c$ during interval $t$, and $\mathbf{B}^t := \mathrm{diag}(b_{11}^t, \ldots, b_{NN}^t)$. Similarly, $\mathbf{x}_c := [x_{1c}, \ldots, x_{Nc}]^\top$ and $\mathbf{e}_c^t := [e_{1c}^t, \ldots, e_{Nc}^t]^\top$. Collecting observations for all $C$


contagions yields the dynamic matrix SEM

$$\mathbf{Y}^t = \mathbf{A}^t \mathbf{Y}^t + \mathbf{B}^t \mathbf{X} + \mathbf{E}^t \qquad (3)$$

where $\mathbf{Y}^t := [\mathbf{y}_1^t, \ldots, \mathbf{y}_C^t]$, $\mathbf{X} := [\mathbf{x}_1, \ldots, \mathbf{x}_C]$, and $\mathbf{E}^t := [\mathbf{e}_1^t, \ldots, \mathbf{e}_C^t]$ are all $N \times C$ matrices. Note that the same network topology $\mathbf{A}^t$ is adopted for all contagions, which is suitable e.g., when different information cascades are formed around a common meme or trending (news) topic on the Internet; see also the real data tests in Section V-B.

Given $\{\mathbf{Y}^t\}_{t=1}^T$ and $\mathbf{X}$, the goal is to track the underlying network topology $\{\mathbf{A}^t\}_{t=1}^T$ and the effect of external influences $\{\mathbf{B}^t\}_{t=1}^T$. To this end, the novel algorithm developed in the next section assumes slow time variation of the network topology and leverages the inherent sparsity of edges that is typical of social networks.
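To make the data model concrete, here is a minimal simulation sketch of one interval of (3): given $\mathbf{A}^t$, $\mathbf{B}^t$, and $\mathbf{X}$, the infection times follow by solving the linear system $\mathbf{Y}^t = (\mathbf{I}_N - \mathbf{A}^t)^{-1}(\mathbf{B}^t\mathbf{X} + \mathbf{E}^t)$, the same generative form used for the synthetic tests in Section V-A. All variable and function names below are illustrative, not from the paper.

```python
import numpy as np

def simulate_sem_interval(A_t, B_t, X, noise_std=1.0, rng=None):
    """Draw infection times from the dynamic SEM (3):
    Y^t = A^t Y^t + B^t X + E^t  <=>  Y^t = (I - A^t)^{-1} (B^t X + E^t)."""
    rng = rng if rng is not None else np.random.default_rng()
    N, C = X.shape
    E_t = noise_std * rng.standard_normal((N, C))  # measurement errors
    return np.linalg.solve(np.eye(N) - A_t, B_t @ X + E_t)

# Toy example: N = 4 nodes, C = 2 contagions, hollow non-symmetric A^t
rng = np.random.default_rng(0)
A_t = np.array([[0.0, 0.3, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.0],
                [0.2, 0.0, 0.0, 0.0],
                [0.0, 0.0, 0.4, 0.0]])
B_t = np.diag(rng.uniform(0.0, 1.0, 4))   # diagonal external-influence weights
X = rng.uniform(0.0, 3.0, (4, 2))         # known susceptibilities
Y_t = simulate_sem_interval(A_t, B_t, X, rng=rng)
```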

III. TOPOLOGY TRACKING ALGORITHM

This section deals with a regularized LS approach to estimating $\{\mathbf{A}^t, \mathbf{B}^t\}$ in (3). In a static setting with all measurements $\{\mathbf{Y}^t\}_{t=1}^T$ available, one solves the batch problem

$$\{\hat{\mathbf{A}}, \hat{\mathbf{B}}\} = \arg\min_{\mathbf{A},\mathbf{B}} \ \frac{1}{2}\sum_{t=1}^{T} \|\mathbf{Y}^t - \mathbf{A}\mathbf{Y}^t - \mathbf{B}\mathbf{X}\|_F^2 + \lambda\|\mathbf{A}\|_1$$
$$\text{s. to } a_{ii} = 0, \ b_{ij} = 0, \ \forall i \neq j \qquad (4)$$

where $\|\mathbf{A}\|_1 := \sum_{i,j} |a_{ij}|$ is a sparsity-promoting regularization, and $\lambda > 0$ controls the sparsity level of $\hat{\mathbf{A}}$. Absence of a self-loop at node $i$ is enforced by the constraint $a_{ii} = 0$, while having $b_{ij} = 0$, $\forall i \neq j$, ensures that $\mathbf{B}$ is diagonal as in (2).

Remark 1 (MLE versus LS): If the errors $e_{ic}^t \sim \mathcal{N}(0, \sigma^2)$ in (1) are modeled as independent and identically distributed (i.i.d.) Gaussian random variables, the sparsity-agnostic MLEs of the SEM parameters are obtained by solving

$$\min_{\mathbf{A},\mathbf{B}} \ \sum_{t=1}^{T} \left[ \frac{1}{2}\|\mathbf{Y}^t - \mathbf{A}\mathbf{Y}^t - \mathbf{B}\mathbf{X}\|_F^2 - C\sigma^2 \log\left|\det(\mathbf{I} - \mathbf{A})\right| \right] \qquad (5)$$

subject to the constraints in (4) [6]. Different from linear regression models, LS is not maximum likelihood (ML) when it comes to Gaussian SEMs. Sparsity can be accounted for in the ML formulation through $\ell_1$-norm regularization. Here, the LS approach is adopted because of its universal applicability beyond Gaussian models, and because MLE of SEM parameters gives rise to non-convex criteria [cf. (5)].
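To see where the log-determinant in (5) comes from, note that (2) expresses the errors as $\mathbf{e}_c^t = (\mathbf{I}-\mathbf{A})\mathbf{y}_c^t - \mathbf{B}\mathbf{x}_c$, so the change of variables from errors to observations contributes a Jacobian factor $|\det(\mathbf{I}-\mathbf{A})|$ per contagion and interval. A sketch of the resulting negative log-likelihood, under the i.i.d. Gaussian assumption of Remark 1 and with constants dropped:

```latex
% Density of y_c^t obtained from e_c^t ~ N(0, sigma^2 I) by change of variables:
%   p(y_c^t) = |det(I-A)| (2*pi*sigma^2)^{-N/2} exp(-||(I-A)y_c^t - B x_c||^2/(2 sigma^2))
% Summing -log p over c = 1,...,C and t = 1,...,T, and scaling by sigma^2:
\sigma^2 \sum_{t=1}^{T}\sum_{c=1}^{C} -\log p(\mathbf{y}_c^t)
 = \sum_{t=1}^{T}\Big[\tfrac{1}{2}\,\|\mathbf{Y}^t-\mathbf{A}\mathbf{Y}^t-\mathbf{B}\mathbf{X}\|_F^2
   - C\sigma^2 \log\big|\det(\mathbf{I}-\mathbf{A})\big|\Big] + \mathrm{const.}
```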

A. Exponentially-weighted LS estimator

In practice, measurements are typically acquired in a sequential manner, and the sheer scale of social networks calls for estimation algorithms with minimal storage requirements. Recursive solvers enabling sequential inference of the underlying network topology are thus preferred. Moreover, introducing a "forgetting factor" that assigns more weight to the most recent residuals makes it possible to track slow temporal variations of the topology. Note that the batch estimator (4) yields single estimates $\{\hat{\mathbf{A}}, \hat{\mathbf{B}}\}$ that best fit the data $\{\mathbf{Y}^t\}_{t=1}^T$ and $\mathbf{X}$ over the whole measurement horizon $t = 1, \ldots, T$, and as such (4) neglects potential network variations across time intervals.

For $t = 1, \ldots, T$, the sparsity-regularized exponentially-weighted LS estimator (EWLSE) is

$$\{\hat{\mathbf{A}}^t, \hat{\mathbf{B}}^t\} = \arg\min_{\mathbf{A},\mathbf{B}} \ \frac{1}{2}\sum_{\tau=1}^{t} \beta^{t-\tau}\|\mathbf{Y}^\tau - \mathbf{A}\mathbf{Y}^\tau - \mathbf{B}\mathbf{X}\|_F^2 + \lambda_t\|\mathbf{A}\|_1$$
$$\text{s. to } a_{ii} = 0, \ b_{ij} = 0, \ \forall i \neq j \qquad (6)$$

where $\beta \in (0, 1]$ is the forgetting factor that forms estimates $\{\hat{\mathbf{A}}^t, \hat{\mathbf{B}}^t\}$ using all measurements acquired until time $t$. Whenever $\beta < 1$, past data are exponentially discarded, thus enabling tracking of dynamic network topologies. The first summand in the cost corresponds to an exponentially-weighted moving average (EWMA) of the squared model residual norms. The EWMA can be seen as an average modulated by a sliding window of equivalent length $1/(1-\beta)$, which clearly grows as $\beta \to 1$. In the so-termed infinite-memory setting whereby $\beta = 1$, (6) boils down to the batch estimator (4). Notice that $\lambda_t$ is allowed to vary with time in order to capture the generally changing edge sparsity level. In a linear regression context, a related EWLSE was put forth in [1] for adaptive estimation of sparse signals; see also [18] for a projection-based adaptive algorithm.

Before moving on to algorithms, a couple of remarks are in order.

Remark 2 (Modeling slow network variations via sparsity): To explicitly model slow topological variations across time intervals, a viable approach is to include an additional regularization term $\mu_t\|\mathbf{A} - \hat{\mathbf{A}}^{t-1}\|_1$ in the cost of (6). This way, the estimator penalizes deviations of the current topology estimate relative to its immediate predecessor $\hat{\mathbf{A}}^{t-1}$. Through the tuning parameter $\mu_t$, one can adjust how smooth the admissible topology variations are from interval to interval. With a similar goal, but enforcing temporal smoothness via kernels with adjustable bandwidth, an $\ell_1$-norm-regularized logistic regression approach was put forth in [17].

Remark 3 (Selection of $\lambda_t$): Selection of the (possibly time-varying) tuning parameter $\lambda_t$ is an important aspect of regularization methods such as (6), because $\lambda_t$ controls the sparsity level of the inferred network and how its structure may change over time. For sufficiently large values of $\lambda_t$ one obtains the trivial solution $\hat{\mathbf{A}}^t = \mathbf{0}_{N \times N}$, while increasingly more dense graphs are obtained as $\lambda_t \to 0$. An increasing $\lambda_t$ will be required for accurate estimation over extended time horizons, since for $\beta \approx 1$ the norm of the LS term in (6) grows due to noise accumulation. This way the effect of the regularization term will be downweighted, unless one increases $\lambda_t$ at a suitable rate, for instance proportional to $\sqrt{\sigma^2 t}$ as suggested by large deviation tail bounds when the errors are assumed $e_{ic}^t \sim \mathcal{N}(0, \sigma^2)$ and the problem dimensions $N, C, T$ are sufficiently large [1], [25], [26]. In the topology tracking experiments of Section V, a time-invariant value of $\lambda$ is adopted and typically chosen via trial and error to optimize the performance. This is justified since smaller values of $\beta$ are selected for tracking network variations, which also implies that past data (and noise) are discarded faster, and the norm of the LS term in (6) remains almost invariant. As future research, it would be interesting to delve further into the choice of $\lambda_t$


using model selection techniques such as cross-validation [6], Bayesian information criterion (BIC) scores [17], or the minimum description length (MDL) principle [33], and to investigate how this choice relates to statistical model consistency in a dynamic setting.

B. Proximal gradient algorithm

Exploiting the problem structure in (6), a proximal gradient (PG) algorithm is developed in this section to track the network topology; see [31] for a comprehensive tutorial treatment of proximal methods. PG methods have been popularized for $\ell_1$-norm regularized linear regression problems through the class of iterative shrinkage-thresholding algorithms (ISTA); see e.g., [10], [39]. The main advantage of ISTA over off-the-shelf interior point methods is its computational simplicity. Iterations boil down to matrix-vector multiplications involving the regression matrix, followed by a soft-thresholding operation [14, p. 93].

In the sequel, an ISTA algorithm is developed for the sparsity-regularized dynamic SEM formulation (6) at time $t$. Based on this module, a (pseudo) real-time algorithm for tracking the dynamically-evolving network topology over the horizon $t = 1, \ldots, T$ is obtained as well. The resulting algorithm's memory storage requirement and computational cost per data sample $\{\mathbf{Y}^t, \mathbf{X}\}$ do not grow with $t$.

Solving (6) for a single time interval $t$. Introducing the optimization variable $\mathbf{V} := [\mathbf{A} \ \mathbf{B}]$, observe that the gradient of $f(\mathbf{V}) := \frac{1}{2}\sum_{\tau=1}^{t} \beta^{t-\tau}\|\mathbf{Y}^\tau - \mathbf{A}\mathbf{Y}^\tau - \mathbf{B}\mathbf{X}\|_F^2$ is Lipschitz continuous with a (minimum) Lipschitz constant $L_f = \lambda_{\max}\big(\sum_{\tau=1}^{t} \beta^{t-\tau}[(\mathbf{Y}^\tau)^\top \ \mathbf{X}^\top]^\top[(\mathbf{Y}^\tau)^\top \ \mathbf{X}^\top]\big)$, i.e., $\|\nabla f(\mathbf{V}_1) - \nabla f(\mathbf{V}_2)\| \leq L_f\|\mathbf{V}_1 - \mathbf{V}_2\|$ for all $\mathbf{V}_1, \mathbf{V}_2$ in the domain of $f$. The Lipschitz constant is time-varying, but the dependency on $t$ is kept implicit for notational convenience. Instead of directly optimizing the cost in (6), PG algorithms minimize a sequence of overestimators evaluated at judiciously chosen points $\mathbf{U}$ (typically the current iterate, or a linear combination of the two previous iterates as discussed in Section IV-A). From the Lipschitz continuity of $\nabla f$, for any $\mathbf{V}$ and $\mathbf{U}$ in the domain of $f$ it holds that $f(\mathbf{V}) \leq Q_f(\mathbf{V},\mathbf{U}) := f(\mathbf{U}) + \langle\nabla f(\mathbf{U}), \mathbf{V} - \mathbf{U}\rangle + (L_f/2)\|\mathbf{V} - \mathbf{U}\|_F^2$. Next, define $g(\mathbf{V}) := \lambda_t\|\mathbf{A}\|_1$ and form the quadratic approximation of the cost $f(\mathbf{V}) + g(\mathbf{V})$ [cf. (6)] at a given point $\mathbf{U}$

$$Q(\mathbf{V}, \mathbf{U}) := Q_f(\mathbf{V}, \mathbf{U}) + g(\mathbf{V}) = \frac{L_f}{2}\|\mathbf{V} - \mathbf{G}(\mathbf{U})\|_F^2 + g(\mathbf{V}) + f(\mathbf{U}) - \frac{\|\nabla f(\mathbf{U})\|_F^2}{2L_f} \qquad (7)$$

where $\mathbf{G}(\mathbf{U}) := \mathbf{U} - (1/L_f)\nabla f(\mathbf{U})$, and clearly $f(\mathbf{V}) + g(\mathbf{V}) \leq Q(\mathbf{V}, \mathbf{U})$ for any $\mathbf{V}$ and $\mathbf{U}$. Note that $\mathbf{G}(\mathbf{U})$ corresponds to a gradient-descent step taken from $\mathbf{U}$, with step size equal to $1/L_f$.

With $k = 1, 2, \ldots$ denoting iterations, PG algorithms set $\mathbf{U} := \mathbf{V}[k-1]$ and generate the following sequence of iterates

$$\mathbf{V}[k] := \arg\min_{\mathbf{V}} Q(\mathbf{V}, \mathbf{V}[k-1]) = \arg\min_{\mathbf{V}} \left\{ \frac{L_f}{2}\|\mathbf{V} - \mathbf{G}(\mathbf{V}[k-1])\|_F^2 + g(\mathbf{V}) \right\} \qquad (8)$$

where the second equality follows from the fact that the last two summands in (7) do not depend on $\mathbf{V}$. The optimization problem (8) is known as the proximal operator of the function $g/L_f$ evaluated at $\mathbf{G}(\mathbf{V}[k-1])$, and is denoted as $\mathrm{prox}_{g/L_f}(\mathbf{G}(\mathbf{V}[k-1]))$. Henceforth adopting the notation $\mathbf{G}[k-1] := \mathbf{G}(\mathbf{V}[k-1])$ for convenience, the PG iterations can be compactly rewritten as

$$\mathbf{V}[k] = \mathrm{prox}_{g/L_f}(\mathbf{G}[k-1]). \qquad (9)$$

A key element to the success of PG algorithms stems from the possibility of efficiently solving the sequence of subproblems (8), i.e., evaluating the proximal operator. Specializing to (6), note that (8) decomposes into

$$\mathbf{A}[k] := \arg\min_{\mathbf{A}} \left\{ \frac{L_f}{2}\|\mathbf{A} - \mathbf{G}_A[k-1]\|_F^2 + \lambda_t\|\mathbf{A}\|_1 \right\} = \mathrm{prox}_{\lambda_t\|\cdot\|_1/L_f}(\mathbf{G}_A[k-1]) \qquad (10)$$
$$\mathbf{B}[k] := \arg\min_{\mathbf{B}} \left\{ \|\mathbf{B} - \mathbf{G}_B[k-1]\|_F^2 \right\} = \mathbf{G}_B[k-1] \qquad (11)$$

subject to the constraints in (6), which so far have been left implicit, and $\mathbf{G} := [\mathbf{G}_A \ \mathbf{G}_B]$. Because there is no regularization on the matrix $\mathbf{B}$, the corresponding update (11) boils down to a simple gradient-descent step. Letting $\mathcal{S}_\mu(\mathbf{M})$ with $(i,j)$-th entry given by $\mathrm{sign}(m_{ij})\max(|m_{ij}| - \mu, 0)$ denote the soft-thresholding operator, it follows that $\mathrm{prox}_{\lambda_t\|\cdot\|_1/L_f}(\cdot) = \mathcal{S}_{\lambda_t/L_f}(\cdot)$, e.g., [10], [14]; so that
$$\mathbf{A}[k] = \mathcal{S}_{\lambda_t/L_f}(\mathbf{G}_A[k-1]). \qquad (12)$$
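In code, the soft-thresholding operator $\mathcal{S}_\mu(\cdot)$ acts entrywise; a minimal NumPy sketch (illustrative names, not from the paper):

```python
import numpy as np

def soft_threshold(M, mu):
    """Entrywise soft-thresholding: sign(m_ij) * max(|m_ij| - mu, 0)."""
    return np.sign(M) * np.maximum(np.abs(M) - mu, 0.0)

# Example: thresholding a gradient-step matrix with mu = lambda_t / L_f, cf. (12)
G_A = np.array([[0.0, 0.8, -0.2], [0.1, 0.0, 1.5], [-0.9, 0.05, 0.0]])
A_next = soft_threshold(G_A, mu=0.3)  # small entries are zeroed out
```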

What remains now is to obtain expressions for the gradient of $f(\mathbf{V})$ with respect to $\mathbf{A}$ and $\mathbf{B}$, which are required to form the matrices $\mathbf{G}_A$ and $\mathbf{G}_B$. To this end, note that by incorporating the constraints $a_{ii} = 0$ and $b_{ij} = 0$, $\forall j \neq i$, $i = 1, \ldots, N$, one can simplify the expression of $f(\mathbf{V})$ as

$$f(\mathbf{V}) := \frac{1}{2}\sum_{\tau=1}^{t}\sum_{i=1}^{N} \beta^{t-\tau}\left\|(\mathbf{y}_i^\tau)^\top - \mathbf{a}_{-i}^\top\mathbf{Y}_{-i}^\tau - b_{ii}\mathbf{x}_i^\top\right\|_F^2 \qquad (13)$$

where (yτi )

⊤ and x⊤i denote thei-th row of Yτ and X,

respectively; whilea⊤−i denotes the1×(N−1) vector obtainedby removing entryi from thei-th row ofA, and likewiseYτ

−i

is the (N − 1)× C matrix obtained by removing rowi fromYτ . It is apparent from (13) thatf(V) is separable across thetrimmed row vectorsa⊤−i, and the scalar diagonal entriesbii,i = 1, . . . , N . The sought gradients are readily obtained as

$$\nabla_{\mathbf{a}_{-i}} f(\mathbf{V}) = -\sum_{\tau=1}^{t} \beta^{t-\tau}\mathbf{Y}_{-i}^\tau\left(\mathbf{y}_i^\tau - (\mathbf{Y}_{-i}^\tau)^\top\mathbf{a}_{-i} - \mathbf{x}_i b_{ii}\right)$$
$$\nabla_{b_{ii}} f(\mathbf{V}) = -\sum_{\tau=1}^{t} \beta^{t-\tau}\left((\mathbf{y}_i^\tau)^\top - \mathbf{a}_{-i}^\top\mathbf{Y}_{-i}^\tau - b_{ii}\mathbf{x}_i^\top\right)\mathbf{x}_i.$$

At time interval $t$, consider the data-related EWMAs $\boldsymbol{\Sigma}^t := \sum_{\tau=1}^{t}\beta^{t-\tau}\mathbf{Y}^\tau(\mathbf{Y}^\tau)^\top$, $\boldsymbol{\sigma}_i^t := \sum_{\tau=1}^{t}\beta^{t-\tau}\mathbf{Y}^\tau\mathbf{y}_i^\tau$, and $\bar{\mathbf{Y}}^t := \sum_{\tau=1}^{t}\beta^{t-\tau}\mathbf{Y}^\tau$. With these definitions, the gradient expressions for $i = 1, \ldots, N$ can be compactly expressed as

$$\nabla_{\mathbf{a}_{-i}} f(\mathbf{V}) = \boldsymbol{\Sigma}_{-i}^t\mathbf{a}_{-i} + \bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i b_{ii} - \boldsymbol{\sigma}_{-i}^t \qquad (14)$$
$$\nabla_{b_{ii}} f(\mathbf{V}) = \mathbf{a}_{-i}^\top\bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i + \frac{1-\beta^t}{1-\beta}b_{ii}\|\mathbf{x}_i\|_2^2 - (\mathbf{y}_i^t)^\top\mathbf{x}_i \qquad (15)$$


where $(\mathbf{y}_i^t)^\top$ denotes the $i$-th row of $\mathbf{Y}^t$, $\bar{\mathbf{Y}}_{-i}^t$ is the $(N-1) \times C$ matrix obtained by removing row $i$ from $\bar{\mathbf{Y}}^t$, and $\boldsymbol{\Sigma}_{-i}^t$ is the $(N-1) \times (N-1)$ matrix obtained by removing the $i$-th row and $i$-th column from $\boldsymbol{\Sigma}^t$.

From (11)-(12) and (14)-(15), the parallel ISTA iterations

$$\nabla_{\mathbf{a}_{-i}} f[k] = \boldsymbol{\Sigma}_{-i}^t\mathbf{a}_{-i}[k] + \bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i b_{ii}[k] - \boldsymbol{\sigma}_{-i}^t \qquad (16)$$
$$\nabla_{b_{ii}} f[k] = \mathbf{a}_{-i}^\top[k]\bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i + \frac{1-\beta^t}{1-\beta}b_{ii}[k]\|\mathbf{x}_i\|_2^2 - (\mathbf{y}_i^t)^\top\mathbf{x}_i \qquad (17)$$
$$\mathbf{a}_{-i}[k+1] = \mathcal{S}_{\lambda_t/L_f}\left(\mathbf{a}_{-i}[k] - (1/L_f)\nabla_{\mathbf{a}_{-i}} f[k]\right) \qquad (18)$$
$$b_{ii}[k+1] = b_{ii}[k] - (1/L_f)\nabla_{b_{ii}} f[k] \qquad (19)$$

are provably convergent to the globally optimal solution $\{\hat{\mathbf{A}}^t, \hat{\mathbf{B}}^t\}$ of (6), as per the general convergence results available for PG methods and ISTA in particular [10], [31].

Computation of the gradients in (16)-(17) requires one matrix-vector multiplication by $\boldsymbol{\Sigma}_{-i}^t$ and one by $\bar{\mathbf{Y}}_{-i}^t$, in addition to three vector inner products, plus a few (negligibly complex) scalar and vector additions. Both the update of $b_{ii}[k+1]$ as well as the soft-thresholding operation in (18) entail negligible computational complexity. All in all, the simplicity of the resulting iterations should be apparent. Per iteration, the actual rows of the adjacency matrix are obtained by zero-padding the updated $\mathbf{a}_{-i}[k]$, namely setting

$$\mathbf{a}_i^\top[k] = [a_{-i,1}[k] \ \ldots \ a_{-i,i-1}[k] \ \ 0 \ \ a_{-i,i}[k] \ \ldots \ a_{-i,N-1}[k]]. \qquad (20)$$

This way, the desired SEM parameter estimates at time $t$ are given by $\hat{\mathbf{A}}^t = [\mathbf{a}_1^\top[k], \ldots, \mathbf{a}_N^\top[k]]^\top$ and $\hat{\mathbf{B}}^t = \mathrm{diag}(b_{11}[k], \ldots, b_{NN}[k])$, for $k$ large enough so that convergence has been attained.

Remark 4 (General sparsity-promoting regularization): Beyond $g(\mathbf{A}) = \lambda_t\|\mathbf{A}\|_1$, the algorithmic framework here can accommodate more general structured sparsity-promoting regularizers $\gamma(\mathbf{A})$, as long as the resulting proximal operator $\mathrm{prox}_{\gamma/L_f}(\cdot)$ is given in terms of scalar and/or vector soft-thresholding operators. In addition to the $\ell_1$-norm (Lasso penalty), this holds e.g., for the sum of the $\ell_2$-norms of vectors with groups of non-overlapping entries of $\mathbf{A}$ (group Lasso penalty [40]), or a linear combination of the aforementioned two, the so-termed hierarchical Lasso penalty that encourages sparsity across and within the groups defined over $\mathbf{A}$ [38]. These types of regularization could be useful if one has e.g., a priori knowledge that some clusters of nodes are more likely to be jointly (in)active [35].

Solving (6) over the entire time horizon $t = 1, \ldots, T$. To track the dynamically-evolving network topology, one can solve (6) sequentially for each $t = 1, \ldots, T$ as data arrive, using (16)-(19). (The procedure can also be adopted in a batch setting, when all $\{\mathbf{Y}^t\}_{t=1}^T$ are available in memory.) Because the network is assumed to vary slowly across time intervals, it is convenient to warm-restart the ISTA iterations, that is, at time $t$ initialize $\{\mathbf{A}[0], \mathbf{B}[0]\}$ with the previous solution $\{\hat{\mathbf{A}}^{t-1}, \hat{\mathbf{B}}^{t-1}\}$. Since the sought estimates are expected to be close to the initial points, one expects convergence to be attained after a few iterations.

To obtain the new SEM parameter estimates via (16)-(19),it suffices to update (possibly)λt and the Lipschitz constant

Algorithm 1 Pseudo real-time ISTA for topology tracking

Require: $\{\mathbf{Y}^t\}_{t=1}^T$, $\mathbf{X}$, $\beta$.
1: Initialize $\mathbf{A}^0 = \mathbf{0}_{N\times N}$, $\mathbf{B}^0 = \boldsymbol{\Sigma}^0 = \mathbf{I}_N$, $\bar{\mathbf{Y}}^0 = \mathbf{0}_{N\times C}$, $\lambda_0$.
2: for $t = 1, \ldots, T$ do
3: Update $\lambda_t$, $L_f$ and $\boldsymbol{\Sigma}^t$, $\bar{\mathbf{Y}}^t$ via (21)-(22).
4: Initialize $\mathbf{A}[0] = \hat{\mathbf{A}}^{t-1}$, $\mathbf{B}[0] = \hat{\mathbf{B}}^{t-1}$, and set $k = 0$.
5: while not converged do
6: for $i = 1, \ldots, N$ (in parallel) do
7: Compute $\boldsymbol{\Sigma}_{-i}^t$ and $\bar{\mathbf{Y}}_{-i}^t$.
8: Form gradients at $\mathbf{a}_{-i}[k]$ and $b_{ii}[k]$ via (16)-(17).
9: Update $\mathbf{a}_{-i}[k+1]$ via (18).
10: Update $b_{ii}[k+1]$ via (19).
11: Update $\mathbf{a}_i[k+1]$ via (20).
12: end for
13: $k = k + 1$.
14: end while
15: return $\hat{\mathbf{A}}^t = \mathbf{A}[k]$, $\hat{\mathbf{B}}^t = \mathbf{B}[k]$.
16: end for

$L_f$, as well as the data-dependent EWMAs $\boldsymbol{\Sigma}^t$ ($\boldsymbol{\sigma}_i^t$ is the $i$-th column of $\boldsymbol{\Sigma}^t$) and $\bar{\mathbf{Y}}^t$. Interestingly, the potential growing-memory problem in storing the entire history of data $\{\mathbf{Y}^t\}_{t=1}^T$ can be avoided by performing the recursive updates
$$\boldsymbol{\Sigma}^t = \beta\boldsymbol{\Sigma}^{t-1} + \mathbf{Y}^t(\mathbf{Y}^t)^\top \qquad (21)$$
$$\bar{\mathbf{Y}}^t = \beta\bar{\mathbf{Y}}^{t-1} + \mathbf{Y}^t. \qquad (22)$$

Note that the complexity of evaluating the Gram matrix $\mathbf{Y}^t(\mathbf{Y}^t)^\top$ dominates the per-iteration computational cost of the algorithm. To circumvent the need to recompute the Lipschitz constant per time interval (which in this case entails finding the spectral radius of a data-dependent matrix), the step size $1/L_f$ in (18)-(19) can be selected by a line search [31]. One possible choice is the backtracking step-size rule in [5], under which convergence of (14)-(19) to $\{\hat{\mathbf{A}}^t, \hat{\mathbf{B}}^t\}$ can be established as well.
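To fix ideas, here is a compact Python sketch of Algorithm 1's main loop, under the stated assumptions: names such as `track_topology_ista` are ours, the Lipschitz constant is formed from the EWMA data as described above, and the per-node updates are written in place rather than in parallel for brevity.

```python
import numpy as np

def soft_threshold(v, mu):
    # entrywise soft-thresholding, cf. (12)
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def track_topology_ista(Y_stream, X, beta=0.98, lam=25.0, n_inner=10):
    """Sketch of Algorithm 1: track {A^t, B^t} via ISTA with warm restarts."""
    N, C = X.shape
    A, b = np.zeros((N, N)), np.ones(N)          # b stores diag(B^t)
    Sigma, Ybar, w = np.eye(N), np.zeros((N, C)), 0.0
    estimates = []
    for Yt in Y_stream:                          # one interval t per item
        Sigma = beta * Sigma + Yt @ Yt.T         # EWMA update (21)
        Ybar = beta * Ybar + Yt                  # EWMA update (22)
        w = beta * w + 1.0                       # w = (1 - beta^t)/(1 - beta)
        # Lipschitz constant of the smooth part (cf. Section III-B)
        M = np.block([[Sigma, Ybar @ X.T], [X @ Ybar.T, w * (X @ X.T)]])
        Lf = np.linalg.eigvalsh(M).max()
        for _ in range(n_inner):                 # inner ISTA iterations
            for i in range(N):                   # per node (parallelizable)
                idx = np.delete(np.arange(N), i)
                a, xi = A[i, idx], X[i]
                grad_a = Sigma[np.ix_(idx, idx)] @ a \
                    + (Ybar[idx] @ xi) * b[i] - Sigma[idx, i]          # (16)
                grad_b = a @ (Ybar[idx] @ xi) + w * b[i] * (xi @ xi) \
                    - Yt[i] @ xi                                        # (17)
                A[i, idx] = soft_threshold(a - grad_a / Lf, lam / Lf)   # (18)
                b[i] -= grad_b / Lf                                     # (19)
        estimates.append((A.copy(), np.diag(b)))
    return estimates
```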

Algorithm 1 summarizes the steps outlined in this section for tracking the dynamic network topology, given temporal traces of infection events $\{\mathbf{Y}^t\}_{t=1}^T$ and susceptibilities $\mathbf{X}$. It is termed pseudo real-time ISTA, since in principle one needs to run multiple (inner) ISTA iterations till convergence per time interval $t = 1, \ldots, T$. This will in turn incur an associated delay, which may (or may not) be tolerable depending on the specific network inference problem at hand. Nevertheless, numerical tests indicate that in practice 5-10 inner iterations suffice for convergence; see also Fig. 2 and the discussion in Section IV-B.

Remark 5 (Comparison with the ADMM in [3]): In a conference precursor to this paper [3], an alternating-direction method of multipliers (ADMM) algorithm was put forth to estimate the dynamic SEM parameters. While the basic global structure of the algorithm in [3] is similar to Algorithm 1, ADMM is adopted (instead of ISTA) to solve (6) per time $t = 1, \ldots, T$. To update $\mathbf{a}_{-i}[k+1]$, ADMM iterations require inverting the matrix $\boldsymbol{\Sigma}_{-i}^t + \mathbf{I}_{N-1}$, which could be computationally demanding for very large networks. On the other hand, Algorithm 1 is markedly simpler and more appealing for larger-scale problems.

IV. ALGORITHMIC ENHANCEMENTS AND VARIANTS

This section deals with various improvements to Algorithm 1 that pertain to accelerating its rate of convergence, as well as adapting it for real-time operation in time-sensitive applications. In addition, a stochastic-gradient algorithm useful when minimal computational complexity is at a premium is also outlined.

A. Accelerated proximal gradient method and fast ISTA

In the context of sparsity-regularized inverse problems and general non-smooth optimization, there have been several recent efforts towards improving the sublinear global rate of convergence exhibited by PG algorithms such as ISTA; see e.g., [5], [29], [30] and references therein. Since for large-scale problems first-order (gradient) methods are in many cases the only admissible alternative, the goal of these works has been to retain the computational simplicity of ISTA while markedly enhancing its global rate of convergence.

Remarkable results in [30] assert that convergence speedups can be obtained through the so-termed accelerated (A)PG algorithm. Going back to the derivations in the beginning of Section III-B, APG algorithms generate the following sequence of iterates [cf. (8) and (9)]

$$\mathbf{V}[k] = \arg\min_{\mathbf{V}} Q(\mathbf{V}, \mathbf{U}[k-1]) = \mathrm{prox}_{g/L_f}(\mathbf{G}(\mathbf{U}[k-1]))$$

where

$$\mathbf{U}[k] := \mathbf{V}[k-1] + \left(\frac{c[k-1]-1}{c[k]}\right)(\mathbf{V}[k-1] - \mathbf{V}[k-2]) \qquad (23)$$
$$c[k] = \frac{1 + \sqrt{4c^2[k-1] + 1}}{2}. \qquad (24)$$

In words, instead of minimizing a quadratic approximation to the cost evaluated at $\mathbf{V}[k-1]$ as in ISTA [cf. (8)], the accelerated PG algorithm [a.k.a. fast (F)ISTA] utilizes a linear combination of the previous two iterates $\{\mathbf{V}[k-1], \mathbf{V}[k-2]\}$. The iteration-dependent combination weights are a function of the scalar sequence (24). FISTA offers quantifiable iteration complexity, namely a (worst-case) convergence rate guarantee of $O(1/\sqrt{\epsilon})$ iterations to return an $\epsilon$-optimal solution measured by its objective value (ISTA instead offers $O(1/\epsilon)$) [5], [30]. Even for general (non-)smooth optimization, APG algorithms have been shown to be optimal within the class of first-order (gradient) methods, in the sense that the aforementioned worst-case convergence rate cannot be improved [29], [30].

The FISTA solver for (6) entails the following steps [cf. (16)-(19)]:

Algorithm 2 Pseudo real-time FISTA for topology tracking

Require: $\{\mathbf{Y}^t\}_{t=1}^T$, $\mathbf{X}$, $\beta$.
1: Initialize $\mathbf{A}^0 = \mathbf{0}_{N\times N}$, $\mathbf{B}^0 = \boldsymbol{\Sigma}^0 = \mathbf{I}_N$, $\bar{\mathbf{Y}}^0 = \mathbf{0}_{N\times C}$, $\lambda_0$.
2: for $t = 1, \ldots, T$ do
3: Update $\lambda_t$, $L_f$ and $\boldsymbol{\Sigma}^t$, $\bar{\mathbf{Y}}^t$ via (21)-(22).
4: Initialize $\mathbf{A}[0] = \mathbf{A}[-1] = \hat{\mathbf{A}}^{t-1}$, $\mathbf{B}[0] = \mathbf{B}[-1] = \hat{\mathbf{B}}^{t-1}$, $c[0] = c[-1] = 1$, and set $k = 0$.
5: while not converged do
6: for $i = 1, \ldots, N$ (in parallel) do
7: Compute $\boldsymbol{\Sigma}_{-i}^t$ and $\bar{\mathbf{Y}}_{-i}^t$.
8: Update $\mathbf{a}_{-i}[k]$ and $b_{ii}[k]$ via (25)-(26).
9: Form gradients at $\mathbf{a}_{-i}[k]$ and $b_{ii}[k]$ via (27)-(28).
10: Update $\mathbf{a}_{-i}[k+1]$ via (29).
11: Update $b_{ii}[k+1]$ via (30).
12: Update $\mathbf{a}_i[k+1]$ via (20).
13: end for
14: $k = k + 1$.
15: Update $c[k]$ via (24).
16: end while
17: return $\hat{\mathbf{A}}^t = \mathbf{A}[k]$, $\hat{\mathbf{B}}^t = \mathbf{B}[k]$.
18: end for

$$\mathbf{a}_{-i}[k] := \mathbf{a}_{-i}[k] + \left(\frac{c[k-1]-1}{c[k]}\right)(\mathbf{a}_{-i}[k] - \mathbf{a}_{-i}[k-1]) \qquad (25)$$
$$b_{ii}[k] := b_{ii}[k] + \left(\frac{c[k-1]-1}{c[k]}\right)(b_{ii}[k] - b_{ii}[k-1]) \qquad (26)$$
$$\nabla_{\mathbf{a}_{-i}} f[k] = \boldsymbol{\Sigma}_{-i}^t\mathbf{a}_{-i}[k] + \bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i b_{ii}[k] - \boldsymbol{\sigma}_{-i}^t \qquad (27)$$
$$\nabla_{b_{ii}} f[k] = \mathbf{a}_{-i}^\top[k]\bar{\mathbf{Y}}_{-i}^t\mathbf{x}_i + \frac{1-\beta^t}{1-\beta}b_{ii}[k]\|\mathbf{x}_i\|_2^2 - (\mathbf{y}_i^t)^\top\mathbf{x}_i \qquad (28)$$
$$\mathbf{a}_{-i}[k+1] = \mathcal{S}_{\lambda_t/L_f}\left(\mathbf{a}_{-i}[k] - (1/L_f)\nabla_{\mathbf{a}_{-i}} f[k]\right) \qquad (29)$$
$$b_{ii}[k+1] = b_{ii}[k] - (1/L_f)\nabla_{b_{ii}} f[k] \qquad (30)$$

where $c[k]$ is updated as in (24). The overall (pseudo) real-time FISTA for tracking the network topology is tabulated under Algorithm 2. As desired, the computational complexity of Algorithms 1 and 2 is roughly the same. Relative to Algorithm 1, the memory requirements are essentially doubled, since one now has to store the two prior estimates of $\mathbf{A}$ and $\mathbf{B}$, which are nevertheless sparse and diagonal matrices, respectively. Numerical tests in Section V suggest that Algorithm 2 exhibits the best performance when compared to Algorithm 1 and the ADMM solver of [3], especially when modified to accommodate real-time processing requirements, the subject dealt with next.
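The acceleration only adds the momentum recursion (23)-(24) on top of a plain proximal step. A generic, self-contained sketch follows (illustrative; `prox_step` stands for the map $\mathbf{U} \mapsto \mathrm{prox}_{g/L_f}(\mathbf{G}(\mathbf{U}))$ of Section III-B, and the toy Lasso problem at the end is ours):

```python
import numpy as np

def fista(prox_step, V0, n_iter=100):
    """Generic FISTA loop, cf. (23)-(24): extrapolate over the two previous
    iterates, then apply the proximal-gradient map `prox_step`."""
    V_prev = V0.copy()
    V = prox_step(V0)        # first iteration is a plain ISTA step
    c = 1.0
    for _ in range(n_iter - 1):
        c_next = (1.0 + np.sqrt(4.0 * c**2 + 1.0)) / 2.0   # (24)
        U = V + ((c - 1.0) / c_next) * (V - V_prev)        # extrapolation (23)
        V_prev, V = V, prox_step(U)
        c = c_next
    return V

# Toy usage: lasso problem 0.5*||M v - d||_2^2 + lam*||v||_1
rng = np.random.default_rng(0)
M, d, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
L = np.linalg.eigvalsh(M.T @ M).max()               # Lipschitz constant
soft = lambda v, mu: np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)
step = lambda u: soft(u - (M.T @ (M @ u - d)) / L, lam / L)
v_hat = fista(step, np.zeros(5))
```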

B. Inexact (F)ISTA for time-sensitive operation

Additional challenges arise with real-time data collection, where analytics must often be performed "on-the-fly" as well as without an opportunity to revisit past entries. Online operation in delay-sensitive applications may not tolerate running multiple inner (F)ISTA iterations per time interval, so that


Algorithm 3 Real-time inexact FISTA for topology tracking

Require: $\{\mathbf{Y}^t\}_{t=1}^T$, $\mathbf{X}$, $\beta$.
1: Initialize $\mathbf{A}[1] = \mathbf{A}[0] = \mathbf{0}_{N\times N}$, $\mathbf{B}[1] = \mathbf{B}[0] = \boldsymbol{\Sigma}^0 = \mathbf{I}_N$, $\bar{\mathbf{Y}}^0 = \mathbf{0}_{N\times C}$, $c[1] = c[0] = 1$, $\lambda_0$.
2: for $t = 1, \ldots, T$ do
3: Update $\lambda_t$, $L_f$ and $\boldsymbol{\Sigma}^t$, $\bar{\mathbf{Y}}^t$ via (21)-(22).
4: for $i = 1, \ldots, N$ (in parallel) do
5: Compute $\boldsymbol{\Sigma}_{-i}^t$ and $\bar{\mathbf{Y}}_{-i}^t$.
6: Update $\mathbf{a}_{-i}[t]$ and $b_{ii}[t]$ via (25)-(26).
7: Form gradients at $\mathbf{a}_{-i}[t]$ and $b_{ii}[t]$ via (27)-(28).
8: Update $\mathbf{a}_{-i}[t+1]$ via (29).
9: Update $b_{ii}[t+1]$ via (30).
10: Update $\mathbf{a}_i[t+1]$ via (20).
11: end for
12: Update $c[t+1]$ via (24).
13: return $\hat{\mathbf{A}}^t = \mathbf{A}[t+1]$, $\hat{\mathbf{B}}^t = \mathbf{B}[t+1]$.
14: end for

convergence is attained for each $t$, as required by Algorithms 1 and 2. This section touches upon an interesting tradeoff that emerges with time-constrained data-intensive problems, where a high-quality answer that is obtained slowly can be less useful than a medium-quality answer that is obtained quickly.

Consider for the sake of exposition a scenario where the underlying network processes are stationary, or just piecewise stationary with sufficiently long coherence time for that matter. The rationale behind the proposed real-time algorithm hinges upon the fact that the solution of (6) for each $t = 1, \ldots, T$ need not be highly accurate in the aforementioned stationary setting, since it is just an intermediate step in the outer loop matched to the time instants of data acquisition. This motivates stopping the inner iterations which solve (6) early (cf. the while loop in Algorithms 1 and 2), possibly even after a single soft-thresholding step, as detailed in the real-time Algorithm 3. Note that in this case the inner-iteration index $k$ coincides with the time index $t$. A similar adjustment can be made to the ISTA variant (Algorithm 1), and one can in general adopt a less aggressive approach by allowing a few (not just one) inner iterations per $t$.

A convergence proof of Algorithm 3 in a stationary network setting will not be provided here, and is left as a future research direction. Still, convergence will be demonstrated next with the aid of computer simulations. For the infinite-memory case [cf. $\beta = 1$ in (6)] and the simpler ISTA counterpart of Algorithm 3 obtained when $c[t] = 1$, $\forall t$, it appears possible to adapt the arguments in [24], [25] to establish that the resulting iterations converge to a minimizer of the batch problem (4). In the dynamic setting where the network is time-varying, convergence is not expected to occur because of the continuous network fluctuations. Still, as with adaptive signal processing algorithms [37], one would like to establish that the tracking error attains a bounded steady state. These interesting and challenging problems are the subject of ongoing investigation and will be reported elsewhere.

For synthetically-generated data according to the setup described in Section V-A, Fig. 2 shows the time evolution of Algorithm 2's mean-square error (MSE) estimation performance.

Fig. 2. MSE (i.e., $\sum_{i,j}(a_{ij}^t - \hat{a}_{ij}^t)^2/N^2$) performance of Algorithm 2 versus time. For each $t$, problem (6) is solved "inexactly" for $k = 1$ (Algorithm 3), $5$, $10$, and $15$ inner iterations. It is apparent that $k = 5$ iterations suffice to attain convergence to the minimizer of (6) per $t$, especially after a short transient where the warm restarts offer increasingly better initializations.

For each time interval $t$, (6) is solved "inexactly" after running only $k = 1$, $5$, $10$, and $15$ inner iterations. Note that the case $k = 1$ corresponds to Algorithm 3. Certainly $k = 10$ iterations suffice for the FISTA algorithm to converge to the minimizer of (6); the curve for $k = 15$ is identical. Even with $k = 5$ the obtained performance is satisfactory for all practical purposes, especially after a short transient where the warm restarts offer increasingly better initializations. While Algorithm 3 shows a desirable convergent behavior, it seems that this example's network coherence time of $t = 250$ time intervals is too short to be tracked effectively. Still, if the network changes are sufficiently smooth, as occurs at $t = 750$, then the real-time algorithm is able to estimate the network reliably.

C. Stochastic-gradient descent algorithm

A stochastic gradient descent (SGD) algorithm is developed in this section, which operates in real time and can track the (slowly-varying) underlying network topology. Among all algorithms developed so far, the SGD iterations incur the least computational cost.

Towards obtaining the SGD algorithm, consider $\beta = 0$ in (6). The resulting cost function can be expressed as $f_t(\mathbf{V}) + g(\mathbf{V})$, where $\mathbf{V} := [\mathbf{A} \ \mathbf{B}]$ and $f_t(\mathbf{V}) := (1/2)\|\mathbf{Y}^t - \mathbf{A}\mathbf{Y}^t - \mathbf{B}\mathbf{X}\|_F^2$ only accounts for the data acquired at time interval $t$. Motivated by computational simplicity, the "inexact" gradient descent plus soft-thresholding ISTA iterations yield the following updates

$$\nabla_{\mathbf{a}_{-i}} f_t[t] = \mathbf{Y}_{-i}^t\left((\mathbf{Y}_{-i}^t)^\top\mathbf{a}_{-i}[t] + \mathbf{x}_i b_{ii}[t] - \mathbf{y}_i^t\right) \qquad (31)$$
$$\nabla_{b_{ii}} f_t[t] = \mathbf{a}_{-i}^\top[t]\mathbf{Y}_{-i}^t\mathbf{x}_i + b_{ii}[t]\|\mathbf{x}_i\|^2 - (\mathbf{y}_i^t)^\top\mathbf{x}_i \qquad (32)$$
$$\mathbf{a}_{-i}[t+1] = \mathcal{S}_{\eta\lambda_t}\left(\mathbf{a}_{-i}[t] - \eta\nabla_{\mathbf{a}_{-i}} f_t[t]\right) \qquad (33)$$
$$b_{ii}[t+1] = b_{ii}[t] - \eta\nabla_{b_{ii}} f_t[t]. \qquad (34)$$

Compared to the parallel ISTA iterations in Algorithm 1 [cf. (16)-(18)], three main differences are noteworthy: (i) iterations $k$ are merged with the time intervals $t$ of data acquisition;


Algorithm 4 SGD algorithm for topology tracking

Require: $\{\mathbf{Y}^t\}_{t=1}^T$, $\mathbf{X}$, $\eta$.
1: Initialize $\mathbf{A}[1] = \mathbf{0}_{N\times N}$, $\mathbf{B}[1] = \mathbf{I}_N$, $\lambda_1$.
2: for $t = 1, \ldots, T$ do
3: Update $\lambda_t$.
4: for $i = 1, \ldots, N$ (in parallel) do
5: Form gradients at $\mathbf{a}_{-i}[t]$ and $b_{ii}[t]$ via (31)-(32).
6: Update $\mathbf{a}_{-i}[t+1]$ via (33).
7: Update $b_{ii}[t+1]$ via (34).
8: Update $\mathbf{a}_i[t+1]$ via (20).
9: end for
10: return $\hat{\mathbf{A}}^t = \mathbf{A}[t+1]$, $\hat{\mathbf{B}}^t = \mathbf{B}[t+1]$.
11: end for

(ii) the stochastic gradients $\nabla_{\mathbf{a}_{-i}} f_t[t]$ and $\nabla_{b_{ii}} f_t[t]$ involve the (noisy) data $\{\mathbf{Y}^t(\mathbf{Y}^t)^\top, \mathbf{Y}^t\}$ instead of their time-averaged counterparts $\{\boldsymbol{\Sigma}^t, \bar{\mathbf{Y}}^t\}$; and (iii) a generic constant step size $\eta$ is utilized for the gradient descent steps.

The overall SGD algorithm is tabulated under Algorithm 4. Forming the gradients in (31)-(32) requires one matrix-vector multiplication by $(\mathbf{Y}_{-i}^t)^\top$ and two by $\mathbf{Y}_{-i}^t$. These multiplications dominate the per-iteration computational complexity of Algorithm 4, justifying its promised simplicity. Accelerated versions could be developed as well, at the expense of a marginal increase in computational complexity and doubling the memory requirements.
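A per-interval sketch of the SGD updates (31)-(34) follows (illustrative; `sgd_step` is our name, and a constant step size $\eta$ is assumed):

```python
import numpy as np

def soft(v, mu):
    # entrywise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0)

def sgd_step(A, b, Yt, X, lam, eta=1e-3):
    """One SGD time interval, cf. (31)-(34); A and b = diag(B) updated in place."""
    N = A.shape[0]
    for i in range(N):                         # parallelizable across nodes
        idx = np.delete(np.arange(N), i)
        a, xi = A[i, idx], X[i]
        r = Yt[idx].T @ a + xi * b[i] - Yt[i]  # per-node residual, shape (C,)
        grad_a = Yt[idx] @ r                                          # (31)
        grad_b = a @ (Yt[idx] @ xi) + b[i] * (xi @ xi) - Yt[i] @ xi   # (32)
        A[i, idx] = soft(a - eta * grad_a, eta * lam)                 # (33)
        b[i] -= eta * grad_b                                          # (34)
    return A, b
```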

To gain further intuition on the SGD algorithm developed, consider the online learning paradigm under which the network topology inference problem is to minimize the expected cost $\mathbb{E}[f_t(\mathbf{V}) + g(\mathbf{V})]$ (subject to the usual constraints on $\mathbf{V} = [\mathbf{A} \ \mathbf{B}]$). The expectation is taken w.r.t. the unknown probability distribution of the data. In lieu of the expectation, the approach taken throughout this paper is to minimize the empirical cost $\mathcal{C}_T(\mathbf{V}) := (1/T)[\sum_{t=1}^T f_t(\mathbf{V}) + g(\mathbf{V})]$. Note that for $\beta = 1$, the minimizers of $\mathcal{C}_T(\mathbf{V})$ coincide with (4), since the scaling by $1/T$ does not affect the optimal solution. For $\beta < 1$, the cost $\mathcal{C}_\beta^T(\mathbf{V}) := \sum_{t=1}^T \beta^{T-t}f_t(\mathbf{V}) + g(\mathbf{V})$ implements an EWMA which "forgets" past data and allows tracking. In all cases, the rationale is that, by virtue of the law of large numbers, if data $\{\mathbf{Y}^t\}_{t=1}^T$ are stationary, solving $\lim_{T\to\infty}\min_{\mathbf{V}} \mathcal{C}_T(\mathbf{V})$ yields the desired solution to the expected cost.

A different approach to achieve this same goal, typically with reduced computational complexity, is to drop the expectation (or the sample averaging operator for that matter), and update the estimates via a stochastic (sub)gradient iteration $\mathbf{V}[t] = \mathbf{V}[t-1] - \eta\,\partial\{f_t(\mathbf{V}) + g(\mathbf{V})\}|_{\mathbf{V}=\mathbf{V}[t-1]}$. The subgradients with respect to $\mathbf{a}_{-i}$ are
$$\partial_{\mathbf{a}_{-i}} f_t[t] = \mathbf{Y}_{-i}^t\left((\mathbf{Y}_{-i}^t)^\top\mathbf{a}_{-i}[t] + \mathbf{x}_i b_{ii}[t] - \mathbf{y}_i^t\right) + \lambda_t\,\mathrm{sign}(\mathbf{a}_{-i}[t]) \qquad (35)$$

so the resulting algorithm has the drawback of (in general) not providing sparse solutions per iteration; see also [7] for a sparse least-mean-squares (LMS) algorithm. For that reason, the approach here is to adopt the proximal gradient (ISTA) formalism to tackle the minimization of the instantaneous costs $f_t(\mathbf{V}) + g(\mathbf{V})$, and yield the sparsity-inducing soft-thresholded updates (33). Also acknowledging the limitation of subgradient

Fig. 3. Nonsmooth variation of synthetically-generated edge weights of the time-varying network. For each edge, one of the four depicted profiles is chosen uniformly at random.

methods to yield sparse solutions, related "truncated gradient" updates were advocated for sparse online learning in [19].

V. NUMERICAL TESTS

Performance of the proposed algorithms is assessed in this section via computer simulations using both synthetically-generated network data and real traces of information cascades collected from the web [21].

A. Synthetic data

Data generation. Numerical tests on synthetic network data are conducted here to evaluate the tracking ability and compare Algorithms 1-4. From a "seed graph" with adjacency matrix
$$\mathbf{M} = \begin{bmatrix} 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{bmatrix}$$
a Kronecker graph of size $N = 64$ nodes was generated as described in [20].$^1$ The resulting nonzero edge weights of $\mathbf{A}^t$ were allowed to vary over $T = 1{,}000$ intervals under three settings: i) i.i.d. Bernoulli(0.5) random variables; ii) random selection of the edge-evolution pattern uniformly from a set of four smooth functions: $a_{ij}(t) = 0.5 + 0.5\sin(0.1t)$, $a_{ij}(t) = 0.5 + 0.5\cos(0.1t)$, $a_{ij}(t) = e^{-0.01t}$, and $a_{ij}(t) = 0$; and iii) random selection of the edge-evolution pattern uniformly from a set of four nonsmooth functions shown in Fig. 3.

The number of contagions was set to $C = 80$, and $\mathbf{X}$ was formed with i.i.d. entries uniformly distributed over $[0, 3]$. Matrix $\mathbf{B}^t$ was set to $\mathrm{diag}(\mathbf{b}^t)$, where $\mathbf{b}^t \in \mathbb{R}^N$ is a standard Gaussian random vector. During time interval $t$, infection times were generated synthetically as $\mathbf{Y}^t = (\mathbf{I}_N - \mathbf{A}^t)^{-1}(\mathbf{B}^t\mathbf{X} + \mathbf{E}^t)$, where $\mathbf{E}^t$ is a standard Gaussian random matrix.

$^1$The Matlab implementation of Algorithms 1-4 used here can handle networks of several thousand nodes. Still, a smaller network is analyzed since the results are representative of the general behavior, and a smaller network offers better visualization of e.g., the adjacency matrices in Figs. 5 and 6.
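For reference, a sketch of the synthetic topology construction (a simplified deterministic Kronecker power; [20] describes the general stochastic construction, and the weight profiles of settings i)-iii) are then assigned to the nonzero entries):

```python
import numpy as np

# 4-node seed graph from Section V-A
M = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
support = np.kron(np.kron(M, M), M)     # third Kronecker power: N = 4^3 = 64
N = support.shape[0]

# Setting ii): each nonzero edge follows one of four smooth profiles
rng = np.random.default_rng(0)
profiles = [lambda t: 0.5 + 0.5 * np.sin(0.1 * t),
            lambda t: 0.5 + 0.5 * np.cos(0.1 * t),
            lambda t: np.exp(-0.01 * t),
            lambda t: 0.0 * t]
choice = rng.integers(0, 4, size=(N, N))    # profile index per edge
A_series = [support * np.choose(choice, [p(t) for p in profiles])
            for t in range(1, 1001)]        # T = 1,000 intervals
```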


Fig. 4. MSE versus time obtained using pseudo real-time ISTA (Algorithm 1), for different edge evolution patterns.

Fig. 5. Actual adjacency matrix $\mathbf{A}^t$ and corresponding estimate $\hat{\mathbf{A}}^t$ obtained using pseudo real-time ISTA (Algorithm 1), at time intervals $t = 450$ and $t = 900$.

Performance evaluation. With $\beta = 0.98$, Algorithm 1 was run after initializing the relevant variables as described in the algorithm table (cf. Section III-B), and setting $\lambda_0 = 25$. In addition, $\lambda_t = \lambda_0$ for $t = 1, \ldots, T$, as discussed in Remark 3. Fig. 4 shows the evolution of the mean-square error (MSE), $\sum_{i,j}(a_{ij}^t - \hat{a}_{ij}^t)^2/N^2$. As expected, the best performance was obtained when the temporal evolution of edges followed smooth functions. Even though the Bernoulli evolution of edges resulted in the highest MSE, Algorithm 1 still tracked the underlying topology with reasonable accuracy, as depicted in the heat maps of the inferred adjacency matrices; see Fig. 5.

Selection of a number of parameters is critical to the performance of the developed algorithms. In order to evaluate the effect of each parameter on the network estimates, several tests were conducted by tracking the non-smooth network

Fig. 6. Actual adjacency matrix at $t = 900$ compared with the inferred adjacency matrices using pseudo real-time FISTA (Algorithm 2), with $\lambda_t = \lambda$ for all $t$ and $\lambda = 0$, $\lambda = 50$, and $\lambda = 100$. While $\lambda = 0$ and $\lambda = 50$ markedly overestimate the support set associated with the true network edges, the value $\lambda = 100$ in this case appears to be just about right.

evolution using Algorithm 2 with varying parameter values. To illustrate the importance of leveraging sparsity of the edge weights, Fig. 6 depicts heat maps of the adjacency matrices inferred at $t = 900$, with $\lambda$ set to $0$, $50$, and $100$ for all time intervals. Comparisons with the actual adjacency matrix reveal that increasing $\lambda$ progressively refines the network estimates by driving erroneously detected nonzero edge weights to $0$. Indeed, the value $\lambda = 100$ in this case appears to be just about right, while smaller values markedly overestimate the support set associated with the edges present in the actual network.

Fig. 7 compares the MSE performance of Algorithm 2 for $\beta \in \{0.999, 0.990, 0.900, 0.750\}$. As expected, the MSE associated with values of $\beta$ approaching $1$ degrades more dramatically when changes occur within the network (at time intervals $t = 250$, $t = 500$, and $t = 750$ in this case; see Fig. 3). The MSE spikes observed when $\beta \in \{0.999, 0.990\}$ are a manifestation of the slower rate of adaptation of the algorithm for these values of the forgetting factor. In this experiment, $\beta = 0.990$ outperformed the rest for $t > 500$. In addition, comparisons of the MSE performance in the presence of increasing noise variance are depicted in Fig. 8. Although the MSE values are comparable during the initial stages of the topology inference process, as expected, higher noise levels lead to MSE performance degradation in the long run.

Finally, a comparison of the real-time versions of the different algorithms was carried out when tracking the synthetic time-varying network with non-smooth edge variations. Specifically, the real-time (inexact) counterparts of ISTA, FISTA (cf. Algorithm 3), SGD (cf. Algorithm 4), and a suitably modified version of the ADMM algorithm developed in [3] were run as suggested in Section IV-B, i.e., eliminating the inner while loop in Algorithms 1 and 2 so that a single iteration is run per time interval. Fig. 9 compares the resulting MSE curves as


Fig. 7. MSE performance of the pseudo real-time FISTA (Algorithm 2) versus time, for different values of the forgetting factor $\beta$.

Fig. 8. MSE performance of the pseudo real-time FISTA (Algorithm 2) versus time, for different values of the noise variance $\sigma^2$.

the error evolves with time, showing that the inexact online FISTA algorithm achieves the best error performance. The MSE performance degradation of Algorithm 3 relative to its (exact) counterpart Algorithm 2 is depicted in Fig. 2, as a function of the number of inner iterations $k$.

Comparison with [35]. The proposed Algorithm 2 is compared here to the method of [35], which does not explicitly account for external influences and edge sparsity. To this end, the stochastic-gradient descent algorithm (a.k.a. "InfoPath") developed in [35] is run using the generated synthetic data with non-smooth edge variations. Postulating an exponential transmission model, the dynamic network is tracked by InfoPath by performing MLE of the edge transmission rates (see [35] for details of the model and the algorithm). Note that the postulated model therein differs from (3), used here to generate the network data. Fig. 10 depicts the MSE performance of InfoPath compared against FISTA. Apparently, there is an order-of-magnitude reduction in MSE from explicitly modeling external sources of influence and leveraging the attribute of sparsity.

[Plot: MSE (log scale) versus time, 0–1000; curves for SGD, ISTA, FISTA, ADMM.]

Fig. 9. MSE performance of the real-time algorithms versus time. Algorithms 3 (real-time FISTA) and 4 (SGD), as well as inexact versions of Algorithm 1 (ISTA) and the ADMM solver in [3], are compared.

[Plot: MSE (log scale) versus time, 0–1000; curves for FISTA, InfoPath.]

Fig. 10. MSE performance evolution of the pseudo real-time FISTA (Algorithm 2) compared with the InfoPath algorithm in [35].

B. Real data

Dataset description. The real data used was collected during a prior study by monitoring blog posts and news articles for memes (popular textual phrases) appearing within a set of over 3.3 million websites [35]. Traces of information cascades were recorded over a period of one year, from March 2011 to February 2012; the data is publicly available from [21]. The time when each website mentioned a specific news item was recorded as a Unix timestamp in hours (i.e., the number of hours since midnight on January 1, 1970). Specific globally popular topics during this period were identified, and cascade data for the top 5,000 websites that mentioned memes associated with them were retained. The real-data tests that follow focus on the topic "Kim Jong-un", the current leader of North Korea, whose popularity rose during the observation period following the death of his father and predecessor.

The data was first pre-processed and filtered so that only (significant) cascades that propagated to at least 7 websites were retained.


[Plot: total number of edges versus time (weeks, 0–45), annotated with three events: Kim Jong-un appointed as vice chairman of the military commission; death of Kim Jong-il; Kim Jong-un becomes ruler of N. Korea.]

Fig. 11. Plot of the total number of inferred edges per week.

This reduced the dataset significantly, to the 360 most relevant websites over which 466 cascades related to "Kim Jong-un" propagated. The observation period was then split into T = 45 weeks, and each time interval was set to one week. In addition, the observation time-scale was adjusted to start at the beginning of the earliest cascade.
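A minimal sketch of this significance filter (the cascade container and its layout are assumptions made for illustration):

```python
def filter_cascades(cascades, min_sites=7):
    """Keep only cascades that propagated to at least `min_sites` distinct
    websites; `cascades` maps a meme id to its (website, timestamp) mentions."""
    return {meme: hits for meme, hits in cascades.items()
            if len({site for site, _ in hits}) >= min_sites}
```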

Matrix Y_t was constructed by setting its entry y_ic^t to the time when website i mentioned phrase c, provided this occurred during the span of week t; otherwise, y_ic^t was set to a large number, 100 t_max, where t_max denotes the largest timestamp in the dataset. Typically, the entries of matrix X capture prior knowledge about the susceptibility of each node to each contagion. For instance, the entry x_ic could denote the online search rank of website i for a search keyword associated with contagion c. In the absence of such real data, the entries of X were generated randomly from a uniform distribution over the interval [0, 0.01].

Experimental results. Algorithm 2 was run on the real data with β = 0.9 and λ_t = 100. Fig. 12 depicts circular drawings of the inferred network at t = 10, t = 30, and t = 40 weeks. Little was known about Kim Jong-un during the first 10 weeks of the observation period. However, speculation about the possible successor of the dying North Korean ruler, Kim Jong-il, rose until his death on December 17, 2011 (week 38). He was succeeded by Kim Jong-un on December 30, 2011 (week 40). The network visualizations show an increasing number of edges over the 45 weeks, illustrating the growing interest of international news websites and blogs in the new ruler. Unfortunately, the observation horizon does not extend beyond T = 45 weeks. A longer span of data would have been useful to investigate the rate at which global news coverage of the topic eventually subsided.
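To make the construction of Y_t and X described above concrete, a minimal sketch follows; the container `mentions`, the week-indexing convention (hours re-referenced so the earliest cascade starts at 0), and all names are assumptions for illustration:

```python
import numpy as np

def build_week_matrix(mentions, week, N, C, t_max):
    """Assemble the N x C matrix for week `week`: entry (i, c) holds the
    mention time (in hours) of meme c by website i if it fell within that
    week, and the placeholder 100 * t_max otherwise."""
    Y = np.full((N, C), 100.0 * t_max)
    start, end = week * 168.0, (week + 1) * 168.0  # one week = 168 hours
    for (i, c), ts in mentions.items():
        if start <= ts < end:
            Y[i, c] = ts
    return Y

# External-influence matrix: uniform surrogate in lieu of, e.g., search ranks.
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 0.01, size=(360, 466))
```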

Fig. 11 depicts the time evolution of the total number of edges in the inferred dynamic network. Of particular interest are the weeks during which: i) Kim Jong-un was appointed as the vice chairman of the North Korean military commission; ii) Kim Jong-il died; and iii) Kim Jong-un became the ruler of North Korea. These events were the topics of many online news articles and political blogs, an observation that is reinforced by the experimental results shown in the plot.

VI. CONCLUDING SUMMARY

A dynamic SEM was proposed in this paper for network topology inference, using timestamp data for the propagation of contagions typically observed in social networks. The model explicitly captures both topological influences and external sources of information diffusion over the unknown network. Exploiting the edge sparsity typical of large networks, a computationally efficient proximal gradient algorithm with well-appreciated convergence properties was developed to minimize a suitable sparsity-regularized exponentially-weighted LS estimator. Algorithmic enhancements were proposed that pertain to accelerating convergence and performing the network topology inference task in real time. In addition, reduced-complexity stochastic-gradient iterations were outlined and shown to attain worthwhile performance.

A number of experiments conducted on synthetically generated data demonstrated the effectiveness of the proposed algorithms in tracking dynamic and sparse networks. Comparisons with the InfoPath algorithm revealed a marked improvement in MSE performance, attributed to the explicit modeling of external influences and the leveraging of edge sparsity. Experimental results on a real dataset focused on the current ruler of North Korea showed a sharp increase in the number of edges between media websites, in agreement with the media frenzy that followed his ascent to power in 2011.

The present work opens up multiple directions for exciting follow-up work. Future and ongoing research includes: i) investigating the conditions for identifiability of sparse and dynamic SEMs, as well as their statistical consistency properties tied to the selection of λ_t; ii) formally establishing the convergence of the (inexact) real-time algorithms in a stationary network setting, and tracking their MSE performance under simple models capturing the network variation; iii) devising algorithms for MLE of dynamic SEMs and comparing their performance with the LS alternative of this paper; iv) generalizing the SEM using kernels or suitable graph similarity measures to enable network topology forecasting; and v) exploiting the parallel structure of the algorithms to devise MapReduce/Hadoop implementations scalable to million-node graphs.

REFERENCES

[1] D. Angelosante, J. A. Bazerque, and G. B. Giannakis, "Online adaptive estimation of sparse signals: Where RLS meets the ℓ1-norm," IEEE Trans. Signal Process., vol. 58, pp. 3436–3447, Jul. 2010.

[2] D. Angelosante and G. B. Giannakis, "Sparse graphical modeling of piecewise-stationary time series," in Proc. of Intern. Conf. on Acoustics, Speech and Signal Processing, Prague, Czech Republic, May 2011.

[3] B. Baingana, G. Mateos, and G. B. Giannakis, "Dynamic structural equation models for tracking topologies of social networks," in Proc. of 5th Intern. Workshop on Computational Advances in Multi-Sensor Adaptive Processing, Saint Martin, Dec. 2013.

[4] J. A. Bazerque, B. Baingana, and G. B. Giannakis, "Identifiability of sparse structural equation models for directed and cyclic networks," in Proc. of Global Conf. on Signal and Info. Processing, Austin, TX, Dec. 2013.

[5] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, pp. 183–202, Jan. 2009.

[6] X. Cai, J. A. Bazerque, and G. B. Giannakis, "Gene network inference via sparse structural equation modeling with genetic perturbations," PLoS Comp. Biology, vol. 9, May 2013, e1003068, doi:10.1371/journal.pcbi.1003068.


(a) t = 10   (b) t = 30   (c) t = 40

Fig. 12. Visualization of the estimated networks obtained by tracking the information cascades related to the topic "Kim Jong-un". The abrupt increase in network connectivity can be explained by three key events: i) Kim Jong-un was appointed as the vice chairman of the North Korean military commission (t = 28); ii) Kim Jong-il died (t = 38); and iii) Kim Jong-un became the ruler of North Korea (t = 40).

[7] Y. Chen, Y. Gu, and A. O. Hero III, "Sparse LMS for system identification," in Proc. of Intern. Conf. on Acoustics, Speech and Signal Processing, Taipei, Taiwan, Apr. 2009.

[8] P. L. Combettes and J.-C. Pesquet, "A proximal decomposition method for solving convex variational inverse problems," Inverse Problems, vol. 24, no. 6, pp. 1–27, 2008.

[9] ——, "Proximal splitting methods in signal processing," in Fixed-Point Algorithms for Inverse Problems in Science and Engineering, ser. Springer Optimization and Its Applications, H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, Eds. Springer New York, 2011, pp. 185–212.

[10] I. Daubechies, M. Defrise, and C. D. Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Comm. Pure Appl. Math., vol. 57, pp. 1413–1457, Aug. 2004.

[11] D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World. New York, NY: Cambridge University Press, 2010.

[12] J. Friedman, T. Hastie, and R. Tibshirani, "Sparse inverse covariance estimation with the graphical lasso," Biostatistics, vol. 9, pp. 432–441, Dec. 2007.

[13] A. S. Goldberger, "Structural equation methods in the social sciences," Econometrica, vol. 40, pp. 979–1001, Nov. 1972.

[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. Springer, 2009.

[15] D. Kaplan, Structural Equation Modeling: Foundations and Extensions, 2nd ed. Sage Publications, 2009.

[16] E. D. Kolaczyk, Statistical Analysis of Network Data: Methods and Models. New York, NY: Springer, 2009.

[17] M. Kolar, L. Song, A. Ahmed, and E. P. Xing, "Estimating time-varying networks," Ann. Appl. Statist., vol. 4, pp. 94–123, 2010.

[18] Y. Kopsinis, K. Slavakis, and S. Theodoridis, "Online sparse system identification and signal reconstruction using projections onto weighted ℓ1 balls," IEEE Trans. Signal Process., vol. 59, pp. 936–952, Mar. 2011.

[19] J. Langford, L. Li, and T. Zhang, "Sparse online learning via truncated gradient," J. of Machine Learning Research, vol. 10, pp. 719–743, Mar. 2009.

[20] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker graphs: An approach to modeling networks," J. Machine Learning Research, vol. 11, pp. 985–1042, Mar. 2010.

[21] J. Leskovec, "Web and blog datasets," Stanford Network Analysis Project, 2011. [Online]. Available: http://snap.stanford.edu/infopath/data.html

[22] B. Liu, A. de la Fuente, and I. Hoeschele, "Gene network inference via structural equation modeling in genetical genomics experiments," Genetics, vol. 178, pp. 1763–1776, Mar. 2008.

[23] B. A. Logsdon and J. Mezey, "Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations," PLoS Comp. Biology, vol. 6, Dec. 2010, e1001014, doi:10.1371/journal.pcbi.1001014.

[24] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. of Machine Learning Research, vol. 11, pp. 19–60, Jan. 2010.

[25] M. Mardani, G. Mateos, and G. B. Giannakis, "Dynamic anomalography: Tracking network anomalies via sparsity and low rank," IEEE J. Sel. Topics Signal Process., vol. 7, pp. 50–66, Feb. 2013.

[26] N. Meinshausen and P. Buhlmann, "High-dimensional graphs and variable selection with the lasso," Ann. Statist., vol. 34, pp. 1436–1462, 2006.

[27] S. Meyers and J. Leskovec, "On the convexity of latent social network inference," in Proc. of Neural Information Proc. Sys. Conf., Vancouver, Canada, Feb. 2013.

[28] B. Muthen, "A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators," Psychometrika, vol. 49, pp. 115–132, Mar. 1984.

[29] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," Soviet Mathematics Doklady, vol. 27, pp. 372–376, 1983.

[30] ——, "Smooth minimization of nonsmooth functions," Math. Prog., vol. 103, pp. 127–152, 2005.

[31] N. Parikh and S. Boyd, "Proximal algorithms," Found. Trends Optimization, vol. 1, pp. 123–231, 2013.

[32] J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed. Cambridge University Press, 2009.

[33] I. Ramirez and G. Sapiro, "An MDL framework for sparse coding and dictionary learning," IEEE Trans. Signal Process., vol. 60, pp. 2913–2927, Jun. 2012.

[34] M. G. Rodriguez, D. Balduzzi, and B. Scholkopf, "Uncovering the temporal dynamics of diffusion networks," in Proc. of 28th Intern. Conf. Machine Learning, Bellevue, WA, Jul. 2011.

[35] M. G. Rodriguez, J. Leskovec, and B. Scholkopf, "Structure and dynamics of information pathways in online media," in Proc. of 6th ACM Intern. Conf. on Web Search and Data Mining, Rome, Italy, Dec. 2010.

[36] E. M. Rogers, Diffusion of Innovations, 4th ed. Free Press, 1995.

[37] V. Solo and X. Kong, Adaptive Signal Processing Algorithms: Stability and Performance. Prentice Hall, 1995.

[38] P. Sprechmann, I. Ramirez, G. Sapiro, and Y. Eldar, "C-HiLasso: A collaborative hierarchical sparse modeling framework," IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4183–4198, Sep. 2011.

[39] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, pp. 2479–2493, 2009.

[40] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. Royal Statist. Soc. B, vol. 68, pp. 49–67, 2006.