

Research Article
Mining the Hidden Link Structure from Distribution Flows for a Spatial Social Network

Yanqiao Zheng,1 Xiaobing Zhao,2 Xiaoqi Zhang,1 Xinyue Ye,3 and Qiwen Dai4

1 School of Finance, Zhejiang University of Finance and Economics, China
2 School of Data Science, Zhejiang University of Finance and Economics, China
3 Urban Informatics-Spatial Computing Lab & College of Computing, New Jersey Institute of Technology, USA
4 School of Economics & Management, Guangxi Normal University, China

Correspondence should be addressed to Xiaoqi Zhang; xiaoqizh@buffalo.edu

Received 30 December 2018; Revised 3 March 2019; Accepted 31 March 2019; Published 2 May 2019

Academic Editor: Giulio Cimini

Copyright © 2019 Yanqiao Zheng et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This study aims at developing a non-(semi-)parametric method to extract the hidden network structure from {0, 1}-valued distribution flow data with missing observations on the links between nodes. Such an input data type widely exists in studies of information propagation processes, such as rumor spreading through social media. In that case, a social network does exist as the medium of the spreading process, but its link structure is completely unobservable; therefore, it is important to infer the structure (links) of the hidden network. Unlike previous studies on this topic, which consider only abstract networks, we believe that, apart from the link structure, the different social-economic features and geographic locations of nodes can also play critical roles in shaping the spreading process and have to be taken into account. To uncover the hidden link structure and its dependence on the external social-economic features of the node set, a multidimensional spatial social network model is constructed in this study, with the spatial dimension large enough to account for all influential social-economic factors. Based on the spatial network, we propose a nonparametric mean-field equation to govern the rumor spreading process and apply the likelihood estimator to infer the unknown link structure from the observed rumor distribution flows. Our method turns out to be easily extensible to cover the class of block networks that are useful in most real applications. The method is tested on simulated data and demonstrated on a data set of rumor spreading on Twitter.

1. Introduction

Flow data have been widely studied across disciplines [1–6]. In recent years especially, the development of the internet has made an increasing number of flow data sets publicly available; among them, new types of flows are emerging and attracting more and more attention from scholars [7, 8].

Unlike physical movement data, such as taxi trajectories, information flow data, such as the time series of the retweet status of a class of tweets within a population, do not contain any trajectory-level information, because a user may tweet only after seeing that many friends have done so. In that case, a group of friends jointly contributes to the spreading of the tweet, and it becomes impossible either to identify a single real source or to track the trajectory of retweeting. Therefore, such flow data are no longer stored as a collection of well-defined trajectories; instead, they consist of a time series of distributions of a given kind of information within the entire population. In addition, the distribution flows are highly “context-dependent”, which means that the social-economic factors behind every agent joining the spreading process (such as education, income, and neighborhood) might significantly affect the speed, extent, and coverage of spreading, suggesting that a spatial social network is to be uncovered from the distribution flows.

Of course, the emergence of new types and new features of flow data brings unprecedented opportunities to improve our understanding of interaction patterns between people and thus to enrich the relevant theories. However, the missing observations of trajectory-level information and the addition of social-economic context make it challenging to uncover the agent-to-agent links or, equivalently, the entire hidden interaction network. As a result, it becomes necessary to develop more data-driven approaches tailored to uncovering the hidden spatial social network behind the distribution flows.

Hindawi, Complexity, Volume 2019, Article ID 6902027, 17 pages, https://doi.org/10.1155/2019/6902027

Distribution flows are frequently studied in the field of rumor and/or flu spreading. Existing methods can, in broad terms, be grouped into two classes: the agent-based modelling/simulation/calibration (ABM) techniques [9–15] and the differential equation (DE) based approaches [16–19]. The DE approaches are helpful for deriving qualitative conclusions regarding the steady-state distribution of the spreading process and how the equilibrium depends, in a coarse sense, on the model parameters. However, to guarantee that meaningful qualitative results are achievable, the setup of the differential equations is often oversimplified, which causes a loss of insight into the complex reality. In addition, because an explicit solution is lacking in most cases, it is not possible to apply the DE techniques to fit real data and generate detailed quantitative results. In contrast, ABM approaches are more realistic and better suited to quantitative research on real distribution flows. However, there are still a couple of shortcomings in the existing ABM models [20].

First, ABM often assumes that the spreading process is carried on a network in which nodes represent agents that can potentially spread or be infected with a certain type of object (e.g., a rumor) and edges are the links between agents. A rumor can only be spread between agents linked by edges. Under the ABM framework, this interaction network is supposed to be known and prescribed a priori. A prior network may lose critical information about the interaction patterns of the population [15, 21, 22]. For instance, in the Twitter network, a natural interaction network structure is the one formed by the friendship or followership relation between users, which is also frequently used as the prior network in rumor spreading studies [14, 18]. However, a rumor does not have to follow this network to spread [23]. In fact, the retweet action of big-name users is likely to be visible through other channels, such as TV shows and newspapers, to users who are not linked to them in the friendship network. Therefore, spreading between a big-name user and an ordinary user is still possible even if, counting only the friend or follower relation, they are not linked at all. The existence of hidden links makes the prior network fail to capture all structural features of the interactions; a data-driven, or posterior, network would help to overcome this issue.

Second, given the prior network, ABM assumes that spreading occurs through interaction mechanisms between two randomly picked agents. Widely used interaction mechanisms include the independent cascade model, the linear threshold model, and so on [24–26]. These mechanisms are often parametrized and assumed homogeneous for all agents; i.e., the mechanism is determined by a set of parameters that are constant and invariant across agents. In reality, both the relative position of an agent within the network, such as its degree, centrality, and betweenness [17, 18, 27], and many social-economic factors external to the network, such as geographic location, social status, education level, and wealth [22, 27], can drastically affect the likelihood that an agent gets infected by the rumor. But this heterogeneity among agents is often missing from the standard ABM framework.

To resolve the above issues, we propose a novel and completely data-driven modelling approach to characterize the hidden interaction network and the spreading process. Our study contributes to the existing literature in the following aspects.

First, we consider the interaction network as a weighted multidimensional spatial social network, which is an extension of the standard spatial network; the nodes in the network are embedded into a multidimensional feature space $\mathbb{R}^p$. The weighted edge between nodes is considered as a continuous function on $\mathbb{R}^p \times \mathbb{R}^p$. Within such a network, the value of the edge weight function can depend on the features of both the start node and the end node, so it fully respects the heterogeneity of nodes and its effect on shaping the spreading process and the distribution flow.

Second, we link the interaction network with the distribution flows through the classical mean-field models [9, 16–18]; the law of distribution transition is realized by a kernel operator whose kernel function is given by the edge weight function. Such a construction allows the infection status of a given node to depend on all other nodes in the network in a smooth manner, which avoids the arbitrariness of distinguishing the impact of neighbor and nonneighbor nodes while also facilitating the inclusion of the context information embedded in the spatial social network into the analysis of spreading.

Third, we adopt the kernel smoothing technique and nonparametric likelihood estimation from statistics [28, 29] to fit our model to real distribution flows, where the entire edge weight function is supposed to be unknown and needs to be estimated from real-world distribution flow data. Its nonparametric nature makes our method a powerful tool of information mining for distribution flow data. Finally, the widely used block models [30–33] can be easily incorporated into our framework, which helps to better uncover hidden social-economic connections between individuals from distribution flows.

The paper is organized as follows. In Section 2, we give an overview of existing methods of network estimation. Section 3 formally presents the setup of our method, including the definition of the feature space network, the mean-field models and their simulation techniques, and the design of our likelihood estimators. Section 4 validates the effectiveness of our estimators of the hidden network through synthetic data and numerical experiments. Section 5 applies our method to a distribution flow data set on the spreading of information on Twitter relevant to the 2017 “Unite the Right rally” event.

2. Relevant Methods

The method proposed in this paper is essentially a network estimation tool, and network estimation is a long-standing topic in many different fields.

In studies of agent-based models (ABM), simulation-based estimation is usually adopted to calibrate the unknown parameters involved in the model setup [13–15, 18, 34, 35]. Simulation-based estimation is efficient for the estimation of ABMs: as it is often impossible to derive an analytic expression for the standard error functions in the ABM setting, simulation can help generate an empirical version of the error function and facilitate the application of the standard ordinary least squares (OLS) and maximum likelihood (ML) estimation strategies. However, simulation-based estimation is more frequently applied to parametric ABMs, where only a finite-dimensional parameter vector is to be estimated; it is rarely used to estimate the hidden network structure, as the unknown network is essentially nonparametric, which is less tractable than parametric models. To the best of our knowledge, the only exceptions are Grazzini and Richiardi [35] and Kukacka and Barunik [36], in which the interaction mechanism when two agents meet is allowed to include a nonparametric component, and the kernel smoothing method and nonparametric likelihood (or least squares) estimators are applied to model estimation. However, Grazzini and Richiardi [35] and Kukacka and Barunik [36] neither include the interaction network between agents in their analysis nor resolve the model identifiability issue; thus, further exploration is needed in this direction.

The other related works deal with link prediction by stochastic-network models. In this field, nonparametric techniques are more often adopted to infer hidden features of a stochastic network [23, 31, 32, 37, 38]. Lu and Zhou [31] review the mainstream heuristic algorithms for forecasting the missing links within a partially observed network. Bickel et al. [39], from the perspective of statistical inference, summarize and validate the application of the variational expectation maximization (VEM) algorithm to infer the probability that a link exists between two nodes from observed edge data. Matias et al. [38] extend the VEM method to deal with the future occurrence probability of edges given a dynamic linked network and the historical edge data; this extended method can handle the case where the evolution of the occurrence probability depends nonparametrically on an unknown hazard function. All these methods were developed under the common assumption that the edge information of at least part of the network has already been observed, which is possible for trajectory data but not for distribution flows. Thus, a further extension is needed to handle the case in which all edge data are missing.

In the physics literature, the task of detecting the hidden network link structure from node-level time-series data is phrased as “network reconstruction”. Taking distribution flows as the input, two outstanding network reconstruction methodologies are directly comparable to ours. One is based on the compressive sensing technique, as proposed in Shen et al. [40]; the other is based on the combination of likelihood estimation and the mean-field approximation technique, as discussed in Roudi and Hertz [41]. The basic idea in Shen et al. [40] is to convert the network reconstruction problem into a classical convex optimization problem with linear constraints, the so-called compressive sensing (CS) problem. In the CS problem, the linear constraints come from the transition probability of nodes within the network from the uninfected state to the infected state, while the objective function arises from the sparsity assumption regarding the network link structure. Unlike the applications of the CS approach to network reconstruction from continuous time-series data [42–44], where the feature variables associated with every node are directly observable, in the case of distribution flows the key variable, the transition probability, is not observable from the data. Therefore, it has to be calculated so as to form the required linear constraints. Inferring the transition probability from the {0, 1}-valued distribution flow data requires a stationarity assumption on the underlying model, which is too restrictive in many applications. For instance, in the spreading of a virus, an agent might die immediately after it is infected, in which case the infected agent is censored in the sense that its infection status is constantly one from the time of infection onward. When censored agents exist in the network, stationarity of the transition is impossible and the CS framework in Shen et al. [40] is no longer applicable.

The other problem of the CS framework is its inability to handle the spatial heterogeneity among different nodes. As we have highlighted, education, wealth, and many other social-economic factors can play critical roles in determining the link strength among people and therefore affect the information spreading dynamics, so modelling the dependence of the hidden link structure on those social-economic factors is necessary in studies of social networks. The inclusion of social-economic factors introduces heterogeneity among nodes, which makes it challenging to identify which nodes are relatively homogeneous and can be grouped together. In the CS framework, grouping different nodes is the premise for calculating the transition probability. In an abstract network, all nodes are homogeneous and the group can simply be taken as the set of all nodes, as done in Shen et al. [40], while in a spatial network where heterogeneity is widespread, such a simple grouping trick is meaningless. How to extend the CS framework to spatial social networks is a difficult question that requires extensive further study.

The underlying reason that restricts the CS framework is its reliance on the unobservable transition probability. That restriction can be effectively resolved by applying the likelihood technique, as suggested in Roudi and Hertz [41]. The merit of the likelihood-based approach is that it can compute the unknown transition probability simultaneously with the other model parameters. But the computation usually takes too much time: because there is no explicit solution to the first-order condition of the maximum likelihood problem, a numerical solution is required. To make the computation easier, a mean-field approximation technique is presented in Roudi and Hertz [41], which can increase the computation speed. However, the approximation only works when all link strengths are close to zero, which restricts its usefulness in many applications to social networks. Moreover, the current version of the approximation technique in Roudi and Hertz [41] still assumes an abstract network structure and allows no dependence of the link strength on social-economic factors; it is unclear whether the approximation can be extended to the reconstruction of spatial social networks. Finally, Roudi and Hertz [41] are concerned only with the situation in which the number of nodes ($N$) is relatively small and the computational complexity comes mainly from numerically solving the maximum likelihood problem. But when $N$ is large, the computational complexity is dominated by the matrix multiplication for the $N \times N$ adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires this matrix multiplication, its speed-up effect for giant networks may not be that significant. More exploration of the fast reconstruction of giant spatial social networks is needed.

3. Model Setup

3.1. Feature Space Network. We consider a weighted multidimensional spatial social network, where the nodes of the network are considered as elements of a $p$-dimensional Euclidean space $\mathbb{R}^p$ and every dimension of $\mathbb{R}^p$ is interpreted as a feature of nodes; thus, $\mathbb{R}^p$ is interpretable as a feature space. Edges between nodes are assumed to depend on the features of nodes in a smooth way; i.e., the edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., a function that is smooth at all points except those contained in a zero-measure set), denoted by $E : \mathbb{R}^p \times \mathbb{R}^p \to [0, 1]$, where, without loss of generality, the edge weight between two nodes is restricted to the unit interval. Such a specification admits a stochastic-network interpretation of our model: the weight can be thought of as the probability that two nodes share an edge. Since the nodes of the network may not be evenly distributed within the entire space $\mathbb{R}^p$, without loss of generality we assume that the distribution of nodes is characterized by a probability measure $F$ on $\mathbb{R}^p$, and $F$ is supposed to be known from the data. In sum, the $p$-dimensional spatial network can be recorded as $G(\mathbb{R}^p, E, F)$, or simply $G$ when there is no ambiguity regarding its node space, node distribution, and edge function.
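To make the triple $G(\mathbb{R}^p, E, F)$ concrete, the following Python sketch builds a toy instance. Everything specific in it is an assumption chosen for illustration rather than something taken from the paper: $F$ is taken to be the standard normal distribution on $\mathbb{R}^2$, the edge function is a Gaussian similarity, and the names `edge_weight`, `sample_nodes`, and `sample_links` are hypothetical. It also illustrates the stochastic-network reading, in which a link between two nodes is present with probability $E(x, y)$.

```python
import math
import random

def edge_weight(x, y, scale=1.0):
    """Hypothetical smooth edge function E(x, y) with values in [0, 1]."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * scale ** 2))

def sample_nodes(n, p, rng):
    """Draw n nodes from F, assumed here to be the standard normal on R^p."""
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

def sample_links(nodes, rng):
    """Stochastic-network reading: link (i, j) exists with probability E(x_i, x_j)."""
    n = len(nodes)
    return [[rng.random() < edge_weight(nodes[i], nodes[j]) for j in range(n)]
            for i in range(n)]

rng = random.Random(0)            # fixed seed so the toy network is reproducible
nodes = sample_nodes(50, 2, rng)  # 50 nodes embedded in a 2-dimensional feature space
links = sample_links(nodes, rng)  # one random realization of the link structure
```

This particular edge function is symmetric in its two arguments; the model itself only requires $E$ to be (almost everywhere) smooth on $\mathbb{R}^p \times \mathbb{R}^p$, so asymmetric choices are equally admissible.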

There are several advantages to assuming that the spreading process and distribution flows occur within $G$. First, the embedding of the node set into the feature space $\mathbb{R}^p$ allows us to characterize feature information of nodes that is external to the network structure [21, 22, 27], which is usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors, such as the intensity of population gathering in a set of locations, can significantly improve the forecasting of illness spreading among residents. Viboud et al. [45] report similar findings. Second, allowing nodes to be unevenly distributed within the feature space lets us include more general networks in the analysis. For instance, by a proper choice of the measure $F$ (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes sitting in the infinite feature space $\mathbb{R}^p$; this allows us to include most networks met in practice. Finally, allowing the edge weight to depend smoothly on the features of both the flow-in and flow-out nodes makes it possible to incorporate background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is computational efficiency. In fact, when a network consists of a giant number of nodes, even a simple summation can take a long time and a huge amount of memory; but when edges vary smoothly with nodes, it becomes possible to do the calculation only on a small set of nodes, and the global features of the edges can then be inferred from the result on that relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we concentrate on the spatial network $G(\mathbb{R}^p, E, F)$ instead of a more general concept of network.

3.2. Mean-Field Models. To model spreading processes within a spatial network $G(\mathbb{R}^p, E, F)$, we follow the convention in the rumor spreading literature [10, 17] and adopt the common assumption that a rumor can spread from a node $x$ to another node $y$ if and only if (1) the initial node $x$ has been infected with the rumor, recorded as the event $I(x) = 1$; (2) there is an edge between them, or equivalently $E(x, y) > 0$; and (3) when conditions (1) and (2) hold, whether or not the spreading actually happens is purely random, with probability $r$. Different spreading models impose different requirements on the probability $r$. In the current study, we adopt the mean-field model to determine $r$, as suggested in most previous studies. Formally, at every fixed time $t$, the probability $r(x, t)$ of node $x \in \mathbb{R}^p$ being infected is determined by the following mean-field equation:

$$\frac{d r(x, t)}{d t} = \bigl(1 - r(x, t)\bigr) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y, t)\, dF(y). \tag{1}$$

The interpretation of (1) is that, at time $t$, the temporal variation rate of the probability that node $x$ is infected (represented by $dr(x, t)/dt$) is proportional to the probability that node $x$ has not yet been infected by time $t$ (represented by $1 - r(x, t)$), and the proportion is determined through a weighted sum of the probabilities of all other nodes in the network having been infected by $t$. The weight function describes the strength of the connection between nodes $x$ and $y$ and thus can be formulated as the edge function $E$. Using the classical results on mean-field equations [46–50], it can be verified that the infection probability $r(x, t)$ in (1) is exactly equal to the probability of $I(x, t) = 1$ for a given right-continuous mean-field point process $I$ satisfying the following:

$$\mathbb{E}\bigl( I(x, t) - I(x, t-) \mid I(x, t-) = 0 \bigr) = \int_{\mathbb{R}^p} E(x, y)\, I(y, t-)\, dF(y), \tag{2}$$

where $\mathbb{E}$ on the left-hand side denotes expectation (as distinct from the edge function $E$) and $I(x, t-)$ is the left limit of the process $I(x, \cdot)$. The interpretation of (2) is more straightforward than that of (1): (2) points out that the average rate at which node $x$ becomes infected is contributed by all those nodes that (1) have a connection to $x$ and (2) have been infected by the current time. These two conditions are often imposed in the literature.
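As a consistency check on the mean-field equation (1), consider a homogeneous special case. This case is an assumption made here purely for illustration and is not one of the settings studied in the paper: the edge function is a constant, $E(x, y) \equiv \beta$ with $\beta \in (0, 1]$, and the initial infection probability is the same for every node, $r(x, 0) \equiv r_0 \in (0, 1)$. The solution then does not depend on $x$, and because $F$ is a probability measure the integral in (1) reduces to $\beta\, r(t)$, so that

$$\frac{d r(t)}{d t} = \beta\, \bigl(1 - r(t)\bigr)\, r(t), \qquad r(t) = \frac{r_0\, e^{\beta t}}{1 - r_0 + r_0\, e^{\beta t}}.$$

In this case (1) is the logistic growth equation: the infection probability increases monotonically and tends to 1 as $t \to \infty$, which matches the reading of $1 - r$ as the share of nodes still available to be infected.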

Let $r$ be a function satisfying the functional differential equation (1), and denote by $f$ the density or mass function associated with the probability measure $F$. Then the event that a given node $x$ is observed at time $t$ and its infection status is observed to be infected has the probability density

$$\mathbf{p}_1(x, t) = f(x)\, r(x, t); \tag{3}$$

in contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given by

$$\mathbf{p}_0(x, t) = f(x)\, \bigl(1 - r(x, t)\bigr). \tag{4}$$

Suppose that, at a given time $t$, the infection status of a set of randomly picked nodes $\mathcal{N} \subset \mathbb{R}^p$ is observable and represented as

$$\mathcal{O}_t = \{ I(x, t) : x \in \mathcal{N} \}, \tag{5}$$

with $I(x, t) = 0$ meaning not infected and $I(x, t) = 1$ meaning infected. Then, using (3) and (4), the likelihood function of the observations $\mathcal{O}_t$ can be written as

$$L(\mathcal{O}_t, E) = \prod_{x \in \mathcal{N}} \bigl( f(x)\, r(x, t) \bigr)^{I(x, t)} \bigl( f(x)\, (1 - r(x, t)) \bigr)^{1 - I(x, t)}, \tag{6}$$

where we include the edge function $E$ in the likelihood because it affects $L$ by determining the functional form of $r$. Maximizing (6) yields the classical maximum likelihood (ML) estimator of $E$.

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing. In the study of spreading processes, only distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $E$, are not observable and thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $E$, given the observed distribution flows $\{\mathcal{O}_{t_i} : i = 1, \dots, T\}$, $t_1 < \cdots < t_T$, on a sequence of times. NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. It applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly the case we need to handle: because the distribution function $r$ in (6) is the solution to the functional differential equation (1), there is no clean analytic expression available for it.
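Although $r$ has no closed form, the likelihood (6) is simple to evaluate once values of $r$ at the observed nodes are available, for instance from a numerical solver. The Python sketch below computes the logarithm of (6); the function name `log_likelihood`, its argument layout, and the clamping constant `eps` are assumptions of this illustration, not part of the paper.

```python
import math

def log_likelihood(status, density, infect_prob, eps=1e-12):
    """Logarithm of (6) for one observation time.

    status[k]      -- observed I(x_k, t), either 0 or 1
    density[k]     -- f(x_k), density of the node distribution F at x_k
    infect_prob[k] -- r(x_k, t), infection probability from the mean-field model
    eps            -- clamp that keeps log() finite (an assumption of this sketch)
    """
    total = 0.0
    for i_x, f_x, r_x in zip(status, density, infect_prob):
        r_x = min(max(r_x, eps), 1.0 - eps)
        # log(f * r) if infected, log(f * (1 - r)) otherwise
        total += math.log(f_x) + (math.log(r_x) if i_x == 1 else math.log(1.0 - r_x))
    return total
```

Since $f$ does not depend on $E$, the `math.log(f_x)` terms only shift the log-likelihood by a constant and do not affect the maximizer over $E$.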

However, our task differs from the situation originally discussed in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), by contrast, the likelihood depends on the nonparametric edge function $E$. In this situation, one extra kernel smoothing step is needed to approximate $E$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted at the level of random variables, while in our case simulation is at the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, because the model setup is nonparametric, model identifiability has to be checked in order to guarantee the correctness of the resulting estimates.

Because of the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use $K^p$ to denote the $p$-dimensional standard Gaussian kernel function and write $K^p_h(x) = K^p(x/h)/h^p$ for a positive constant $h$).

Step 1. Select a constant $dt > 0$ and large positive integers $M_1$ and $M_2$. ($dt$ is the length of every time step used for numerically solving the functional differential equation (1); $M_1$ and $M_2$ are the numbers of random samples drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function.)

Step 2. Draw $M_1$ random samples $x_1, \dots, x_{M_1} \in \mathbb{R}^p$ from the distribution $F$ and $M_2$ random samples $w_1, \dots, w_{M_2} \in \mathbb{R}^p \times \mathbb{R}^p$ from the product measure $F \otimes F$.

Step 3. Given $e_1, \dots, e_{M_2} \in [0, 1]$, construct the function $E$ as follows:

\[ E(w) = \frac{\sum_{i=1}^{M_2} K^{2p}_{h_1}(w - w_i) \cdot e_i}{\sum_{j=1}^{M_2} K^{2p}_{h_1}(w - w_j)}. \tag{7} \]

Step 4. Given $t_i$, let $\mathcal{O}_{t_i} = \{I(y_1, t_i), \dots, I(y_M, t_i)\}$ denote the observation set at time $t_i$, whose cardinality is $M$, and construct the function $r(\cdot, t_i)$ as follows:

\[ r(y, t_i) = \frac{\sum_{l=1}^{M} K^{p}_{h_2}(y - y_l) \cdot I(y_l, t_i)}{\sum_{j=1}^{M} K^{p}_{h_2}(y - y_j)}. \tag{8} \]
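Both (7) and (8) are kernel-weighted averages of the same form. The following is a minimal sketch of that smoother in plain Python, assuming the Gaussian kernel defined above; the function names `gaussian_kernel` and `kernel_smooth` and the toy data are illustrative and not part of the original algorithm.

```python
import math

def gaussian_kernel(u, h):
    """Scaled p-dimensional standard Gaussian kernel K^p_h(u) = K^p(u / h) / h^p."""
    p = len(u)
    sq = sum((ui / h) ** 2 for ui in u)
    return math.exp(-0.5 * sq) / ((2.0 * math.pi) ** (p / 2.0) * h ** p)

def kernel_smooth(point, centers, values, h):
    """Kernel-weighted average of `values` at `point`, the common form of (7) and (8)."""
    weights = [gaussian_kernel([a - b for a, b in zip(point, c)], h) for c in centers]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical toy data: three nodes in R^2 with 0/1 infection indicators.
nodes = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
status = [1, 0, 1]
r_hat = kernel_smooth((0.5, 0.5), nodes, status, h=0.2)
```

For (7), `centers` would be the sampled pairs $w_i$ in $\mathbb{R}^{2p}$ and `values` the parameters $e_i$; for (8), they would be the observed nodes $y_l$ and their indicators $I(y_l, t_i)$.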

Step 5. Solve the mean-field equation (1) over the interval $[t_i, t_{i+1})$ at the set of sample points $x_1, \dots, x_{M_1}$ drawn in Step 2 by Euler's method with time step $dt$, subject to the initial condition $r(\cdot, t_i)$, as follows:

\[ r(x_j, t_i + (k+1) \cdot dt) = r(x_j, t_i + k \cdot dt) + \bigl(1 - r(x_j, t_i + k \cdot dt)\bigr) \cdot \frac{dt}{M_1} \cdot \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t_i + k \cdot dt), \tag{9} \]

where $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor$ and $\lfloor a \rfloor$ is the greatest integer not exceeding $a$.

Step 6. For the observation set $\mathcal{O}_{t_{i+1}} = \{I(y_1, t_{i+1}), \dots, I(y_{M'}, t_{i+1})\}$ at $t_{i+1}$ with cardinality $M'$, generate the simulated density at the sample nodes $y_l$, $l = 1, \dots, M'$, as follows:

\[ r(y_l, t_{i+1}) = \frac{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j) \cdot r(x_j, t_{i+1})}{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j)}, \tag{10} \]

and construct the simulated likelihood function as follows

\[ \hat{L}(\mathcal{O}_{t_{i+1}}; e_1, \dots, e_{M_2}) = \prod_{l=1}^{M'} \bigl(f(y_l)\, r(y_l, t_{i+1})\bigr)^{I(y_l, t_{i+1})} \cdot \bigl(f(y_l)\,(1 - r(y_l, t_{i+1}))\bigr)^{1 - I(y_l, t_{i+1})}. \tag{11} \]
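Steps 5 and 6 can be illustrated with a short sketch: the Euler update (9) on the $M_1$ sampled nodes, followed by the logarithm of the simulated likelihood (11). This is a simplified illustration under stated assumptions, not the authors' implementation: the kernel smoothing in (10) is omitted (the simulated $r$ is assumed to be already evaluated at the observed nodes), the node density $f$ defaults to 1, and the small clipping constant `eps` is an addition to keep the logarithm finite.

```python
import math

def euler_propagate(r0, E, dt, n_steps):
    """Iterate (9): r_j <- r_j + (1 - r_j) * (dt / M1) * sum_l E[j][l] * r_l."""
    r = list(r0)
    m1 = len(r)
    for _ in range(n_steps):
        drive = [sum(E[j][l] * r[l] for l in range(m1)) / m1 for j in range(m1)]
        r = [r[j] + (1.0 - r[j]) * dt * drive[j] for j in range(m1)]
    return r

def log_likelihood(r_obs, indicators, f_obs=None, eps=1e-12):
    """Log of (11): sum over nodes of log f + I * log r + (1 - I) * log(1 - r)."""
    if f_obs is None:
        f_obs = [1.0] * len(r_obs)  # assumption: uniform node density
    total = 0.0
    for r, ind, f in zip(r_obs, indicators, f_obs):
        r = min(max(r, eps), 1.0 - eps)  # clipping is an added safeguard
        total += math.log(f) + (math.log(r) if ind else math.log(1.0 - r))
    return total

# Hypothetical two-node example: node 0 infected, node 1 susceptible.
r_next = euler_propagate([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], dt=0.1, n_steps=1)
```

Summing `log_likelihood` over all observation times gives the logarithm of the full-information likelihood, which is then maximized over $e_1, \dots, e_{M_2}$.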


The full-information likelihood function for all observation times can be constructed from (11) in the following way:

\[ \hat{L}^{*}(\{\mathcal{O}_{t_i} : i = 1, \dots, T\}; e_1, \dots, e_{M_2}) = \prod_{i=1}^{T} \hat{L}(\mathcal{O}_{t_i}; e_1, \dots, e_{M_2}). \tag{12} \]

The estimator of the unknown edge function $E$ is obtained by maximizing the simulated full-information likelihood function (12) over $e_1, \dots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e^{*}_1, \dots, e^{*}_{M_2}$ in the way of (7).

Compared with the NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step that draws $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) for the density function $r$: one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. These two steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole node set is used directly as the $M_1$ samples drawn in Step 2. However, when the system has a giant node set (say, millions), a sample size $M_1 \ll M$ can be used to improve computational efficiency. Moreover, the node sets observed at different observation times may not be identical; more often, once a node is tracked to be uninfected at some time $t$, it is regarded as safe and is missing from the tracking at the next few observation times. In this interval-censored situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the "curse of dimensionality" despite its nonparametric nature, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument, and because the number of observed nodes is giant in most real-world applications, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated multiplication of an $M_1 \times M_1$ matrix by an $M_1$-dimensional vector, so the computational complexity is of order $M_1^2$. Although $M_1$ can be taken much smaller than the number of observed nodes ($M$), it still has to increase as $M$ increases. So when $M$ is giant, $M_1$ has to be large as well, and the computational complexity of the entire estimation procedure is dominated by $M_1^2$. In this section we propose a fast algorithm that reduces the computational complexity of (9) to be linear in $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with one other agent randomly picked from its neighbors. In our setting there is no strict notion of "neighbor," but it is still possible to randomly pick one agent from the entire population and count the interaction only between the given agent and its randomly picked partner. Formally, Step 5 in Section 3.3 is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \dots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \dots, x_{M_1}\}$.

Step 5(2). Compute

\[ r(x_j, t + dt) = r(x_j, t) + \bigl(1 - r(x_j, t)\bigr) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\, dt. \tag{13} \]

Step 5(3). Repeat the above two steps for all $t = t_k$, $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$, and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \dots, x_{M_1}$) is replaced with a scalar multiplication, so the resulting computational complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
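The contrast between (9) and (13) can be sketched as follows. The code compares one full Euler step with the average of many single-partner steps, which illustrates that the fast update matches the full update in expectation while each fast step costs only $O(M_1)$. The toy vector `r`, matrix `E`, seed, and repetition count are hypothetical values chosen for illustration.

```python
import random

def full_step(r, E, dt):
    """One Euler step of (9): averages the interaction over all M1 sampled nodes."""
    m1 = len(r)
    return [r[j] + (1.0 - r[j]) * dt * sum(E[j][l] * r[l] for l in range(m1)) / m1
            for j in range(m1)]

def fast_step(r, E, dt, rng):
    """One step of (13): each node interacts with a single randomly picked partner."""
    m1 = len(r)
    out = []
    for j in range(m1):
        l = rng.randrange(m1)  # partner index l(j, t), uniform over the M1 samples
        out.append(r[j] + (1.0 - r[j]) * E[j][l] * r[l] * dt)
    return out

# Hypothetical toy inputs.
r = [0.9, 0.1, 0.5]
E = [[0.0, 0.8, 0.3], [0.8, 0.0, 0.6], [0.3, 0.6, 0.0]]
dt = 0.1
rng = random.Random(0)

exact = full_step(r, E, dt)
n_rep = 20000
avg_fast = [0.0] * len(r)
for _ in range(n_rep):
    step = fast_step(r, E, dt, rng)
    avg_fast = [a + s / n_rep for a, s in zip(avg_fast, step)]
```

In an actual estimation run only `fast_step` would be called, once per time step; the averaging loop is included here solely to show the agreement in expectation.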

Regarding the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i (t_{i+1} - t_i)$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following:

(i) the expectation of the left hand side of (9) is identicalto the expectation of left hand side of (13)

(ii) denote by $\Delta(j, t)$ the increment $\Delta(j, t) = (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\mathrm{cov}(\Delta(j, t), \Delta(j', t') \mid r(x_j, t_i)) \le t_{i+1} - t_i$.

Property (i) and the inequality for $j' \neq j$ in (ii) are quite trivial. For $t_i < t < t' < t_{i+1}$, $\mathrm{cov}(\Delta(j, t), \Delta(j, t') \mid r(x_j, t_i))$ can be decomposed as the sum of the following two components, where $\mathbb{E}$ denotes expectation:

\[
\begin{aligned}
A &= \mathrm{cov}\bigl(\Delta(j, t),\ (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \mathrm{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \mathrm{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) = \mathrm{var}\bigl(r(x_j, t) - r(x_j, t_i) \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{dr(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\
B &= \mathrm{cov}\bigl(\Delta(j, t),\ (r(x_j, t') - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \mathrm{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \mathrm{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \\
&\le \mathbb{E}\bigl(\lvert r(x_j, t') - r(x_j, t) \rvert \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{dr(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned}
\tag{14}
\]

where $\|\cdot\|_{\infty}$ is the $L^{\infty}$ norm of a bounded function. The above inequalities follow directly from the fact that $r$ is bounded by 1 and that its temporal derivative, given by (1), is also uniformly bounded by 1; statement (ii) then follows immediately.

Using properties (i) and (ii) and the law of large numbers, it follows that the difference between the likelihood function constructed from (9) and the one constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two calculations of the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, under the fast algorithm, the choice of $dt$ does not affect the estimation accuracy, so in practice it can be set directly to $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous sections can be further extended to make inference for the block network model. In many applications [33, 38, 39], the existence of a connection between two agents depends only on the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; the edge function can then be decomposed into two components:

(i) the group weight function $E^1 : \mathbb{R}^p \to [0, 1]^Q$;

(ii) the group-level edge weight $E^2$, which is a $Q \times Q$ matrix with each entry valued in $[0, 1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

\[ E(x, y) = E^1(x)^{\top} E^2 E^1(y), \tag{15} \]

where the image of $E^1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ denotes the vector transpose. The group weight function is required to satisfy that, for every $x$ with $E^1(x) = (s_1, \dots, s_Q)$, there exists only one $i \in \{1, \dots, Q\}$ with $s_i > 0$; that is, every node has positive probability of belonging to at most one group, which guarantees that the groups constitute a partition of the node set.
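As a small illustration of (15), the sketch below evaluates $E(x, y) = E^1(x)^{\top} E^2 E^1(y)$ from two group-weight vectors and checks the single-positive-entry requirement. The function names and the 2-group matrix `E2_toy` are hypothetical.

```python
def block_edge(e1_x, e1_y, E2):
    """Evaluate (15): E(x, y) = E1(x)^T E2 E1(y) for group-weight vectors e1_x, e1_y."""
    q = len(E2)
    return sum(e1_x[a] * E2[a][b] * e1_y[b] for a in range(q) for b in range(q))

def has_single_group(e1_x):
    """Check the partition requirement: exactly one entry of E1(x) is positive."""
    return sum(1 for s in e1_x if s > 0) == 1

# Hypothetical 2-group example: x belongs to group 1, y to group 2.
E2_toy = [[0.2, 0.7], [0.1, 0.4]]
w = block_edge([1.0, 0.0], [0.0, 1.0], E2_toy)
```

When both weight vectors are one-hot, the bilinear form reduces to reading off the entry of $E^2$ indexed by the two groups, here `E2_toy[0][1]`.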

Estimating a block network is equivalent to estimating (1) the group weight function $E^1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E^2$, which is the parametric component of the network. The estimation is therefore essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable in this case. The only modification is in Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$; instead, it is applied to generate the estimate of the group weight $E^1$. The hidden weight function $E$ is then constructed from the kernel-smoothed $E^1$ and the given interaction matrix $E^2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups is small and does not depend on the number of nodes, the number of parameters to solve for is only $M_1 Q + Q^2$, whereas it is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with $M_1^2$ (although slowly); when the number of observed nodes is giant, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which significantly improves computational efficiency.

In addition, a block network is much easier to identify than a general fully nonparametric network, as will be discussed in the next section. Finally, under a block network, the equilibrium infection distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is straightforward and hence omitted).

Proposition 1. Denote by $E^1_i(x)$ the projection of the vector $E^1(x)$ onto its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E^1_i(x) > 0\}$, which consists of the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

\[ r(x) = \begin{cases} 0 & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E^2)^n r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\ 1 & \text{else,} \end{cases} \tag{16} \]

where $r(y, t_0)$ is the prescribed initial distribution of infection status, $\mathcal{P}_i$ is the projection of a vector onto its $i$th dimension, and $(E^2)^n$ denotes the $n$th power of the matrix $E^2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infection distributions with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E^2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

\[ \begin{pmatrix} x & x & x & \cdots & x \\ 0 & x & x & \ddots & \vdots \\ 0 & \cdots & x & \cdots & x \\ \vdots & \ddots & 0 & x & x \\ 0 & \cdots & 0 & 0 & x \end{pmatrix} \tag{17} \]

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

\[ r(x) = 1 \text{ iff } x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i \iff r(x, t_0) > 0 \text{ iff } x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i. \tag{18} \]

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E^1, E^2)$ are identifiable. In real applications, however, a spreading process can be observed at most a few times, so $m$ cannot be expected to be large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable; i.e., there exist $E \neq E'$ that lead to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set; i.e., the set $\mathcal{S}_{E_0, r(t_0)} := \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(t_0)}$ varies with the initial infection status $r(\cdot, t_0)$. Formally, $E \in \mathcal{S}_{E_0, r(t_0)}$ if and only if the following holds for all $n = 1, 2, \dots$:

\[ (\mathcal{M}_{1 - r(t_0)} \mathcal{K}_E)^n\, r(\cdot, t_0) \equiv (\mathcal{M}_{1 - r(t_0)} \mathcal{K}_{E_0})^n\, r(\cdot, t_0), \tag{19} \]

where $\mathcal{K}_E$ is a bounded operator over the function space $L^2(\mathbb{R}^p)$, defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x, y)\, g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplication operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the $n$-fold self-composition of an operator.

Equation (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, it can be verified from (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(t_0)}$. On the other hand, if there exists an open set $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infection status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$, it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions on $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. In the special case of block networks, however, a simple identifiability condition can be worked out. For block networks, $(E^1_0, E^2_0)$ is identifiable if and only if there does not exist a pair $(E^1, E^2)$ that differs from the true $(E^1_0, E^2_0)$ but leads to the same likelihood function (6) in the limit case. This in turn holds if and only if, for the true $E^2_0$, the vector space spanned by the family of vectors $\{V_t : t \ge t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{V_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $V_t = (V_{t1}, \dots, V_{tQ})^{\top}$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \dots, Q$, $V_{tq} = \int_{\mathbb{R}^p} E^1_{0q}(x)\, r(x, t)\, dF(x)$, where $E^1_{0q}$ is the $q$th entry of $E^1_0(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

\[ \det\Bigl[\, V_{t_0},\ \ \mathrm{diag}(c - V_{t_0})\, E^2_0\, V_{t_0},\ \ \dots,\ \ \bigl(\mathrm{diag}(c - V_{t_0})\, E^2_0\bigr)^{Q-1} V_{t_0} \Bigr] \neq 0, \tag{20} \]

where $c$ is another $Q$-dimensional column vector $(c_1, \dots, c_Q)^{\top}$ determined by the true $E^1_0$ function such that $c_q = \int_{\mathbb{R}^p} E^1_{0q}(x)\, dF(x)$ for $q = 1, \dots, Q$, and $\mathrm{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with the given vector on its diagonal. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically," in the sense that the set of $E^2$'s that force the determinant in (20) to be identically 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0, 1]^{Q \times Q}$, and, for those $E^2$'s for which the determinant is not identically 0, the set of $V_{t_0}$ that force it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0, 1]^Q$. Therefore, (20) holds for almost all $E^2$ and $V_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
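Condition (20) can be checked numerically for a given $V_{t_0}$, $c$, and $E^2$. The sketch below builds the matrix whose columns are $V_{t_0}, A V_{t_0}, \dots, A^{Q-1} V_{t_0}$ with $A = \mathrm{diag}(c - V_{t_0}) E^2$ and evaluates its determinant by Gaussian elimination. It is an illustration only; the helper names and the two 2-block inputs are hypothetical, and in floating point a tolerance rather than an exact zero test would normally be used.

```python
def mat_vec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    n = len(M)
    A = [list(row) for row in M]
    d = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda i: abs(A[i][c]))
        if A[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            d = -d
        d *= A[c][c]
        for i in range(c + 1, n):
            factor = A[i][c] / A[c][c]
            for k in range(c, n):
                A[i][k] -= factor * A[c][k]
    return d

def identifiability_det(v0, c, E2):
    """Determinant in (20): columns v0, A v0, ..., A^(Q-1) v0, A = diag(c - v0) E2."""
    q = len(v0)
    A = [[(c[a] - v0[a]) * E2[a][b] for b in range(q)] for a in range(q)]
    cols = [list(v0)]
    for _ in range(q - 1):
        cols.append(mat_vec(A, cols[-1]))
    return det([[cols[k][a] for k in range(q)] for a in range(q)])

# Hypothetical 2-block inputs: one generic case and one degenerate case.
d_generic = identifiability_det([0.2, 0.1], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
d_degenerate = identifiability_det([0.1, 0.1], [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])
```

In the degenerate case $A$ is a multiple of the identity, so all columns are parallel and the determinant vanishes, which is the kind of measure-zero exception described above.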

The "almost" identifiability of block networks guarantees that, in most cases, when the number of observed nodes is large and the observation times are dense, the $E^1$ and $E^2$ estimated by NPSML converge asymptotically to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows directly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. The theoretical validity of the estimators developed in the previous sections is thus established.

Remark 2 (sparsity). Although complete identifiability is in general hard to achieve for both the general network and the block network, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E^2\|_2^2 = \sum_{ij} (E^2_{ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique: as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space, and by compactness and convexity there always exists a unique $E$ (or $E^2$) whose $L^2$ distance to the origin attains the minimum.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes drawn at random from the unit cube $[0, 1)^2$, so that the nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ decreases linearly with the standard Euclidean distance between two nodes, i.e.,

\[ E(x, y) = 1 - \sqrt{\frac{\langle x - y,\ x - y \rangle}{2}}. \tag{21} \]

Example 2 (block network). Set $Q = 3$; the block membership function $E^1$ satisfies

\[ E^1(x, y) = \begin{cases} (1, 0, 0) & \text{if } \dfrac{x + y}{2} < \dfrac{1}{3}, \\[4pt] (0, 1, 0) & \text{if } \dfrac{1}{3} \le \dfrac{x + y}{2} < \dfrac{2}{3}, \\[4pt] (0, 0, 1) & \text{else.} \end{cases} \tag{22} \]

The matrix $E^2$ is given as follows:

\[ E^2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23} \]

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, with the infected nodes picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initially infected subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \dots, 10\}$.

If all nodes were included in the computation of the NPSML estimator, there would in principle be a $40000\ (= 200 \times 200)$-dimensional parameter space to search for the fully nonparametric network in Example 1 and a $609\ (= 200 \times 3 + 3 \times 3)$-dimensional parameter space for the block network in Example 2, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computational load, we generate another $M_1 = 20$ nodes from the uniform distribution, which are used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1, so there are 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed, there are only $69\ (= 20 \times 3 + 3 \times 3)$ parameters to optimize. As for the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^{k} \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space; thus $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the observed $r$ on 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.

Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. Axis labels: "Estimated edge weight" and "True edge weight"; legend: "Est vs True" and "y = x".
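The parameter counts and kernel widths quoted in this section follow from simple arithmetic, which the short sketch below reproduces. The function name `rule_of_thumb_width` and the variable names are illustrative; the numbers are those stated in the text.

```python
def rule_of_thumb_width(n, k):
    """Kernel width h = n^(-1/(k+1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))

n_nodes, m1, q = 200, 20, 3
m2 = m1 * m1  # node pairs formed as the product of the M1 sampled nodes

params_full_all_nodes = n_nodes * n_nodes        # Example 1 using every node
params_block_all_nodes = n_nodes * q + q * q     # Example 2 using every node
params_full_sampled = m2                         # Example 1 with sampled pairs
params_block_sampled = m1 * q + q * q            # Example 2 with sampled nodes

h1 = rule_of_thumb_width(m2, 4)       # edge function, data dimension 2p = 4
h2 = rule_of_thumb_width(n_nodes, 2)  # observed r on 200 nodes
h3 = rule_of_thumb_width(m1, 2)       # simulated r on 20 sampled nodes
```

Numerically, the three widths come out to roughly 0.30, 0.17, and 0.37.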

For inference on the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E^1$ and $E^2$ are selected as the final estimators and reported in the following. In our example the correct $Q = 3$ is always achieved, so we omit this trivial result.
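The selection of $Q$ by maximal BIC can be sketched as follows, assuming the Schwarz form $\log L - \tfrac{1}{2} k \log n$ (under which larger values are better, consistent with the text) and the block-network parameter count $M_1 Q + Q^2$ from Section 3.5. The log-likelihood values and the observation count `n_obs` below are hypothetical placeholders, not results from the paper.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Schwarz criterion in the 'larger is better' form: log L - (k / 2) * log n."""
    return log_lik - 0.5 * n_params * math.log(n_obs)

def select_num_blocks(log_liks, m1, n_obs):
    """Return the Q with maximal BIC, counting M1*Q + Q^2 parameters per model."""
    scores = {q: bic(ll, m1 * q + q * q, n_obs) for q, ll in log_liks.items()}
    best_q = max(scores, key=scores.get)
    return best_q, scores

# Hypothetical maximized log-likelihoods for Q = 1, ..., 5 (illustration only).
log_liks = {1: -720.0, 2: -640.0, 3: -520.0, 4: -515.0, 5: -512.0}
best_q, scores = select_num_blocks(log_liks, m1=20, n_obs=1200)
```

With these placeholder values the likelihood gain beyond $Q = 3$ is too small to offset the penalty on the additional parameters, so `best_q` is 3.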

In Figure 1 we plot the difference between the real edge function and the NPSML-estimated edge function on the set $\mathcal{G} \times \mathcal{G}$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted along the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; a red dot closer to the blue line therefore means better fitting accuracy. For most node pairs the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in \mathcal{G} \times \mathcal{G}$ with the null hypothesis $E_{xy} = (\hat{E}(x, y) - E(x, y))^2 = 0$. By the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy}/\sigma^2_{xy}$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma^2_{xy}$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples fewer than 10% of all 10000 evaluation pairs in $\mathcal{G} \times \mathcal{G}$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

Table 1: Estimation accuracy of $E^2$.

Entries       Bias      Std      P value
$E^2_{11}$    0.021     0.032    0.468
$E^2_{12}$    -0.006    0.012    0.383
$E^2_{13}$    -0.003    0.029    0.057
$E^2_{21}$    -0.001    0.029    0.028
$E^2_{22}$    0.022     0.022    0.66
$E^2_{23}$    -0.002    0.028    0.059
$E^2_{31}$    0.005     0.024    0.165
$E^2_{32}$    0.018     0.029    0.48
$E^2_{33}$    0.016     0.021    0.554
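The per-pair $\chi^2$ test described in this paragraph can be sketched with the standard library alone, using the fact that a $\chi^2$ variable with 1 degree of freedom is the square of a standard normal, so its upper tail probability is $\mathrm{erfc}(\sqrt{s/2})$. The function names and the toy estimates, true values, and variances are hypothetical; in the paper the variances come from a bootstrap.

```python
import math

def chi2_1_pvalue(stat):
    """Upper tail probability P(chi2_1 > stat) = erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))

def rejection_rate(estimates, truths, variances, level=0.90):
    """Fraction of pairs whose statistic (est - true)^2 / var rejects the null."""
    rejected = 0
    for est, true, var in zip(estimates, truths, variances):
        stat = (est - true) ** 2 / var
        if chi2_1_pvalue(stat) < 1.0 - level:
            rejected += 1
    return rejected / len(estimates)

# Hypothetical toy values: the first pair fits well, the second does not.
rate = rejection_rate([0.50, 0.80], [0.52, 0.50], [0.0009, 0.0009])
```

At the 90% level the rejection threshold on the statistic is about 2.71, the 0.90 quantile of the $\chi^2_1$ distribution.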

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E^2$ relative to (23). The first column presents the estimation bias, and the second and third columns are the empirical standard deviations and the empirical P values of the estimates, from which we conclude that the fitting accuracy is high.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. As for the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio decreases slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that, for the general network, there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the famous event “Unite the Right rally”. The “Unite the Right rally”, also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the numbers of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, it is not possible to track how every single message is spread from user to user using our collected data; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we can define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
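The conversion from user-level reaction records to distribution flows can be illustrated with a short sketch. It assumes, hypothetically, that each user's record has already been reduced to the index of the day of first reaction (0 for the origin day, `None` for users who never reacted in the window); the function name `distribution_flows` is illustrative only.

```python
def distribution_flows(first_reaction_day, num_days):
    """Return one {0, 1}-valued vector per day t = 0, ..., num_days - 1.

    first_reaction_day[i] is the day index (0 = origin) on which user i first
    reacted, or None if user i never reacted during the collection window.
    """
    flows = []
    for t in range(num_days):
        # Coordinate i is 1 if and only if user i has reacted at least once by day t.
        flows.append([1 if d is not None and d <= t else 0 for d in first_reaction_day])
    return flows


# Hypothetical record of four users observed over three days.
flows = distribution_flows([0, 2, None, 1], num_days=3)
for t, vector in enumerate(flows):
    print(t, vector)
```

By construction every coordinate is non-decreasing in $t$: once a user has joined the process, the corresponding entry stays at 1.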

For such a distribution flow data set, we are interested in making inference on features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relation, we embed the nodes (users) of the interaction network into a 5-dimensional feature space with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is averaged over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.
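The subsample-and-aggregate step can be sketched as follows. This is only an illustration under stated assumptions: the block estimator itself is not reproduced, block labels are assumed to be aligned across subsamples, and the maximum a posteriori aggregation of $E_1$ is approximated here by a simple most-frequent-block vote at each evaluation node. All function names and numbers are hypothetical.

```python
import random
from collections import Counter


def draw_subsamples(user_ids, size, repeats, seed=0):
    """Draw `repeats` random subsamples of `size` users each (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(user_ids, size) for _ in range(repeats)]


def average_matrices(matrices):
    """Entry-wise average of equally sized matrices (aggregated interaction matrix)."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)] for i in range(rows)]


def modal_labels(label_lists):
    """Most frequent block label per evaluation node (assumed stand-in for MAP)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*label_lists)]


# Hypothetical subsample estimates for a 2-block network and 3 evaluation nodes.
subsamples = draw_subsamples(list(range(100)), size=10, repeats=3)
e2_hat = average_matrices([[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]])
blocks = modal_labels([[0, 1, 2], [0, 1, 1], [0, 2, 1]])
print(e2_hat, blocks)
```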

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take a much smaller $M_1$ than the number of all users in the record (10000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. The feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as for the full sample of nodes. The preliminary result shows that the estimators are not sensitive to different choices of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
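The selection of representative nodes by K-means can be sketched with a plain implementation of Lloyd's algorithm. This is a minimal sketch, not the code used in the study: the function name `kmeans_centres`, the random initialisation, and the toy 2-dimensional data are assumptions, and in practice the features would be 5-dimensional and probably rescaled before clustering.

```python
import random


def kmeans_centres(points, k, iters=50, seed=0):
    """Return k cluster centres found by Lloyd's algorithm on a list of tuples."""
    rng = random.Random(seed)
    # Initialise the centres with k distinct data points chosen at random.
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            groups[nearest].append(p)
        new_centres = []
        for j, group in enumerate(groups):
            if group:
                # Move the centre to the coordinate-wise mean of its members.
                new_centres.append([sum(col) / len(group) for col in zip(*group)])
            else:
                # Keep the previous centre when a cluster receives no points.
                new_centres.append(centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    return centres


# Toy data: two well-separated groups of users in a 2-dimensional feature space.
users = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres = kmeans_centres(users, k=2)
print(sorted(centres))
```

Each returned centre plays the role of one synthetic evaluation node, so calling the function with `k` equal to $M_1$ would yield the $M_1$ feature vectors described above.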

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases where the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.
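The model selection step can be illustrated with a short sketch. It assumes the BIC is written in the form $\log L - (k/2)\log n$, so that larger values are better, which matches the maximization described in the text; the exact penalty used in the study may differ. The log-likelihoods, parameter counts, and sample size below are hypothetical and are not the values behind Figure 2.

```python
import math


def bic(log_likelihood, num_params, num_obs):
    """BIC in the 'larger is better' form: log L - (k / 2) * log n."""
    return log_likelihood - 0.5 * num_params * math.log(num_obs)


def best_block_number(candidates, num_obs):
    """candidates maps block number -> (maximised log-likelihood, number of parameters)."""
    scores = {b: bic(ll, k, num_obs) for b, (ll, k) in candidates.items()}
    # Pick the block number with the largest BIC value.
    return max(scores, key=scores.get), scores


# Hypothetical fits for 3, 4, and 5 blocks.
candidates = {3: (-10200.0, 30), 4: (-9950.0, 44), 5: (-9940.0, 60)}
best, scores = best_block_number(candidates, num_obs=2500)
print(best, scores)
```

In this toy example the 5-block model has the highest likelihood, but its extra parameters are penalised enough that the 4-block model attains the largest BIC.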

Figure 2: BIC for different block numbers (block_dim = 3, 4, 5).

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the amounts of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10000+ users. Because the clustering is carried out on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems to be anomalous, it does reflect the information loss when the higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or city group within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat     Lon
Big name community           1474739     123835    149494          30.78   -89.99
Famous active community      535641      25967     137372          34.18   -117.59
Famous inactive community    500197      3519      102222          40.75   -82.55
Nobody community             21658       3770      113593          46.77   -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and they do not post very actively either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small amount of history posts, so this group of users might be “stars” in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community: users in this community also have many followers but, unlike the inactive community, their average number of friends and history posts is huge, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the spatial distribution of the numbers of followers and friends is highly uneven within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the amount of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media: every user has the same right to express their own opinion, no matter whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation can be better visualized in Figure 4); future studies are needed to better understand the source of the spatial segregation.

The link strength between different communities is presented in Table 3 (the “From” label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the “To” label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A clear hierarchical structure can be seen in the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their own insiders or from users of the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Figure 3: Spatial distribution of features of users within different communities (famous inactive, famous active, big name, nobody). (a) Spatial distribution of followers number within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history post within different communities.

Table 3: Link matrix of 4 communities.

                                From big name   From famous active   From famous inactive   From nobody
                                community       community            community              community
To big name community           1               1                    1                      1
To famous active community      1               1                    0.701                  0.637
To famous inactive community    0.175           0.365                0                      0
To nobody community             0               0                    0                      0.01

Figure 4: Estimate for interaction matrix.
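The row/column convention of Table 3 can be made concrete with a small sketch that stores the link matrix and reads off, for each community, which communities it receives opinions from. The entries are those reported in Table 3; the helper names `sources_of` and `total_outgoing` are hypothetical, and summing a column is only one possible way to summarise a community's outgoing influence.

```python
COMMUNITIES = ["big name", "famous active", "famous inactive", "nobody"]

# LINK[i][j] is the impact strength from community j (column, "From")
# to community i (row, "To"), as reported in Table 3.
LINK = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.701, 0.637],
    [0.175, 0.365, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.01],
]


def sources_of(target):
    """Communities with a nonzero link into `target` (read along a row)."""
    i = COMMUNITIES.index(target)
    return [COMMUNITIES[j] for j, weight in enumerate(LINK[i]) if weight > 0]


def total_outgoing(source):
    """Sum of link strengths from `source` to all communities (read down a column)."""
    j = COMMUNITIES.index(source)
    return sum(row[j] for row in LINK)


for name in COMMUNITIES:
    print(name, "<-", sources_of(name), "| outgoing:", round(total_outgoing(name), 3))
```

Reading the rows reproduces the observations in the text: the famous inactive community receives only from the big name and famous active communities, and the nobody community receives only from itself.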

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining work on the tweet articles involved in the spreading process, add the extracted information as a covariate to the spreading process, and re-estimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for those spreading processes that cannot be repeatedly observed, the estimator turns out to be still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure; its practical usefulness is illustrated by a real data application where the spreading process of tweet articles regarding the event “Unite the Right rally” is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance the estimation accuracy and the computation cost is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint for achieving identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author through xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was done by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing was done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic block models for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.


[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.


interaction network. As a result, it becomes necessary to develop more data-driven approaches tailored to uncovering the hidden spatial social network behind the distribution flows.

Distribution flows are frequently studied in the field of rumor and/or flu spreading. Existing methods, in broad terms, can be grouped into two classes: the agent-based modelling/simulation/calibration (ABM) techniques [9–15] and the differential equation (DE) based approaches [16–19]. The class of DE approaches is helpful for deriving qualitative conclusions regarding the steady-state distribution of the spreading processes and how the equilibrium depends on model parameters in a coarse sense. However, to guarantee that meaningful qualitative results are achievable, the setup of the differential equations is often oversimplified, which causes a loss of insight into the complex reality. In addition, due to the lack of an explicit solution in most cases, it is not possible to apply the DE techniques to fit real data and generate detailed quantitative results. In contrast, ABM approaches are more realistic and suitable for quantitative research on real distribution flows. However, there are still a couple of shortcomings in the existing ABM models [20].

First, ABM often assumes that the spreading process is carried on a network where nodes represent agents that can potentially spread or be infected with a certain type of object (e.g., rumor), and edges are the links between agents. Rumor can only be spread between agents linked by edges. Under the ABM framework, this interaction network is supposed to be known and prescribed a priori. A prior network may lose critical information about the interaction patterns of the population [15, 21, 22]. For instance, in the Twitter network, a natural interaction network structure is the network formed by the friendship or followership relation between users, which is also frequently used as the prior network for rumor spreading studies [14, 18]. However, rumor does not have to follow this network to spread [23]. In fact, the retweet action of big-name users is likely to be visible, through other channels such as TV shows and newspapers, to users who are not linked to them in the friendship network. Therefore, spreading between a big-name user and an ordinary user is still possible even if, counting only the friend or follower relation, they are not linked at all. The existence of hidden links makes the prior network fail to capture all structural features of interactions; a data-driven or posterior network would help to overcome this issue.

Second, given the prior network, ABM assumes that the spreading occurs through interaction mechanisms between two randomly picked agents. Widely used interaction mechanisms include the independent cascade model, the linear threshold model, and so on [24–26]. These mechanisms are often parametrized and assumed homogeneous for all agents; i.e., the mechanism is determined by a set of parameters that are constant across agents. In reality, both the relative position of an agent within the network, such as the degree, centrality, and betweenness of the agent [17, 18, 27], and many social-economic factors external to the entire network, such as geographic location, social status, education level, and wealth [22, 27], can drastically affect the likelihood that agents get infected by the rumor. But the heterogeneity among agents is often missing from the standard ABM framework.

To resolve the above issues, we propose a novel and completely data-driven modelling approach to characterize the hidden interaction network and the spreading process. Our study contributes to the existing literature in the following aspects.

First, we consider the interaction network as a weighted multidimensional spatial social network, which is an extension of the standard spatial network, and the nodes in the network are embedded into a multidimensional feature space $\mathbb{R}^p$. The weighted edge between nodes is considered as a continuous function on $\mathbb{R}^p \times \mathbb{R}^p$. Within such a network, the value of the edge weight function can depend on features of both the start nodes and the end nodes, so it fully respects the heterogeneity of nodes and its effect on shaping the spreading process and distribution flow.

Second, we link the interaction network with the distribution flows through the classical mean-field models [9, 16–18], and the law of distribution transition is realized by a kernel operator whose kernel function is given by the edge weight function. Such a construction allows the infection status of a given node to depend on all other nodes in the network in a smooth manner, which avoids the arbitrariness of distinguishing the impact of neighbor and nonneighbor nodes while also facilitating the inclusion of the context information embedded in the spatial social network into the analysis of spreading.

Third, we adopt the kernel smoothing technique and nonparametric likelihood estimation from statistics [28, 29] to fit our model to real distribution flows, where the entire edge weight function is supposed to be unknown and needs to be estimated from real-world distribution flow data. The nonparametric nature of the method makes it a powerful tool for information mining from distribution flow data. Finally, the widely used block models [30–33] can be easily incorporated into our framework, which helps to better uncover hidden social-economic connections between individuals from distribution flows.

The paper is organized as follows. In Section 2, we give an overview of existing methods of network estimation. Section 3 formally presents the setup of our method, including the definition of the feature space network, the mean-field models and their simulation techniques, and the design of our likelihood estimators. Section 4 validates the effectiveness of our estimators of the hidden network with synthetic data and numerical experiments. Section 5 applies our method to a distribution flow dataset of information spreading on Twitter related to the 2017 “Unite the Right rally” event.

2. Relevant Methods

The method proposed in this paper is essentially a network estimation tool, and network estimation is a long-standing topic in many different fields.

In studies of agent-based models (ABMs), simulation-based estimation is usually adopted to calibrate the unknown parameters involved in the model setup [13–15, 18, 34, 35]. Simulation-based estimation is efficient in dealing with the estimation of ABMs: as it is often impossible to derive an analytic expression for the standard error functions in the ABM setting, simulation can help generate an empirical version of the error function and facilitate the application of the standard ordinary least squares (OLS) and maximum likelihood (ML) estimation strategies. However, simulation-based estimation is more frequently applied to parametric ABMs, where only a finite-dimensional parameter vector is to be estimated; it is rarely used to estimate the hidden network structure, as the unknown network is essentially nonparametric and thus less tractable than parametric models. To the best of our knowledge, the only exceptions are Grazzini and Richiardi [35] and Kukacka and Barunik [36], in which the interaction mechanism when two agents meet is allowed to include a nonparametric component, and the kernel smoothing method and nonparametric likelihood (or least squares) estimators are applied to cope with model estimation. However, Grazzini and Richiardi [35] and Kukacka and Barunik [36] do not include the interaction network between agents in their analysis, nor do they resolve the model identifiability issue; thus further exploration is needed in this direction.

The other related works deal with link prediction by stochastic-network models. In this field, nonparametric techniques are more often adopted to make inference about hidden features of stochastic networks [23, 31, 32, 37, 38]. Lu and Zhou [31] review the mainstream heuristic algorithms for forecasting the missing links within a partially observed network. Bickel et al. [39], from the perspective of statistical inference, summarize and validate the application of the variational expectation maximization (VEM) algorithm to infer the probability of existence of a link between two nodes from observed edge data. Matias et al. [38] extend the VEM method to deal with the future occurrence probability of edges given a dynamic linked network and the historic edge data; this extended method can handle the case where the evolution of the occurrence probability depends nonparametrically on an unknown hazard function. All these methods were developed under the common assumption that the edge information of at least part of the network has already been observed, which is possible for trajectory data but not for distribution flows. Thus a further extension is needed to handle the case in which all edge data are missing.

In the physics literature, the task of detecting the hidden network link structure from node-level time-series data is phrased as “network reconstruction”. Taking distribution flows as the input, two outstanding network reconstruction methodologies are directly comparable to ours. One is based on the compressive sensing technique proposed in Shen et al. [40]; the other is based on the combination of likelihood estimation and the mean-field approximation technique discussed in Roudi and Hertz [41]. The basic idea in Shen et al. [40] is to convert the network reconstruction problem into a classical convex optimization problem with linear constraints, the so-called compressive sensing (CS) problem. In the CS problem, the linear constraints come from the transition probability of nodes within the network from the uninfected state to the infected state, while the objective function arises from the sparsity assumption on the network link structure. Unlike the applications of the CS approach to network reconstruction from continuous time-series data [42–44], where the feature variables associated with every node are directly observable, in the case of distribution flows the key variable, the transition probability, is not observable from the data. Therefore, it has to be calculated in order to form the required linear constraints. Inferring the transition probability from the {0, 1}-valued distribution flow data requires a stationarity assumption on the underlying model, which is too restrictive in many applications. For instance, in the spreading of a virus, an agent might die immediately after it is infected, in which case the infected agent is censored in the sense that its infectious status is constantly one from the time of infection onward. When censored agents exist in the network, stationarity of the transition is impossible and the CS framework in Shen et al. [40] is no longer applicable. The other problem of the CS framework is its inability to handle the spatial heterogeneity among different nodes. As we have highlighted, education, wealth, and many other social-economic factors can play critical roles in determining the link strength among people and therefore affect the information spreading dynamics, so modelling the dependence of the hidden link structure on those social-economic factors is necessary in the study of social networks. The inclusion of social-economic factors introduces heterogeneity among nodes, which makes it challenging to identify which nodes are relatively homogeneous and can be grouped together. In the CS framework, grouping different nodes is the premise for calculating the transition probability. In an abstract network, all nodes are homogeneous and the grouping can simply be taken as the set of all nodes, as done in Shen et al. [40], while in a spatial network with widespread heterogeneity such a simple grouping trick is meaningless. How to extend the CS framework to spatial social networks remains a difficult problem, and extensive studies are needed.

The deep reason that restricts the CS framework is its reliance on the unobservable transition probability. That restriction can be effectively resolved by applying the likelihood technique suggested in Roudi and Hertz [41]. The merit of the likelihood-based approach is that it can compute the unknown transition probability simultaneously with the other model parameters. But the computation usually takes too much time: because there is no explicit solution to the first-order condition of the maximum likelihood problem, a numerical solution is required. To make the computation easier, a mean-field approximation technique is presented in Roudi and Hertz [41], which can increase the computation speed. However, the approximation only works when all link strengths are close to zero, which restricts its usefulness in many social network applications. In addition, the current version of the approximation technique in Roudi and Hertz [41] still assumes an abstract network structure, and no dependence of the link strength on social-economic factors is allowed; it is unclear whether the approximation can be extended to the reconstruction of spatial social networks. Finally, Roudi and Hertz [41] are only concerned with the situation in which the number of nodes ($N$) is relatively small and the computation complexity comes mainly from numerically solving the maximum likelihood problem. But when $N$ is large, the computation complexity is dominated by the matrix multiplication for the $N \times N$ adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires the matrix multiplication, its speed-up effect for giant networks may not be that significant. More exploration of the fast reconstruction of giant spatial social networks is needed.

3. Model Setup

3.1. Feature Space Network. We consider a weighted multidimensional spatial social network where nodes of the network are considered as elements in a $p$-dimensional Euclidean space $\mathbb{R}^p$ and every dimension of $\mathbb{R}^p$ is interpreted as a feature of nodes; thus $\mathbb{R}^p$ is interpretable as a feature space. Edges between nodes are assumed to depend on features of nodes in a smooth way; i.e., the edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., the function is smooth at all points except those contained in a zero-measure set), denoted as $E : \mathbb{R}^p \times \mathbb{R}^p \to [0, 1]$, where, without loss of generality, the edge weight between two nodes is restricted to the unit interval. Such a specification admits a stochastic-network interpretation of our model: the weight can be thought of as the probability that two nodes share an edge. Since the nodes of the network may not be evenly distributed within the entire space $\mathbb{R}^p$, we assume, without loss of generality, that the distribution of nodes is characterized by a probability measure $F$ on $\mathbb{R}^p$, and $F$ is supposed to be known from the data. In sum, the $p$-dimensional spatial network can be recorded as $G(\mathbb{R}^p, E, F)$, or simply $G$ when there is no ambiguity regarding its node space, distribution, and edge function.

There are several advantages to assuming that the spreading process and distribution flows occur within $G$. First, the embedding of the node set into the feature space $\mathbb{R}^p$ allows us to characterize the feature information of nodes that is external to the network structure [21, 22, 27], which is usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors, such as the intensity of population gathering in a set of locations, can significantly increase the capacity to forecast illness spreading among residents. Viboud et al. [45] report similar findings. Second, allowing nodes to be unevenly distributed within the feature space lets us include more general networks in the analysis. For instance, by a proper choice of the measure $F$ (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes sitting in the infinite feature space $\mathbb{R}^p$; this allows us to include most of the networks that we meet in practice. Finally, allowing the edge weight to depend smoothly on features of both the flow-in and flow-out nodes makes it possible to incorporate background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is the computational efficiency it induces. In fact, when a network consists of a giant number of nodes, even a simple summation operation can take a long time and huge memory, but when edges vary smoothly with nodes, it becomes possible to do the calculation only on a small set of nodes, and the global features of edges can then be inferred from the result on the relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we will concentrate on the spatial network $G(\mathbb{R}^p, E, F)$ instead of a more general concept of network.

3.2. Mean-Field Models. To model spreading processes within a spatial network $G(\mathbb{R}^p, E, F)$, we follow the convention in the rumor spreading literature [10, 17] and adopt the common assumption that a rumor can be spread from a node $x$ to another node $y$ if and only if (1) the initial node $x$ has been infected with the rumor, recorded as the event $I(x) = 1$; (2) there is an edge between them, or equivalently $E(x, y) > 0$; and (3) when conditions (1) and (2) hold, whether or not the spreading actually happens is purely random with probability $r$. Different spreading models impose different requirements on the probability $r$. In the current study, we adopt the mean-field model to determine $r$, as suggested in most previous studies. Formally, for every fixed time $t$, the probability of node $x \in \mathbb{R}^p$ being infected is determined by the following mean-field equation:

$$\frac{dr(x,t)}{dt} = (1 - r(x,t)) \cdot \int_{\mathbb{R}^p} E(x,y)\, r(y,t)\, dF(y). \tag{1}$$

The interpretation of (1) is that, at time $t$, the temporal variation rate of the probability that node $x$ is infected (represented as $dr(x,t)/dt$) is proportional to the probability that node $x$ has not yet been infected by time $t$ (represented as $1 - r(x,t)$), and the proportion is determined through a weighted sum of the probabilities that all other nodes in the network have been infected by $t$. The weight function describes the strength of connection between nodes $x$ and $y$ and thus can be formulated as the edge function $E$. Using the classical results on mean-field equations [46–50], it can be verified that the infection probability $r(x,t)$ in (1) is exactly equal to the probability of $I(x,t) = 1$ for a given right-continuous mean-field point process $I$ satisfying the following (we write $\mathbb{E}$ for the conditional expectation to distinguish it from the edge function $E$):

$$\mathbb{E}\bigl(I(x,t) - I(x,t-) \mid I(x,t-) = 0\bigr) = \int_{\mathbb{R}^p} E(x,y)\, I(y,t-)\, dF(y), \tag{2}$$

where $I(x,t-)$ is the left limit of the process $I(x,\cdot)$. The interpretation of (2) is more straightforward than that of (1): (2) states that the average rate at which node $x$ becomes infected is contributed by all those nodes that (1) have a connection to $x$ and (2) have been infected by the current time. These two conditions are often imposed in the literature.
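To make equation (1) concrete, the following is a minimal sketch of how it could be integrated numerically on a finite set of nodes. It assumes that $F$ is approximated by the empirical measure of the sampled nodes, so that the integral becomes an average; the function name `mean_field_euler`, the Gaussian-type edge function, and the standard normal node distribution are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def mean_field_euler(E, r0, t_end, dt):
    """Explicit Euler for dr_j/dt = (1 - r_j) * (1/M) * sum_l E[j, l] * r_l."""
    r = np.asarray(r0, dtype=float).copy()
    n_steps = int(np.floor(t_end / dt))
    for _ in range(n_steps):
        # Empirical version of the integral of E(x_j, y) r(y, t) against F.
        pressure = E @ r / len(r)
        r = r + dt * (1.0 - r) * pressure
    return r

rng = np.random.default_rng(0)
p, m = 2, 200
x = rng.normal(size=(m, p))  # nodes drawn from F (assumed standard normal)
# Hypothetical edge function: the weight decays with the feature distance.
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
E = np.exp(-d2)              # values in (0, 1], as required of E
r0 = np.full(m, 0.05)        # initial infection probability at every node
r1 = mean_field_euler(E, r0, t_end=5.0, dt=0.01)
```

Because the right-hand side of (1) is nonnegative and $E \le 1$, the iterates are nondecreasing in time and stay within $[0, 1]$ whenever $dt \le 1$.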

Let $r$ be a function satisfying the functional differential equation (1), and denote by $f$ the density or mass function associated with the probability measure $F$. Then the event that a given node $x$ is observed at time $t$ and its infectious status is observed to be infected has the probability density

$$p_1(x,t) = f(x)\, r(x,t); \tag{3}$$

in contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given as

$$p_0(x,t) = f(x)\,(1 - r(x,t)). \tag{4}$$

Suppose that, given a time $t$, the infectious status of a set of randomly picked nodes $\mathcal{N} \subset \mathbb{R}^p$ is observable and represented as

$$\mathcal{O}_t = \{ I(x,t) : x \in \mathcal{N} \}, \tag{5}$$

with $I(x,t) = 0$ being not infected and $I(x,t) = 1$ being infected; then the likelihood function of the observations $\mathcal{O}_t$ can be written in the following way by using (3) and (4):

$$L(\mathcal{O}_t, E) = \prod_{x \in \mathcal{N}} \bigl(f(x)\, r(x,t)\bigr)^{I(x,t)} \bigl(f(x)\,(1 - r(x,t))\bigr)^{1 - I(x,t)}, \tag{6}$$

where we add the edge function $E$ into the likelihood because it affects $L$ by determining the functional form of $r$. Maximizing (6) yields the classical maximum likelihood (ML) estimator of $E$.

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing. In the study of spreading processes, only distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $E$, are not observable and thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $E$ given the observed distribution flows $\{\mathcal{O}_{t_i} : i = 1, \dots, T\}$, $t_1 < \cdots < t_T$, on a sequence of times. NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. It applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly the situation we need to handle: because the function $r$ in (6) is the solution to the functional differential equation (1), no clean analytic expression is available for it.

However, our task is different from the situation originally discussed in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), however, the likelihood depends on the nonparametric edge function $E$. In this situation, one extra kernel smoothing step is needed to approximate $E$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted at the level of random variables, while in our case simulation is at the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, due to the nonparametric model setup, model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use $K^p$ to denote the $p$-dimensional standard Gaussian kernel function and $K^p_h(x) = K^p(x/h)/h^p$ for some positive constant $h$).

Step 1. Select a constant $dt > 0$ and large positive integers $M_1$ and $M_2$ ($dt$ is the length of every time step used for numerically solving the functional differential equation (1); $M_1$ and $M_2$ are the numbers of random samples that will be drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function).

Step 2. Draw $M_1$ random samples $x_1, \dots, x_{M_1} \in \mathbb{R}^p$ from the distribution $F$ and $M_2$ random samples $w_1, \dots, w_{M_2} \in \mathbb{R}^p \times \mathbb{R}^p$ from the product measure $F \otimes F$.

Step 3. Given $e_1, \dots, e_{M_2} \in [0, 1]$, construct the function $E$ as follows:

$$E(w) = \frac{\sum_{i=1}^{M_2} K^{2p}_{h_1}(w - w_i) \cdot e_i}{\sum_{j=1}^{M_2} K^{2p}_{h_1}(w - w_j)}. \tag{7}$$

Step 4. Given $t_i$, let $\mathcal{O}_{t_i} = \{I(y_1, t_i), \dots, I(y_M, t_i)\}$ denote the observation set at time $t_i$, whose cardinality is $M$, and construct the function $r(\cdot, t_i)$ as follows:

$$r(y, t_i) = \frac{\sum_{l=1}^{M} K^{p}_{h_2}(y - y_l) \cdot I(y_l, t_i)}{\sum_{j=1}^{M} K^{p}_{h_2}(y - y_j)}. \tag{8}$$

Step 5. Solve the mean-field equation (1) over the interval $[t_i, t_{i+1})$ at the set of sample points $x_1, \dots, x_{M_1}$ drawn in Step 2 by Euler's method with time step $dt$, subject to the initial condition $r(\cdot, t_i)$, as follows:

$$r(x_j, t_i + (k+1) \cdot dt) = r(x_j, t_i + k \cdot dt) + \bigl(1 - r(x_j, t_i + k \cdot dt)\bigr) \cdot \frac{dt}{M_1} \cdot \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t_i + k \cdot dt), \tag{9}$$

where $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor$ and $\lfloor a \rfloor$ is the greatest integer not exceeding $a$.

Step 6. For the observation set $\mathcal{O}_{t_{i+1}} = \{I(y_1, t_{i+1}), \dots, I(y_{M'}, t_{i+1})\}$ at $t_{i+1}$ with cardinality $M'$, generate the simulated density at the sample nodes $y_l$, $l = 1, \dots, M'$, as follows:

$$r(y_l, t_{i+1}) = \frac{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j) \cdot r(x_j, t_{i+1})}{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j)}, \tag{10}$$

and construct the simulated likelihood function as follows:

$$\hat{L}(\mathcal{O}_{t_{i+1}}; e_1, \dots, e_{M_2}) = \prod_{l=1}^{M'} \bigl(f(y_l)\, r(y_l, t_{i+1})\bigr)^{I(y_l, t_{i+1})} \cdot \bigl(f(y_l)\,(1 - r(y_l, t_{i+1}))\bigr)^{1 - I(y_l, t_{i+1})}. \tag{11}$$

The full information likelihood function for all observation times can be constructed from (11) in the following way:

$$\hat{L}^{*}(\{\mathcal{O}_{t_i} : i = 1, \dots, T\}; e_1, \dots, e_{M_2}) = \prod_{i=1}^{T} \hat{L}(\mathcal{O}_{t_i}; e_1, \dots, e_{M_2}). \tag{12}$$

The estimator of the unknown edge function $E$ can be derived by maximizing the simulated full information likelihood function (12) over $e_1, \dots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e_1^{*}, \dots, e_{M_2}^{*}$ in the way of (7).
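Steps 3 to 6 can be sketched for a single interval $[t_i, t_{i+1}]$ as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names `gaussian_weights` and `simulated_loglik`, the bandwidth values, the standard normal choice of $F$, and the synthetic observations are all hypothetical. The code works with the logarithm of (11), and the normalizing constants of the Gaussian kernel are omitted because they cancel in the ratios (7), (8), and (10).

```python
import numpy as np

def gaussian_weights(targets, centers, h):
    """Row-normalised Gaussian kernel weights K_h(target - center)."""
    d2 = ((targets[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logw = -0.5 * d2 / h**2
    logw -= logw.max(axis=1, keepdims=True)  # numerical stabilisation only
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

def simulated_loglik(e, w_pts, x, y0, I0, y1, I1, f_y1, h1, h2, h3, t_gap, dt):
    """Log of the simulated likelihood (11) for one observation interval."""
    m1 = len(x)
    # Step 3: edge function (7) evaluated on every pair (x_j, x_l).
    pairs = np.concatenate([np.repeat(x, m1, axis=0), np.tile(x, (m1, 1))], axis=1)
    E = (gaussian_weights(pairs, w_pts, h1) @ e).reshape(m1, m1)
    # Step 4: initial infection probability (8) at the simulation nodes.
    r = gaussian_weights(x, y0, h2) @ I0
    # Step 5: Euler iteration (9).
    for _ in range(int(np.floor(t_gap / dt))):
        r = r + (1.0 - r) * dt * (E @ r) / m1
    # Step 6: smooth back to the observed nodes (10), then take the log of (11).
    r_obs = np.clip(gaussian_weights(y1, x, h3) @ r, 1e-12, 1 - 1e-12)
    return np.sum(np.log(f_y1) + I1 * np.log(r_obs) + (1 - I1) * np.log1p(-r_obs))

# Synthetic inputs (assumed): F is standard normal on R^p with p = 1.
rng = np.random.default_rng(1)
p, m1, m2, m = 1, 30, 40, 200
x = rng.normal(size=(m1, p))            # Step 2: simulation nodes from F
w_pts = rng.normal(size=(m2, 2 * p))    # Step 2: pairs from F x F
y0 = rng.normal(size=(m, p)); I0 = (rng.random(m) < 0.2).astype(float)
y1 = rng.normal(size=(m, p)); I1 = (rng.random(m) < 0.4).astype(float)
f_y1 = np.exp(-0.5 * (y1**2).sum(axis=1)) / (2 * np.pi) ** (p / 2)
e = np.full(m2, 0.5)                    # candidate edge values e_1, ..., e_{M2}
ll = simulated_loglik(e, w_pts, x, y0, I0, y1, I1, f_y1, 0.5, 0.5, 0.5, t_gap=1.0, dt=0.1)
```

Summing such log-likelihoods over all intervals gives the logarithm of (12); one possible way to maximize it over $e \in [0,1]^{M_2}$ is a box-constrained numerical optimizer, although the choice of optimizer is not specified in the text above.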

Compared with NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $r$, one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the $M_1$ samples drawn in Step 2. However, when the system has a giant node set (say, millions), a sample size $M_1 \ll M$ can be applied in order to improve the computational efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time $t$, it will be regarded as safe and will be missing from the tracking at the next few observation time points. In this interval-censoring situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimensionality” despite its nonparametric essence, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated evaluation of the multiplication between an $M_1 \times M_1$ matrix and an $M_1$-dimensional vector, so the computation complexity is of order $M_1^2$. Although $M_1$ can be taken much smaller than the number of nodes in the observations ($M$), it still has to increase as $M$ increases. So when $M$ is a giant number, $M_1$ has to be large as well, and the computation complexity of the entire estimation procedure will be dominated by $M_1^2$. In this section, we propose a fast algorithm that reduces the computation complexity in (9) to depend linearly on $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbors. In our setting, there is no strict “neighbor” defined, but it is still possible to randomly pick one agent from the entire population, and the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 above is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \dots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \dots, x_{M_1}\}$.

Step 5(2). Compute

$$r(x_j, t + dt) = r(x_j, t) + (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\, dt. \tag{13}$$

Step 5(3). Repeat the above two steps for all $t = t_k$, $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$ (with $t_k = t_i + k \cdot dt$, as in (9)), and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \dots, x_{M_1}$) is replaced with a scalar multiplication, so the resulting computation complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
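The contrast between (9) and (13) can be illustrated with a short sketch. It assumes that the partner in Step 5(1) is drawn uniformly from the $M_1$ sample nodes; the function names `euler_full` and `euler_fast` and the Gaussian-type `edge_fn` are hypothetical choices made for the example.

```python
import numpy as np

def edge_fn(a, b):
    """Hypothetical edge function E(a, b) in (0, 1], vectorised over rows."""
    return np.exp(-((a - b) ** 2).sum(axis=-1))

def euler_full(E, r, dt):
    """One step of (9): each node averages over all M1 nodes, O(M1^2) work."""
    return r + (1.0 - r) * dt * (E @ r) / len(r)

def euler_fast(x, r, dt, rng):
    """One step of (13): each node meets one random partner, O(M1) work."""
    partner = rng.integers(0, len(r), size=len(r))  # Step 5(1)
    return r + (1.0 - r) * edge_fn(x, x[partner]) * r[partner] * dt  # Step 5(2)

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 2))
r = rng.uniform(0.05, 0.5, size=50)
E = edge_fn(x[:, None, :], x[None, :, :])  # full M1 x M1 matrix, only needed by (9)
full = euler_full(E, r, dt=0.1)
# Averaging many independent fast steps approximates the full step in expectation.
fast_mean = np.mean([euler_fast(x, r, 0.1, rng) for _ in range(4000)], axis=0)
```

Note that the fast step never forms the $M_1 \times M_1$ matrix; it only evaluates the edge function at the $M_1$ sampled pairs.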

For the accuracy of the fast algorithm, we claim that, compared with the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i (t_{i+1} - t_i)$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following:

(i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13);

(ii) denote by $\Delta(j,t)$ the increment $\Delta(j,t) = (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\operatorname{cov}(\Delta(j,t), \Delta(j',t') \mid r(x_j,t_i)) \le t_{i+1} - t_i$.

Property (i) and the case $j' \neq j$ in (ii) are quite trivial. For $t_i < t < t' < t_{i+1}$, $\operatorname{cov}(\Delta(j,t), \Delta(j,t') \mid r(x_j,t_i))$ can be decomposed as the sum of the following two components (we write $\mathbb{E}$ for expectation to distinguish it from the edge function $E$):

$$\begin{aligned}
A &= \operatorname{cov}\bigl(\Delta(j,t),\ (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\bigr) \\
&= \operatorname{var}\bigl(r(x_j,t) \mid r(x_j,t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{var}\bigl(r(x_j,t) \mid r(x_j,t_i)\bigr) = \operatorname{var}\bigl(r(x_j,t) - r(x_j,t_i) \mid r(x_j,t_i)\bigr) \\
&\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\
B &= \operatorname{cov}\bigl(\Delta(j,t),\ (r(x_j,t') - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\bigr) \\
&= \operatorname{cov}\bigl(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{cov}\bigl(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\bigr) \\
&\le \mathbb{E}\bigl(\lvert r(x_j,t') - r(x_j,t) \rvert \mid r(x_j,t_i)\bigr) \\
&\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned} \tag{14}$$

where sdot infin is the 119871infin norm of a bounded valued functionThe above inequality holds straightforwardly from the fact 119903is bounded by 1 and its temporal derivative is given by (1)which is also uniformly bounded by 1 then the statement(ii) follows immediately

Using properties (i) and (ii) and the law of large numbers, it is straightforward to show that the difference between the likelihood function constructed from (9) and the one constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two ways of calculating the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, under the fast algorithm, the estimation accuracy does not depend on the choice of $dt$, so in practice $dt$ can be set directly to $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. In many applications [33, 38, 39], the existence of a connection between two agents depends only on the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered a partition of the set of all nodes; the edge function can then be decomposed into two components:

(i) the group weight function $E^1 : \mathbb{R}^p \to [0,1]^Q$;

(ii) the group-level edge weight $E^2$, which is a $Q \times Q$ matrix with each entry valued in $[0,1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

\[
E(x,y) = E^1(x)^\top E^2 E^1(y),
\tag{15}
\]

where the image of $E^1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ denotes the vector transpose. The group weight function is required to satisfy the following: for every $x$ with $E^1(x) = (s_1, \dots, s_Q)$, there exists only one $i \in \{1, \dots, Q\}$ with $s_i > 0$. In other words, every node has a positive probability of belonging to at most one group, which guarantees that the groups constitute a partition of the node set.
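As an illustration of the decomposition (15), the following minimal sketch evaluates the block edge function for a pair of nodes. The one-dimensional feature, the equal-width binning used as group weight function, and the entries of the interaction matrix are all hypothetical choices made only for this example; they are not the paper's implementation.

```python
import numpy as np

def block_edge(e1_x, e2, e1_y):
    """Edge weight E(x, y) = E1(x)^T E2 E1(y), following the form of (15)."""
    return float(e1_x @ e2 @ e1_y)

def hard_membership(x, q):
    """Hypothetical E1: map a scalar feature x in [0, 1) to a one-hot group vector
    by splitting [0, 1) into q equal-width bins (exactly one positive entry)."""
    s = np.zeros(q)
    s[min(int(x * q), q - 1)] = 1.0
    return s

# Hypothetical 3-group interaction matrix with entries in [0, 1].
E2 = np.array([[0.0, 1.0, 0.5],
               [0.8, 0.0, 0.3],
               [0.1, 0.0, 0.0]])

x, y = 0.10, 0.50  # x falls in group 0, y falls in group 1
weight = block_edge(hard_membership(x, 3), E2, hard_membership(y, 3))
print(weight)      # with one-hot memberships this is simply E2[0, 1]
```

Because each membership vector has a single positive entry, the edge weight between two nodes reduces to one entry of $E^2$, which is exactly why only $Q^2$ interaction parameters are needed at the group level.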

The estimation of a block network is equivalent to the estimation of (1) the group weight function $E^1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E^2$, which is the parametric component of the network. The estimation is therefore essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable in this case. The only modification is in Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$. Instead, it is applied to generate the estimate of the group weight $E^1$. The hidden weight function $E$ is then constructed from the kernel-smoothed $E^1$ and the given interaction matrix $E^2$ as in (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve for is only $M_1 Q + Q^2$, whereas it is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with $M_1^2$ (although slowly); when the number of observed nodes is very large, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly improve computational efficiency.

In addition, a block network is much easier to identify than a general fully nonparametric network, which will be discussed in the next section. Finally, under a block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is straightforward and hence omitted).

Proposition 1. Denote by $E^1_i(x)$ the projection of the vector $E^1(x)$ onto its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E^1_i(x) > 0\}$, the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., one satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x,y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

\[
r(x) =
\begin{cases}
0 & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E^2)^n r(y,t_0) \equiv 0 \text{ for all } y,\ n > 0, \\
1 & \text{else,}
\end{cases}
\tag{16}
\]

where $r(y,t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector onto its $i$th dimension, and $(E^2)^n$ denotes the $n$th power of the matrix $E^2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distribution with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E^2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17) (where each $x$ marks a strictly positive entry),

\[
\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \ddots & \vdots \\
0 & \cdots & x & \cdots & x \\
\vdots & \ddots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix}
\tag{17}
\]

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

\[
r(x) = 1 \text{ iff } x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i
\iff
r(x,t_0) > 0 \text{ iff } x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i .
\tag{18}
\]

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E^1, E^2)$ are identifiable. However, in real applications a spreading process can be observed at most a few times, so $m$ cannot be expected to be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable; i.e., there exists $E \neq E'$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set; i.e., the set $\mathcal{S}_{E_0, r(\cdot,t_0)} = \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot,t_0)}$ varies along with the initial infectious status $r(\cdot,t_0)$. Formally, $E \in \mathcal{S}_{E_0, r(\cdot,t_0)}$ if and only if the following holds for all $n = 1, 2, \dots$:

\[
\bigl(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_E\bigr)^n r(\cdot,t_0) \equiv \bigl(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_{E_0}\bigr)^n r(\cdot,t_0),
\tag{19}
\]

where $\mathcal{K}_E$ is a bounded operator over the function space $L^2(\mathbb{R}^p)$ defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x,y)\, g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplication operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the $n$-fold self-composition of an operator.

Equation (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x,t_0) \equiv 1$ for all $x \in U$, it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(\cdot,t_0)}$. On the other hand, if there exists an open $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_{E_0})^n r(x,t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(\cdot,t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, the infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$ it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions imposed on $E_0$ and/or $r(\cdot,t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. In the special case of block networks, however, a simple identifiability condition can be worked out. In fact, for block networks, $(E^1_0, E^2_0)$ is identifiable (that is, there does not exist a pair $(E^1, E^2)$ that differs from the true $(E^1_0, E^2_0)$ but leads to the same likelihood function (6) in the limit case) if and only if, for the true $E^2_0$, the vector space spanned by the family of vectors $\{v_t : t \ge t_0\}$ is the entire feature space $\mathbb{R}^Q$; i.e., $\{v_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $v_t = (v_{t,1}, \dots, v_{t,Q})^\top$ is a $Q$-dimensional column vector for every $t$, and, for each $q = 1, \dots, Q$, $v_{t,q} = \int_{\mathbb{R}^p} E^1_{0,q}(x)\, r(x,t)\, dF(x)$, where $E^1_{0,q}$ is the $q$th entry of $E^1_0(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

\[
\det\Bigl[\, v_{t_0},\ \operatorname{diag}(c - v_{t_0}) E^2_0 v_{t_0},\ \dots,\ \bigl(\operatorname{diag}(c - v_{t_0}) E^2_0\bigr)^{Q-1} v_{t_0} \Bigr] \neq 0,
\tag{20}
\]

where $c$ is another $Q$-dimensional column vector $(c_1, \dots, c_Q)^\top$ determined by the true $E^1_0$ function such that $c_q = \int_{\mathbb{R}^p} E^1_{0,q}(x)\, dF(x)$ for $q = 1, \dots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector into a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically", in the sense that the set of $E^2$'s that forces the determinant in (20) to be constantly equal to 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0,1]^{Q \times Q}$, and, for those $E^2$'s for which the determinant is not constantly 0, the set of $v_{t_0}$ that forces it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0,1]^Q$. Therefore, (20) holds for almost all $E^2$ and $v_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
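The determinant condition (20) can be checked numerically once $v_{t_0}$, $c$, and $E^2_0$ are available. The sketch below builds the matrix whose columns are $v_{t_0}, A v_{t_0}, \dots, A^{Q-1} v_{t_0}$ with $A = \operatorname{diag}(c - v_{t_0}) E^2_0$ and tests its rank. All numerical values are hypothetical, and the function name is ours; it is only meant to illustrate the algebra, not to reproduce the authors' code.

```python
import numpy as np

def identifiability_matrix(v0, c, e2):
    """Stack the columns v0, A v0, ..., A^(Q-1) v0 with A = diag(c - v0) @ e2,
    i.e. the matrix whose determinant appears in condition (20)."""
    a = np.diag(c - v0) @ e2
    cols = [v0]
    for _ in range(len(v0) - 1):
        cols.append(a @ cols[-1])
    return np.column_stack(cols)

# Hypothetical inputs for Q = 3 blocks.
E2 = np.array([[0.0, 1.0, 0.5],
               [0.8, 0.0, 0.3],
               [0.1, 0.0, 0.0]])
c = np.array([1 / 3, 1 / 3, 1 / 3])   # total mass of each block under F
v0 = np.array([0.10, 0.05, 0.15])     # initially infected mass in each block

K = identifiability_matrix(v0, c, E2)
print("determinant:", np.linalg.det(K))
print("full rank:", np.linalg.matrix_rank(K) == len(v0))

# A degenerate case: with no interaction at all, every column after the first
# vanishes, so the condition fails.
K_zero = identifiability_matrix(v0, c, np.zeros((3, 3)))
print("rank without interaction:", np.linalg.matrix_rank(K_zero))
```

Checking the rank rather than comparing the determinant with zero is a design choice for the sketch: it avoids picking an arbitrary tolerance for a quantity whose scale shrinks quickly with $Q$.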

The "almost" identifiability for block networks guarantees that, in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $E^1$ and $E^2$ from the NPSML converge asymptotically to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows directly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. The theoretical validity of the estimators developed in the previous sections is thus established.

Remark 2 (sparsity). Although complete identifiability is in general hard to achieve for both the general network and the block network, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x,y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E^2\|_2^2 = \sum_{ij} (E^2_{ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space; by compactness and convexity, there always exists a unique $E$ (or $E^2$) whose $L^2$ distance to the origin reaches the minimum.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $N$ consists of 200 nodes drawn at random from the unit cube $[0,1)^2$; these nodes thus follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ decreases linearly with the standard Euclidean distance between two nodes; i.e.,

\[
E(x,y) = 1 - \sqrt{\frac{\langle x - y,\ x - y \rangle}{2}} .
\tag{21}
\]

Example 2 (block network). Set $Q = 3$; the block membership function $E^1$ satisfies

\[
E^1(x,y) =
\begin{cases}
(1, 0, 0) & \text{if } \dfrac{x+y}{2} < \dfrac{1}{3}, \\[6pt]
(0, 1, 0) & \text{if } \dfrac{1}{3} \le \dfrac{x+y}{2} < \dfrac{2}{3}, \\[6pt]
(0, 0, 1) & \text{else,}
\end{cases}
\tag{22}
\]

where $(x, y)$ denotes the two coordinates of a node in the unit cube. The matrix $E^2$ is given as follows:

\[
E^2 =
\begin{pmatrix}
0 & 1 & 0.5 \\
0.8 & 0 & 0.3 \\
0.01 & 0 & 0
\end{pmatrix} .
\tag{23}
\]

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, with the infected nodes picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test robustness. The process is followed until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
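The simulation setup above can be sketched as follows. This is a deterministic explicit Euler sketch that assumes the mean-field equation (2) has the form $\partial_t r(x,t) = (1 - r(x,t)) \int E(x,y)\, r(y,t)\, dF(y)$, which is consistent with the equilibrium condition in Proposition 1; the paper's own generator of the $\{0,1\}$-valued flows may differ. The distance-decay edge weight used here (distance scaled by the diagonal of the unit square) is likewise an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 200
nodes = rng.random((M, 2))  # 200 nodes drawn uniformly from [0, 1)^2

# Assumed distance-decay edge weight with values in [0, 1]:
# the largest possible distance in the unit square is sqrt(2).
dist = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=2)
E = 1.0 - dist / np.sqrt(2.0)

# 30% of the nodes are infected at t = 0.
r = np.zeros(M)
r[rng.choice(M, size=int(0.3 * M), replace=False)] = 1.0

dt, horizon = 0.01, 5
steps_per_day = round(1 / dt)

flows = [r.copy()]  # snapshot at t = 0
for step in range(1, horizon * steps_per_day + 1):
    force = E @ r / M               # sample average approximating the integral over F
    r = r + dt * (1.0 - r) * force  # explicit Euler step of the assumed mean-field form
    if step % steps_per_day == 0:
        flows.append(r.copy())      # observe only at the end of each day

flows = np.array(flows)             # shape (6, M): observations at t = 0, 1, ..., 5
print(flows.mean(axis=1))           # average infection level at each observation time
```

Since the infection force is at most 1 and $dt \le 1$, each Euler step keeps $r$ inside $[0,1]$ and never decreases it, which matches the monotone nature of the spreading process.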

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network in Example 2, the spreading process is generated only once, in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube; i.e., $G = \{(0.1k, 0.1l) : k, l = 0, 1, \dots, 10\}$.

Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. [Both panels have axes labelled "Estimated edge weight" and "True edge weight" on a 0 to 1 scale; legend: "Est vs True", "y = x".]

If all nodes are included in the computation of the NPSML estimator, there is in principle a 40000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network in Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network in Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computational load, we generate another $M_1 = 20$ nodes from the uniform distribution, which are used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, the $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network in Example 2, as no node pairs are needed, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize.

As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space; thus $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the observed $r$ on the 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.
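The parameter counts and the rule-of-thumb kernel widths quoted in this section can be reproduced with a few lines of arithmetic. The function name below is ours and purely illustrative; the rule $h = n^{-1/(k+1)}$ gives $n h^k = n^{1/(k+1)} \to \infty$ and $n h^{k+2} = n^{-1/(k+1)} \to 0$, so it satisfies both rate conditions.

```python
def rule_of_thumb_bandwidth(n, k):
    """Kernel width h = n^(-1/(k+1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))

# h1: edge function in Example 1 (M2 = 400 node pairs, dimension 2 * 2 = 4).
h1 = rule_of_thumb_bandwidth(400, 4)
# h2: smoothing the observed r on the 200 nodes (dimension 2).
h2 = rule_of_thumb_bandwidth(200, 2)
# h3: extending the estimated r from the 20 sampled nodes (dimension 2).
h3 = rule_of_thumb_bandwidth(20, 2)

M1, Q = 20, 3
params_full = M1 * M1          # node pairs for the fully nonparametric network
params_block = M1 * Q + Q * Q  # group weights plus interaction matrix

print(f"h1={h1:.4f}, h2={h2:.4f}, h3={h3:.4f}")
print(params_full, params_block)
```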

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; the best choice of $Q$ therefore corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E^1$ and $E^2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
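The selection rule can be sketched as below, using the criterion in its "larger is better" form, $\log L - \tfrac{k}{2} \log n$, with $k = M_1 Q + Q^2$ free parameters. The maximized log-likelihood values and the number of observations are placeholders invented purely to make the sketch runnable; they are not results from the paper.

```python
import math

def schwarz_bic(loglik, n_params, n_obs):
    """Schwarz criterion in the 'larger is better' form: log L - (k / 2) * log n."""
    return loglik - 0.5 * n_params * math.log(n_obs)

def select_block_number(loglik_by_q, m1, n_obs):
    """Return the Q with maximal BIC, counting M1 * Q + Q^2 free parameters."""
    scores = {q: schwarz_bic(ll, m1 * q + q * q, n_obs)
              for q, ll in loglik_by_q.items()}
    return max(scores, key=scores.get), scores

# Placeholder log-likelihoods for Q = 1, ..., 5 (hypothetical, for illustration only).
loglik_by_q = {1: -900.0, 2: -700.0, 3: -450.0, 4: -440.0, 5: -435.0}

# Assumed number of observations: 200 nodes observed at 6 time points.
best_q, scores = select_block_number(loglik_by_q, m1=20, n_obs=1200)
print(best_q, scores)
```

In this toy input the likelihood gain from a fourth or fifth block is too small to offset the extra $M_1 + 2Q + 1$ parameters each additional block brings, so the criterion settles on the smaller model.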

Table 1: Estimation accuracy of $E^2$.

Entries      Bias     Std     P value
$E^2_{11}$    0.021   0.032   0.468
$E^2_{12}$   -0.006   0.012   0.383
$E^2_{13}$   -0.003   0.029   0.057
$E^2_{21}$   -0.001   0.029   0.028
$E^2_{22}$    0.022   0.022   0.66
$E^2_{23}$   -0.002   0.028   0.059
$E^2_{31}$    0.005   0.024   0.165
$E^2_{32}$    0.018   0.029   0.48
$E^2_{33}$    0.016   0.021   0.554

In Figure 1, we plot the difference between the real edge function and the NPSML-estimated edge function on the set $G \times G$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; a red dot closer to the blue line therefore means better fitting accuracy. For most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x,y) \in G \times G$ with the null hypothesis $E_{xy} = (\hat{E}(x,y) - E(x,y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x,y)$, the distribution of the test statistic $E_{xy} / \sigma^2_{xy}$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma^2_{xy}$ is the asymptotic variance of the estimator $\hat{E}(x,y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that, in both examples, fewer than 10% of all 10000 evaluation pairs in $G \times G$ fail to support the null hypothesis. Our estimation accuracy is therefore quite satisfactory, which agrees with the visualization in Figure 1.
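The pairwise $\chi^2$ check described above can be sketched as follows. The estimated weights, true weights, and bootstrap variances in the example are hypothetical placeholders; the critical value uses the fact that the 90% quantile of a $\chi^2$ distribution with 1 degree of freedom is the square of the 95% quantile of the standard normal distribution.

```python
from statistics import NormalDist

# 90% quantile of chi-square(1): if Z is standard normal, P(Z^2 <= c) = 0.9
# is equivalent to P(|Z| <= sqrt(c)) = 0.9, so sqrt(c) is the 95% normal quantile.
CHI2_1_Q90 = NormalDist().inv_cdf(0.95) ** 2

def rejects_null(e_hat, e_true, sigma2, critical=CHI2_1_Q90):
    """True if (e_hat - e_true)^2 / sigma2 exceeds the chi-square(1) critical value."""
    return (e_hat - e_true) ** 2 / sigma2 > critical

def rejection_ratio(e_hat, e_true, sigma2):
    """Share of node pairs for which the null of equal weights is rejected."""
    flags = [rejects_null(a, b, s) for a, b, s in zip(e_hat, e_true, sigma2)]
    return sum(flags) / len(flags)

# Hypothetical estimates, true weights, and bootstrap variances for four node pairs.
est = [0.52, 0.31, 0.90, 0.10]
true = [0.50, 0.30, 0.70, 0.10]
var = [0.03 ** 2] * 4

ratio = rejection_ratio(est, true, var)
print(round(CHI2_1_Q90, 4), ratio)
```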

For the block network in Example 2, Table 1 presents the entry-wise accuracy of the estimated $E^2$ relative to (23). The first column presents the estimation bias; the second and third columns are the empirical standard deviation and the empirical P values of the estimates, from which we can conclude that the fitting accuracy is high.

As a robustness check, we also consider the synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10000, respectively, no significant difference can be detected in the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. The rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that, for the general network, there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation in estimation accuracy is not significant in any respect; this agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweets regarding the "Unite the Right rally". The "Unite the Right rally", also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, our collected data do not make it possible to track how every single message is spread from user to user; thus there is no way to directly identify the interaction network among users. It is possible, however, to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows where, at every time (day) $t$ since the origin (Aug 11, 2017), we have an $N$-dimensional $\{0,1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
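The conversion from reaction times to day-by-day distribution flows amounts to a cumulative indicator. A minimal sketch, assuming each user's first reaction has already been reduced to a day offset from the origin (the input list below is hypothetical, not taken from the collected data):

```python
import numpy as np

def distribution_flows(first_reaction_day, n_days):
    """Build the {0, 1}-valued flows: row t has a 1 in column i if and only if
    user i has reacted at least once by day t."""
    first = np.asarray(first_reaction_day)
    days = np.arange(n_days)[:, None]
    return (first[None, :] <= days).astype(int)

# Hypothetical day offsets (0 = origin day) of each user's first reaction.
first_reaction_day = [0, 2, 2, 5, 1]
flows = distribution_flows(first_reaction_day, n_days=4)
print(flows)
```

Each row is one $N$-dimensional observation; a user whose first reaction falls after the last day in the window (the fourth user here) stays at 0 throughout.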

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for predicting other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space whose coordinates represent the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator of the membership weight function $E^1$ and the interaction matrix $E^2$ can be derived. The aggregated estimator of the interaction matrix $E^2$ is the average over all subsample estimators; for the block membership weight $E^1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

As a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take an $M_1$ much smaller than the number of all users in the record (10000+) to approximate the membership weight function $E^1$ and the distribution function $r$. To check the robustness of our estimation with respect to the choice of $M_1$, we first run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then taken as the feature vectors. Feature vectors selected in this way are distributed within the feature space asymptotically in the same way as the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E^1$.
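The choice of the $M_1$ representative nodes can be sketched with a plain Lloyd-type K-means written in NumPy. The synthetic feature matrix, the random initialization from data points, and the fixed iteration count are assumptions made for illustration; with real features of very different scales (coordinates versus follower counts), some standardization would normally be applied first, which the text does not specify.

```python
import numpy as np

def kmeans_centres(features, m1, n_iter=50, seed=0):
    """Plain Lloyd iterations: return m1 cluster centres and the labels
    from the last assignment step."""
    rng = np.random.default_rng(seed)
    # Initialize the centres with m1 distinct data points.
    centres = features[rng.choice(len(features), size=m1, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centre, shape (N, m1).
        d2 = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(m1):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[k] = members.mean(axis=0)
    return centres, labels

# Hypothetical stand-in for the 5-dimensional user features
# (latitude, longitude, friends, followers, history posts), already standardized.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))

centres, labels = kmeans_centres(X, m1=100)
print(centres.shape)  # one 5-dimensional "synthetic user" per cluster
```

The returned centres play the role of the sampled nodes on which $r$ and $E^1$ are evaluated; they need not coincide with any real user.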

The choice of the best block number is again based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we take a block network with 4 blocks as the final model for further analysis.

Figure 2: BIC for different block numbers. [Vertical axis: BIC, with ticks from −10350 to −10000; horizontal axis: block_dim = 3, block_dim = 4, block_dim = 5.]

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic, in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10000+ users. Because the clustering is carried out on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information loss when a higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is clearly not close to any city or group of cities in our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and of the real users it represents is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of information spreading processes on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat     Lon
Big name community           14747.39    1238.35   1494.94         30.78   -89.99
Famous active community      5356.41     259.67    1373.72         34.18   -117.59
Famous inactive community    5001.97     35.19     1022.22         40.75   -82.55
Nobody community             216.58      37.70     1135.93         46.77   -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) the big name community, within which users are more likely to have a very large group of followers and friends and are also highly active on Twitter; (2) the nobody community, within which users have a fairly small number of followers and friends compared to the other three communities and are not very active in terms of history posts either; (3) the famous inactive community, in which users have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group), but they are less likely to interact with others on Twitter and are therefore not active; (4) the famous active community, in which users do have many followers but, unlike the inactive community, the average number of friends and history posts is large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the numbers of followers and friends are distributed highly unevenly within every community, with only one or two synthetic users having extremely large values; this uneven distribution suggests a classical centre-periphery structure within a community, in which the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media: every user has the same right to express their own opinion, no matter whether or not they are famous or influential in real life; (3) although users within each community are not clustered spatially, there exists a weak spatial segregation pattern among the four communities (the segregation is better visualized in Figure 4); further studies are needed to better understand the source of this spatial segregation.

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A significant hierarchical structure is apparent in the link matrix: the big name community dominates all the other communities in terms of its sensitivity to social opinions, followed by the famous active community. But, compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Users in the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from insiders or from users in the nobody community; this observation


Figure 3: Spatial distribution of features of users within different communities (famous inactive, famous active, big name, and nobody). (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of post history within different communities.

Table 3: Link matrix of 4 communities.

| | From big name community | From famous active community | From famous inactive community | From nobody community |
|---|---|---|---|---|
| To big name community | 1 | 1 | 1 | 1 |
| To famous active community | 1 | 1 | 0.701 | 0.637 |
| To famous inactive community | 0.175 | 0.365 | 0 | 0 |
| To nobody community | 0 | 0 | 0 | 0.01 |
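The row and column conventions of Table 3 can be illustrated with a short Python sketch. It assumes that the table entries are decimals in [0, 1] (for example, that the entry printed as 0701 is 0.701); the helper names `incoming_from_others` and `outgoing_to_others` are hypothetical and only summarize the matrix.

```python
import numpy as np

communities = ["big name", "famous active", "famous inactive", "nobody"]

# Rows are receivers ("To"), columns are senders ("From").
# The decimal values are an assumed reading of the table entries.
link = np.array([
    [1.0,   1.0,   1.0,   1.0],    # to big name
    [1.0,   1.0,   0.701, 0.637],  # to famous active
    [0.175, 0.365, 0.0,   0.0],    # to famous inactive
    [0.0,   0.0,   0.0,   0.01],   # to nobody
])

def incoming_from_others(matrix):
    """Total link strength each community receives from the other communities."""
    return matrix.sum(axis=1) - np.diag(matrix)

def outgoing_to_others(matrix):
    """Total link strength each community exerts on the other communities."""
    return matrix.sum(axis=0) - np.diag(matrix)

received = dict(zip(communities, incoming_from_others(link)))
exerted = dict(zip(communities, outgoing_to_others(link)))
```

Under this reading, the big name community receives the largest total strength from the other communities, while the nobody community receives none from outside, which is the hierarchy described in the text.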


Figure 4: Estimate for the interaction matrix among the famous inactive, famous active, big name, and nobody communities.

reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

From the above analysis, quite a few interesting features can be drawn out of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to do more text mining on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and reestimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable; for spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, in which the spreading process of tweet articles regarding the "Unite the Right rally" event is studied and a block network is fitted. The fitting result shows that the Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active users, famous inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure, and opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy and computation cost is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract distribution flows of arbitrarily high frequency from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and is not adjustable at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint leading to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.
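The second limitation can be illustrated with a small Python sketch that turns exact first-posting times into {0, 1}-valued distribution flows on an arbitrary observation grid. The function name `distribution_flow` and the posting times are hypothetical, and the sketch assumes that a node stays infected once it has posted.

```python
import numpy as np

def distribution_flow(post_times, obs_times):
    """Return a (T, N) array of {0, 1} statuses.

    post_times[i] is the time node i first posted (np.inf if it never did);
    entry (k, i) is 1 when node i has posted by obs_times[k].
    """
    post_times = np.asarray(post_times, dtype=float)
    obs_times = np.asarray(obs_times, dtype=float)
    return (post_times[None, :] <= obs_times[:, None]).astype(int)

# Hypothetical first-posting times (hours) for five users; inf = never posted.
post_times = [0.5, 2.0, np.inf, 7.25, 3.1]

# The same records yield a coarse flow and a much finer one.
coarse = distribution_flow(post_times, obs_times=[0, 4, 8])
fine = distribution_flow(post_times, obs_times=np.arange(0, 8.5, 0.5))
```

When only `coarse` is stored, as in the snapshot setting, the finer grid cannot be recovered, which is why the observation frequency is then fixed by the data source.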

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was done by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems: application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.


[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.


estimation of ABMs, as it is often impossible to derive an analytic expression for the standard error functions in the ABM setting; simulation can help generate an empirical version of the error function and facilitate the application of the standard ordinary least squares (OLS) and maximum likelihood (ML) estimation strategies. However, simulation-based estimation is more frequently applied to parametric ABMs, where only a finite-dimensional parameter vector is to be estimated; it is rarely used to estimate the hidden network structure, as the unknown network is essentially nonparametric and is less tractable than parametric models. To the best of our knowledge, the only exceptions are Grazzini and Richiardi [35] and Kukacka and Barunik [36], in which the interaction mechanism when two agents meet is allowed to include a nonparametric component, and the kernel smoothing method and nonparametric likelihood (or least squares) estimators are applied to cope with model estimation. However, Grazzini and Richiardi [35] and Kukacka and Barunik [36] do not include the interaction network between agents in their analysis, nor is the model identifiability issue resolved; thus, further exploration is needed in this direction.

The other related works deal with link prediction by stochastic-network models. In this field, nonparametric techniques are more often adopted to make inference on the hidden features of a stochastic network [23, 31, 32, 37, 38]. Lu and Zhou [31] review the mainstream heuristic algorithms for forecasting the missing links within a partially observed network. Bickel et al. [39], from the perspective of statistical inference, summarize and validate the application of the variational expectation maximization (VEM) algorithm to infer the probability that a link exists between two nodes from observed edge data. Matias et al. [38] extend the VEM method to deal with the future occurrence probability of edges, given a dynamic linked network and the historic edge data; this extended method can handle the case where the evolution of the occurrence probability depends nonparametrically on an unknown hazard function. All these methods were developed under the common assumption that the edge information of at least part of the network has already been observed, which is possible for trajectory data but not for distribution flows. Thus, a further extension is needed to handle the case in which all edge data are missing.

In the physics literature, the task of detecting the hidden network link structure from node-level time-series data is phrased as "network reconstruction." Taking distribution flows as the input, two outstanding network reconstruction methodologies are directly comparable to ours. One is based on the compressive sensing technique, as proposed in Shen et al. [40]; the other is based on the combination of likelihood estimation and the mean-field approximation technique, as discussed in Roudi and Hertz [41]. The basic idea in Shen et al. [40] is to convert the network reconstruction problem into a classical convex optimization problem with linear constraints, the so-called compressive sensing (CS) problem. In the CS problem, the linear constraints come from the transition probability of nodes within the network from the uninfected state to the infected state, while the objective function arises from the sparsity assumption regarding the network link structure. Unlike the applications of the CS approach to network reconstruction from continuous time-series data [42–44], where the feature variables associated with every node are directly observable, in the case of distribution flows the key variable, the transition probability, is not observable from the data. Therefore, it has to be calculated so as to form the required linear constraints. Inferring the transition probability from the {0, 1}-valued distribution flow data requires a stationarity assumption on the underlying model, which is too restrictive in many applications. For instance, in the spreading of a virus, an agent might die immediately after it is infected, in which case the infected agent is censored in the sense that its infection status is constantly one from the time of infection onward. When censored agents exist in the network, stationarity of the transition is impossible, and the CS framework in Shen et al. [40] is no longer applicable. The other problem of the CS framework is its inability to handle the spatial heterogeneity among different nodes. As we have highlighted, education, wealth, and many other social-economic factors can play critical roles in determining the link strength among people and therefore affect the information spreading dynamics, so modelling the dependence of the hidden link structure on those social-economic factors is necessary in studies of social networks. The inclusion of social-economic factors introduces heterogeneity among nodes, which makes it challenging to identify which nodes are relatively homogeneous and can be grouped together. In the CS framework, grouping different nodes is the premise for calculating the transition probability. In an abstract network, all nodes are homogeneous and the grouping can simply be taken as the set of all nodes, as done in Shen et al. [40], while in a spatial network with widespread heterogeneity such a simple grouping trick is meaningless. How to extend the CS framework to spatial social networks is therefore a difficult problem, and extensive studies are needed.

The underlying reason that restricts the CS framework is its reliance on the unobservable transition probability. That restriction can be effectively resolved by applying the likelihood technique, as suggested in Roudi and Hertz [41]. The advantage of the likelihood-based approach is that it can compute the unknown transition probability simultaneously with the other model parameters. But the computation usually takes too much time: because there is no explicit solution to the first-order condition of the maximum likelihood problem, a numerical solution is required. To make the computation easier, a mean-field approximation technique is presented in Roudi and Hertz [41], which can increase the computation speed. However, the approximation only works when all link strengths are close to zero, which restricts its usefulness in many applications to social networks. On the other hand, the current version of the approximation technique in Roudi and Hertz [41] still assumes an abstract network structure, and no dependence of the link strength on social-economic factors is allowed; it is unclear whether the approximation can be extended to account for the reconstruction of spatial social networks. Finally, Roudi and Hertz [41] are only concerned with the situation in which the number of nodes ($N$) is relatively small and the computational complexity comes mainly from numerically solving the maximum likelihood problem. But when $N$ is large, the computational complexity is dominated by the matrix multiplication for the $N \times N$ adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires the matrix multiplication, its speed-up effect for giant networks may not be that significant. More exploration of the fast reconstruction of giant spatial social networks is needed.

3. Model Setup

3.1. Feature Space Network. We consider a weighted multidimensional spatial social network, where nodes of the network are considered as elements of a $p$-dimensional Euclidean space $\mathbb{R}^p$ and every dimension of $\mathbb{R}^p$ is interpreted as a feature of nodes; thus $\mathbb{R}^p$ is interpretable as a feature space. Edges between nodes are assumed to depend on the features of nodes in a smooth way; that is, the edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., the function is smooth at all points except those contained in a zero-measure set), denoted as $E : \mathbb{R}^p \times \mathbb{R}^p \to [0, 1]$, where, without loss of generality, the edge weight between two nodes is restricted to the unit interval. Such a specification admits a stochastic-network interpretation of our model: the weight can be thought of as the probability that two nodes share an edge. Since the nodes of the network may not be evenly distributed within the entire space $\mathbb{R}^p$, we assume, without loss of generality, that the distribution of nodes is characterized by a probability measure $F$ on $\mathbb{R}^p$, and $F$ is supposed to be known from the data. In sum, the $p$-dimensional spatial network can be recorded as $G(\mathbb{R}^p, E, F)$, or shortly $G$ when there is no ambiguity regarding its node space, distribution, and edge function.
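To make the triple $G(\mathbb{R}^p, E, F)$ concrete, the following Python sketch builds one possible instance with finitely many nodes. The Gaussian form of the edge function, the bandwidth, the standard normal stand-in for $F$, and the name `gaussian_edge` are all assumptions chosen for illustration; the model itself only requires $E$ to be (almost everywhere) smooth with values in $[0, 1]$.

```python
import numpy as np

def gaussian_edge(x, y, bandwidth=1.0):
    """A hypothetical smooth edge function E(x, y) with values in [0, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
p = 3                                # dimension of the feature space
nodes = rng.normal(size=(50, p))     # 50 nodes drawn from a stand-in for F

# Weight matrix W[i, j] = E(nodes[i], nodes[j]), computed by broadcasting.
weights = gaussian_edge(nodes[:, None, :], nodes[None, :, :])
```

This particular edge function is symmetric in its two arguments; the general definition does not require symmetry, so a directed network would use an $E$ with $E(x, y) \neq E(y, x)$.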

There are several advantages to assuming that the spreading process and distribution flows occur within $G$. First, the embedding of the node set into the feature space $\mathbb{R}^p$ allows us to characterize the feature information of nodes that is external to the network structure [21, 22, 27], which is usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors, such as the intensity of population gathering in a set of locations, can significantly increase the capacity to forecast the spreading of illness among residents. Viboud et al. [45] report similar findings. Second, allowing nodes to be unevenly distributed within the feature space lets us include more general networks in the analysis. For instance, by a proper choice of the measure $F$ (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes sitting in the infinite feature space $\mathbb{R}^p$; this allows us to include most networks that we meet in practice. Finally, allowing the edge weight to depend smoothly on the features of both the flow-in and flow-out nodes makes it possible to incorporate background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is the computational efficiency it induces. In fact, when a network consists of a giant number of nodes, even a simple summation operation can take a long time and huge memory, but when edges vary smoothly along with nodes, it becomes possible to do the calculation only on a small set of nodes; the global features of edges can then be inferred from the result on the relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we will concentrate on the spatial network $G(\mathbb{R}^p, E, F)$ instead of a more general concept of network.
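The computational shortcut mentioned above can be sketched with a Nadaraya-Watson kernel smoother: a smooth node-level quantity is evaluated on a small subset of nodes and then extended to all nodes by a kernel-weighted average. This is a generic illustration of kernel smoothing, not the estimator developed in this paper; the test function `node_feature`, the bandwidth, and the sample sizes are assumptions.

```python
import numpy as np

def nadaraya_watson(anchor_x, anchor_vals, query_x, bandwidth):
    """Gaussian-kernel weighted average of anchor_vals at each query point."""
    d2 = ((query_x[:, None, :] - anchor_x[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return kernel @ anchor_vals / kernel.sum(axis=1)

def node_feature(x):
    """Hypothetical smooth node-level quantity (expensive in a real setting)."""
    return np.exp(-((x - 0.5) ** 2).sum(axis=1))

rng = np.random.default_rng(0)
nodes = rng.uniform(size=(4000, 2))            # all nodes in a 2-d feature space
anchors = nodes[rng.choice(len(nodes), size=400, replace=False)]

anchor_vals = node_feature(anchors)            # computed on the small set only
estimate = nadaraya_watson(anchors, anchor_vals, nodes, bandwidth=0.1)
mean_abs_error = np.abs(estimate - node_feature(nodes)).mean()
```

Here the expensive quantity is evaluated at 400 anchor nodes rather than at all 4000; the quality of the extension depends on the smoothness of the quantity and on the bandwidth, which would have to be tuned in practice.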

3.2. Mean-Field Models. To model spreading processes within a spatial network $G(\mathbb{R}^p, E, F)$, we follow the convention in the rumor spreading literature [10, 17] and adopt the common assumption that a rumor can spread from a node $x$ to another node $y$ if and only if (1) the initial node $x$ has been infected with the rumor, recorded as the event $I(x) = 1$; (2) there is an edge between them, or equivalently $E(x, y) > 0$; and (3) when conditions (1) and (2) hold, whether or not the spreading actually happens is purely random, up to a probability $r$. Different spreading models impose different requirements on the probability $r$. In the current study, we adopt the mean-field model to determine $r$, as suggested in most previous studies. Formally, for every fixed time $t$, the probability of node $x \in \mathbb{R}^p$ being infected is determined by the following mean-field equation:

$$\frac{dr(x, t)}{dt} = (1 - r(x, t)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y, t)\, dF(y). \tag{1}$$

The interpretation of (1) is that, at time $t$, the temporal rate of change of the probability that node $x$ is infected (represented by $dr(x,t)/dt$) is proportional to the probability that node $x$ has not yet been infected by time $t$ (represented by $1 - r(x,t)$), and the proportionality factor is a weighted sum of the probabilities of all other nodes in the network having been infected by $t$. The weight function describes the strength of connection between nodes $x$ and $y$ and thus can be formulated as the edge function $E$. Using the classical results on mean-field equations [46–50], it can be easily verified that the infection probability $r(x,t)$ in (1) is exactly equal to the probability of $I(x,t) = 1$ for a given right-continuous mean-field point process $I$ satisfying the following (with $\mathbb{E}$ denoting expectation, to distinguish it from the edge function $E$):

$$\mathbb{E}\big(I(x,t) - I(x,t-) \mid I(x,t-) = 0\big) = \int_{\mathbb{R}^p} E(x,y)\, I(y,t-)\, dF(y), \tag{2}$$

where $I(x,t-)$ is the left limit of the process $I(x,\cdot)$. The interpretation of (2) is more straightforward than that of (1): (2) states that the average rate at which node $x$ becomes infected is contributed by all those nodes that (1) have a connection to $x$ and (2) have been infected by the current time. These two conditions are often imposed in the literature.
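As a quick sanity check of (1), it may help to work out a homogeneous special case. The following is only an illustrative sketch under assumptions that are not in the text: the edge weight is a constant $\beta \in (0, 1]$ (a hypothetical symbol), every node starts with the same infection probability $r_0 \in (0,1)$ at time $t_0$, and $F$ is a probability measure. Under these assumptions $r$ does not depend on $x$, and (1) collapses to the logistic equation:

```latex
% Homogeneous special case of (1): E(x,y) \equiv \beta, r(x,t_0) \equiv r_0.
% Since F is a probability measure, \int_{\mathbb{R}^p} \beta\, r(t)\, dF(y) = \beta\, r(t).
\frac{dr(t)}{dt} = \beta\,\bigl(1 - r(t)\bigr)\, r(t),
\qquad
r(t) = \frac{r_0\, e^{\beta (t - t_0)}}{1 - r_0 + r_0\, e^{\beta (t - t_0)}} .
```

In this special case the infection probability increases monotonically from $r_0$ toward 1, which matches the reading of (1) given above: the growth rate is the product of the uninfected probability and the weighted infected mass.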

Let $r$ be a function satisfying the functional differential equation (1), and denote by $f$ the density or mass function associated with the probability measure $F$. Then the event that a given node $x$ is observed at time $t$ and its status is observed to be infected has the probability density

$$p_1(x,t) = f(x)\, r(x,t); \tag{3}$$


in contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given as

$$p_0(x,t) = f(x)\, \big(1 - r(x,t)\big). \tag{4}$$

Suppose that, given a time $t$, the infectious status of a set of randomly picked nodes $\mathcal{N} \subset \mathbb{R}^p$ is observable and represented as

$$\mathcal{O}_t = \{ I(x,t) : x \in \mathcal{N} \}, \tag{5}$$

with $I(x,t) = 0$ meaning not infected and $I(x,t) = 1$ meaning infected. Then the likelihood function of the observations $\mathcal{O}_t$ can be written in the following way by using (3) and (4):

$$L(\mathcal{O}_t, E) = \prod_{x \in \mathcal{N}} \big(f(x)\, r(x,t)\big)^{I(x,t)} \big(f(x)\,(1 - r(x,t))\big)^{1 - I(x,t)}, \tag{6}$$

where we add the edge function $E$ into the likelihood because it affects $L$ by determining the functional form of $r$. Maximizing (6) yields the classical maximum likelihood (ML) estimator of $E$.

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing. In the study of spreading processes, only the distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $E$, are not observable and thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $E$, given the observed distribution flows $\{\mathcal{O}_{t_i} : i = 1, \dots, T\}$, $t_1 < \cdots < t_T$, on a sequence of times. The NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. NPSML applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly what we need to handle: because the distribution function $r$ in (6) is the solution to the functional differential equation (1), there is no clean analytic expression available for it.

However, our task is different from the situation originally discussed in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), by contrast, the likelihood depends on the nonparametric edge function $E$. In this situation, one extra kernel smoothing step is needed to approximate $E$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted at the level of random variables, while in our case simulation is at the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, because the model setup is nonparametric, the model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use $K^p$ to denote the $p$-dimensional standard Gaussian kernel function, and $K^p_h(x) = K^p(x/h)/h^p$ for some positive constant $h$).

Step 1. Select a constant $dt > 0$ and large positive integers $M_1$ and $M_2$ ($dt$ is the length of every time step used for numerically solving the functional differential equation (1); $M_1$ and $M_2$ are the numbers of random samples that will be drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function).

Step 2. Draw $M_1$ random samples $x_1, \dots, x_{M_1} \in \mathbb{R}^p$ from the distribution $F$ and $M_2$ random samples $w_1, \dots, w_{M_2} \in \mathbb{R}^p \times \mathbb{R}^p$ from the product measure $F \otimes F$.

Step 3. Given $e_1, \dots, e_{M_2} \in [0,1]$, construct the function $E$ as follows:

$$E(w) = \frac{\sum_{i=1}^{M_2} K^{2p}_{h_1}(w - w_i) \cdot e_i}{\sum_{j=1}^{M_2} K^{2p}_{h_1}(w - w_j)}. \tag{7}$$

Step 4. Given $t_i$, let $\mathcal{O}_{t_i} = \{I(y_1, t_i), \dots, I(y_M, t_i)\}$ denote the observation set at time $t_i$, whose cardinality is $M$, and construct the function $r(\cdot, t_i)$ as follows:

$$r(y, t_i) = \frac{\sum_{l=1}^{M} K^{p}_{h_2}(y - y_l) \cdot I(y_l, t_i)}{\sum_{j=1}^{M} K^{p}_{h_2}(y - y_j)}. \tag{8}$$

Step 5. Solve the mean-field equation (1) over the interval $[t_i, t_{i+1})$ at the set of sample points $x_1, \dots, x_{M_1}$ drawn in Step 2 by Euler's method with time step $dt$, subject to the initial condition $r(\cdot, t_i)$, as follows:

$$r(x_j, t_i + (k+1) \cdot dt) = r(x_j, t_i + k \cdot dt) + \big(1 - r(x_j, t_i + k \cdot dt)\big) \cdot \frac{dt}{M_1} \cdot \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t_i + k \cdot dt), \tag{9}$$

where $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor$, and $\lfloor a \rfloor$ is the greatest integer not exceeding $a$.

Step 6. For the observation set $\mathcal{O}_{t_{i+1}} = \{I(y_1, t_{i+1}), \dots, I(y_{M'}, t_{i+1})\}$ at $t_{i+1}$ with cardinality $M'$, generate the simulated density at the sample nodes $y_l$, $l = 1, \dots, M'$, as follows:

$$r(y_l, t_{i+1}) = \frac{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j) \cdot r(x_j, t_{i+1})}{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j)}, \tag{10}$$

and construct the simulated likelihood function as follows:

$$\hat{L}(\mathcal{O}_{t_{i+1}}, e_1, \dots, e_{M_2}) = \prod_{l=1}^{M'} \big(f(y_l)\, r(y_l, t_{i+1})\big)^{I(y_l, t_{i+1})} \cdot \big(f(y_l)\,(1 - r(y_l, t_{i+1}))\big)^{1 - I(y_l, t_{i+1})}. \tag{11}$$


The full information likelihood function for all observation times can be constructed from (11) in the following way:

$$\hat{L}^{*}(\mathcal{O}_{t_i}, i = 1, \dots, T, e_1, \dots, e_{M_2}) = \prod_{i=1}^{T} \hat{L}(\mathcal{O}_{t_i}, e_1, \dots, e_{M_2}). \tag{12}$$

The estimator of the unknown edge function $E$ can be derived by maximizing the simulated full information likelihood function (12) over $e_1, \dots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e^{*}_1, \dots, e^{*}_{M_2}$ in the way of (7).
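The six steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the edge function is evaluated on all pairs of the $M_1$ sampled nodes, the Euler iterate is clipped to $[0,1]$ as a numerical safeguard, and the factors $f(y_l)$ in (11) are dropped from the log-likelihood because they do not depend on $e_1, \dots, e_{M_2}$.

```python
import numpy as np

def gaussian_kernel(u, h):
    """K_h(u) = K(u/h) / h^d for the d-dimensional standard Gaussian kernel K."""
    u = np.atleast_2d(u)
    d = u.shape[-1]
    z = u / h
    return np.exp(-0.5 * np.sum(z * z, axis=-1)) / ((2.0 * np.pi) ** (d / 2.0) * h ** d)

def kernel_smooth(query, centers, values, h):
    """Kernel-weighted average of `values` at `query`, the form shared by (7), (8), (10)."""
    diff = query[:, None, :] - centers[None, :, :]
    w = gaussian_kernel(diff, h)                      # shape (n_query, n_centers)
    return (w @ values) / np.maximum(w.sum(axis=1), 1e-300)

def euler_mean_field(r0, E_mat, t_span, dt):
    """Explicit Euler iteration (9) on the sampled nodes (clipping is an added safeguard)."""
    r = r0.copy()
    n_steps = int(np.floor(t_span / dt + 1e-9))
    for _ in range(n_steps):
        r = r + (1.0 - r) * dt * (E_mat @ r) / len(r)
        r = np.clip(r, 0.0, 1.0)
    return r

def simulated_log_likelihood(e, w, x, y_obs, I_obs, y_next, I_next,
                             t_span, dt, h1, h2, h3):
    """Log of (11) up to the additive term sum(log f(y_l)), which is free of e."""
    M1 = len(x)
    # Step 3: E(x_j, x_l) for all pairs; row j * M1 + l holds the pair (x_j, x_l).
    pairs = np.concatenate([np.repeat(x, M1, axis=0), np.tile(x, (M1, 1))], axis=1)
    E_mat = kernel_smooth(pairs, w, e, h1).reshape(M1, M1)
    # Step 4: smooth the observed 0/1 statuses at time t_i onto the sampled nodes.
    r0 = kernel_smooth(x, y_obs, I_obs.astype(float), h2)
    # Step 5: propagate to t_{i+1}.
    r1 = euler_mean_field(r0, E_mat, t_span, dt)
    # Step 6: smooth back onto the nodes observed at t_{i+1}.
    r_hat = np.clip(kernel_smooth(y_next, x, r1, h3), 1e-12, 1.0 - 1e-12)
    return np.sum(I_next * np.log(r_hat) + (1 - I_next) * np.log(1.0 - r_hat))
```

Summing `simulated_log_likelihood` over consecutive observation times gives the logarithm of (12), which could then be passed (negated) to any box-constrained optimizer over $e \in [0,1]^{M_2}$.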

Compared to the NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $r$: one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the $M_1$ samples drawn in Step 2. However, when the system has a huge node set (say, millions), a sample size $M_1 \ll M$ can be used in order to improve the computational efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked as uninfected at some time $t$, it is regarded as safe and is missing from the tracking at the next few observation times. In this interval-censored situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the "curse of dimensionality" despite its nonparametric nature, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument, and the fact that in most real-world applications the number of observed nodes is huge, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated evaluation of the product of an $M_1 \times M_1$ matrix and an $M_1$-dimensional vector, so the computational complexity is of order $M_1^2$. Although $M_1$ can be taken much smaller than the number of nodes in the observations ($M$), it still has to increase as $M$ increases. So when $M$ is huge, $M_1$ has to be large as well, and the computational complexity of the entire estimation procedure is dominated by $M_1^2$. In this section, we propose a fast algorithm that reduces the computational complexity in (9) to be linear in $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbors. In our setting, there is no strict "neighbor" defined, but it is still possible to randomly pick one agent from the entire population, and the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 of the algorithm in Section 3.3 is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \dots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \dots, x_{M_1}\}$.

Step 5(2). Compute

$$r(x_j, t + dt) = r(x_j, t) + \big(1 - r(x_j, t)\big) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\, dt. \tag{13}$$

Step 5(3). Repeat the above two steps for all $t = t_k$ (the $k$th Euler time point in $[t_i, t_{i+1})$), $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$, and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \dots, x_{M_1}$) is replaced with a scalar multiplication, so the resulting computational complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
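The contrast between (9) and (13) can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, and for simplicity the edge weights are stored as a precomputed $M_1 \times M_1$ array, whereas a genuinely linear-cost implementation would evaluate $E$ only at the $M_1$ picked pairs.

```python
import numpy as np

def full_step(r, E_mat, dt):
    """One Euler step of (9): each node averages over all M1 nodes, O(M1^2) work."""
    return r + (1.0 - r) * dt * (E_mat @ r) / len(r)

def fast_step(r, E_mat, dt, rng):
    """One step of (13): each node interacts with a single random partner, O(M1) work."""
    n = len(r)
    partner = rng.integers(0, n, size=n)   # index l(j, t), uniform over all sampled nodes
    idx = np.arange(n)
    return r + (1.0 - r) * E_mat[idx, partner] * r[partner] * dt
```

Because the partner is drawn uniformly from the $M_1$ sampled nodes, the expected increment of `fast_step` equals the averaged increment of `full_step`; averaging many independent fast steps from the same state should therefore reproduce the full step up to Monte Carlo error.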

For the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i (t_{i+1} - t_i)$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following (here and in (14), $\mathbb{E}$ denotes expectation, to distinguish it from the edge function $E$):

(i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13);

(ii) denote by $\Delta(j,t)$ the increment $\Delta(j,t) = (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\operatorname{cov}(\Delta(j,t), \Delta(j',t') \mid r(x_j, t_i)) \le t_{i+1} - t_i$.

Property (i) and the case $j' \neq j$ in (ii) are trivial. For $t_i < t < t' < t_{i+1}$, $\operatorname{cov}(\Delta(j,t), \Delta(j,t') \mid r(x_j,t_i))$ can be decomposed as the sum of the following two components:

$$\begin{aligned}
A &= \operatorname{cov}\big(\Delta(j,t),\ (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\big) \\
  &= \operatorname{var}\big(r(x_j,t) \mid r(x_j,t_i)\big) \cdot \mathbb{E}\big(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\big) \cdot \mathbb{E}\big(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\big) \\
  &\le \operatorname{var}\big(r(x_j,t) \mid r(x_j,t_i)\big) = \operatorname{var}\big(r(x_j,t) - r(x_j,t_i) \mid r(x_j,t_i)\big) \\
  &\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\
B &= \operatorname{cov}\big(\Delta(j,t),\ (r(x_j,t') - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\big) \\
  &= \operatorname{cov}\big(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\big) \cdot \mathbb{E}\big(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\big) \cdot \mathbb{E}\big(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\big) \\
  &\le \operatorname{cov}\big(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\big) \\
  &\le \mathbb{E}\big(\,|r(x_j,t') - r(x_j,t)| \mid r(x_j,t_i)\big) \\
  &\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned} \tag{14}$$

where $\|\cdot\|_{\infty}$ is the $L^{\infty}$ norm of a bounded function. The above inequalities follow directly from the fact that $r$ is bounded by 1 and its temporal derivative, given by (1), is also uniformly bounded by 1; statement (ii) then follows immediately.

Using Properties (i) and (ii) and the law of large numbers, it follows that the difference between the likelihood function constructed from (9) and that constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two calculations of the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, with the fast algorithm, the choice of $dt$ is independent of the estimation accuracy, so in practice it can be selected directly as $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. In many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; the edge function can then be decomposed into two components:

(i) the group weight function $E^1 : \mathbb{R}^p \to [0,1]^Q$;

(ii) the group-level edge weight $E^2$, which is a $Q \times Q$ matrix with each entry valued in $[0,1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

$$E(x,y) = E^1(x)^{\top} E^2 E^1(y), \tag{15}$$

where the image of $E^1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ represents vector transpose. The group weight function is required to satisfy that, for every $x$ with $E^1(x) = (s_1, \dots, s_Q)$, there exists only one $i \in \{1, \dots, Q\}$ with $s_i > 0$, which means every node has positive probability of belonging to at most one group; this guarantees the requirement that the groups constitute a partition of the node set.
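The decomposition (15) and the partition constraint on the group weights can be illustrated as follows. This is a small sketch with hypothetical function names and made-up numbers for the group-level matrix; it only shows how the bilinear form is evaluated.

```python
import numpy as np

def block_edge(g_x, E2, g_y):
    """Edge weight (15): group weights of x, transposed, times E2, times group weights of y."""
    return float(np.asarray(g_x) @ np.asarray(E2) @ np.asarray(g_y))

def satisfies_partition_constraint(g):
    """True if the group-weight vector has entries in [0, 1] and at most one positive entry."""
    g = np.asarray(g, dtype=float)
    return bool(np.all((g >= 0) & (g <= 1)) and np.count_nonzero(g > 0) <= 1)

# Hypothetical 2-group example (values chosen only for illustration).
E2_demo = np.array([[0.2, 0.9],
                    [0.1, 0.5]])
weight = block_edge([0.5, 0.0], E2_demo, [0.0, 0.4])   # 0.5 * 0.9 * 0.4
```

With indicator-type group weights such as `[1, 0]` and `[0, 1]`, `block_edge` simply reads off the corresponding entry of the group-level matrix.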

The estimation of a block network is equivalent to the estimation of (1) the group weight function $E^1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E^2$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable in this case. The only modification is to Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$. Instead, it is applied to generate the estimate of the group weight $E^1$. The hidden weight function $E$ is then constructed from the kernel-smoothed $E^1$ and the given interaction matrix $E^2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve for is only $M_1 Q + Q^2$, while the number is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with $M_1^2$ (although slowly); when the number of observed nodes is huge, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly improve the computational efficiency.

In addition, a block network is much easier to identify than a general fully nonparametric network, as will be discussed in the next section. Finally, under a block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is straightforward and hence omitted).

Proposition 1. Denote by $E^1_i(x)$ the projection of the vector $E^1(x)$ to its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E^1_i(x) > 0\}$, which consists of the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x,y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

$$r(x) = \begin{cases} 0, & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E^2)^n r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\ 1, & \text{else,} \end{cases} \tag{16}$$

where $r(y, t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector to its $i$th dimension, and $(E^2)^n$ denotes the $n$th power of the matrix $E^2$.
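One way to read the zero branch of (16) computationally is as a reachability check on the group-level matrix. The sketch below makes two assumptions that are not in the text: the initial infection is summarized by a hypothetical $Q$-vector `v0` whose $q$th entry is positive exactly when group $q$ contains initially infected mass, and entry $(i, j)$ of the group-level matrix is read as the weight with which group $j$ infects group $i$ (consistent with the role of $E(x,y)$ in (1)). Since all entries are nonnegative, only the zero pattern matters, and powers $n = 1, \dots, Q$ suffice because any walk between two of the $Q$ groups can be shortened to at most $Q$ steps.

```python
import numpy as np

def never_reached_groups(E2, v0):
    """Mask of groups i whose i-th entry of (E2^n v0) is zero for every n >= 1."""
    A = (np.asarray(E2, dtype=float) > 0).astype(float)   # zero pattern of E2
    v = (np.asarray(v0, dtype=float) > 0).astype(float)   # zero pattern of v0
    Q = len(v)
    reached = np.zeros(Q, dtype=bool)
    for _ in range(Q):                     # walks of length 1..Q are enough
        v = (A @ v > 0).astype(float)      # pattern of the next power applied to v0
        reached |= v > 0
    return ~reached
```

For an upper triangular matrix with strictly positive diagonal and upper entries, seeding only the first group leaves all other groups unreached, while seeding only the last group reaches every group.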

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distributions with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E^2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \cdots & x \\
\vdots & \ddots & \ddots & & \vdots \\
0 & \cdots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix}, \tag{17}$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$r(x) = 1 \ \text{iff}\ x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i \iff r(x, t_0) > 0 \ \text{iff}\ x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i. \tag{18}$$

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E^1, E^2)$ are identifiable. However, in real applications, a spreading process can be observed at most a few times, so $m$ cannot be expected to be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable; i.e., there exists $E' \neq E$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set; i.e., the set $\mathcal{S}_{E_0, r(\cdot,t_0)} := \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot,t_0)}$ varies with the initial infectious status $r(\cdot, t_0)$. Formally, we have $E \in \mathcal{S}_{E_0, r(\cdot,t_0)}$ if and only if the following holds for all $n = 1, 2, \dots$:

$$\big(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_E\big)^n r(\cdot, t_0) \equiv \big(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_{E_0}\big)^n r(\cdot, t_0), \tag{19}$$

where $\mathcal{K}_E$ is a bounded operator over the function space $L^2(\mathbb{R}^p)$ defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x,y)\, g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplication operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the $n$-fold self-composition of an operator. (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(\cdot,t_0)}$. On the other hand, if there exists an open set $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(\cdot,t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(\cdot,t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$, it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions on $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. But in the special case of block networks, one simple identifiability condition can be worked out. In fact, for block networks, $(E^1_0, E^2_0)$ is identifiable if and only if there does not exist a pair $(E^1, E^2)$ that differs from the true $(E^1_0, E^2_0)$ but leads to the same likelihood function (6) in the limit case, which holds if and only if, for the true $E^2_0$, the vector space spanned by the family of vectors $\{v_t : t \ge t_0\}$ is the entire space $\mathbb{R}^Q$, i.e., $\{v_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $v_t = (v_{t1}, \dots, v_{tQ})^{\top}$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \dots, Q$, $v_{tq} = \int_{\mathbb{R}^p} E^1_{0q}(x)\, r(x,t)\, dF(x)$, where $E^1_{0q}$ is the $q$th entry of $E^1_0(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$\det\Big[\, v_{t_0},\ \operatorname{diag}(c - v_{t_0})\, E^2_0\, v_{t_0},\ \dots,\ \big(\operatorname{diag}(c - v_{t_0})\, E^2_0\big)^{Q-1} v_{t_0} \Big] \neq 0, \tag{20}$$

where $c$ is another $Q$-dimensional column vector $(c_1, \dots, c_Q)^{\top}$ determined by the true $E^1_0$ function such that $c_q = \int_{\mathbb{R}^p} E^1_{0q}(x)\, dF(x)$ for $q = 1, \dots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically", in the sense that the set of $E^2$'s that forces the determinant in (20) to be constantly equal to 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0,1]^{Q \times Q}$, and for those $E^2$'s for which the determinant is not constantly 0, the set of $v_{t_0}$ that forces it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0,1]^Q$. Therefore, (20) holds for almost all $E^2$ and $v_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
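The rank condition (20) is easy to check numerically once the group-level quantities are available. The sketch below is an illustration under assumptions: the function name is hypothetical, and the inputs (the group-level matrix, the vector $c$, and the initial vector $v_{t_0}$) are taken as given rather than estimated. It builds the Krylov-type matrix whose columns are the repeated images of $v_{t_0}$ and returns its determinant.

```python
import numpy as np

def identifiability_determinant(E2, c, v0):
    """Determinant in (20): columns v0, A v0, ..., A^(Q-1) v0 with A = diag(c - v0) E2."""
    E2 = np.asarray(E2, dtype=float)
    c = np.asarray(c, dtype=float)
    v0 = np.asarray(v0, dtype=float)
    A = np.diag(c - v0) @ E2
    cols = [v0]
    for _ in range(len(v0) - 1):
        cols.append(A @ cols[-1])          # next power applied to v0
    return float(np.linalg.det(np.column_stack(cols)))
```

For example, with two groups, a symmetric off-diagonal group matrix, $c = (0.5, 0.5)$ and $v_{t_0} = (0.1, 0.2)$, the determinant is nonzero, whereas the symmetric start $v_{t_0} = (0.2, 0.2)$ makes the two columns parallel and the determinant vanishes. In floating point, a tolerance (or a rank computation) is a safer test than exact comparison with zero.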

The "almost" identifiability for block networks guarantees that, in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the $E^1$ and $E^2$ estimated by the NPSML asymptotically converge to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows directly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in the previous sections is established.

Remark 2 (sparsity). Although in general complete identifiability for both the general network and the block network is hard to achieve, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x,y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E^2\|_2^2 = \sum_{i,j} (E^2_{ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the logarithm of the likelihood function (6); optimizing the penalized objective then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space; by compactness and convexity, there always exists a unique $E$ (or $E^2$) such that its $L^2$ distance to the origin reaches the minimum.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes drawn at random from the unit cube $[0,1)^2$; thus these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ decreases linearly with the standard Euclidean distance between two nodes, i.e.,

$$E(x,y) = 1 - \frac{\sqrt{\langle x - y,\ x - y \rangle}}{2}. \tag{21}$$

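Example 2 (block network). Set $Q = 3$; the block membership function $E^1$, written in terms of the two coordinates $(x, y)$ of a node, satisfies

$$E^1(x,y) = \begin{cases} (1, 0, 0), & \text{if } \dfrac{x+y}{2} < \dfrac{1}{3}, \\[4pt] (0, 1, 0), & \text{if } \dfrac{1}{3} \le \dfrac{x+y}{2} < \dfrac{2}{3}, \\[4pt] (0, 0, 1), & \text{else.} \end{cases} \tag{22}$$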

The matrix $E^2$ is given as follows:

$$E^2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23}$$

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, with the infected nodes picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
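A data-generating procedure of this kind can be sketched as follows. This is only one plausible discretization of (2), not the authors' simulation code: the function names are hypothetical, the integral in (2) is replaced by the empirical average over the simulated nodes, each uninfected node becomes infected within a step of length `dt` with probability equal to its rate times `dt`, and the example edge function reads (21) as one minus half the Euclidean distance.

```python
import numpy as np

def edge_example1(x, y):
    """Edge weight read as 1 - ||x - y|| / 2 (an assumed reading of (21))."""
    return 1.0 - np.sqrt(np.sum((x - y) ** 2, axis=-1)) / 2.0

def simulate_spread(nodes, edge_fn, init_frac, dt, t_end, obs_times, rng):
    """Discrete-time version of (2); returns a list of (time, infected mask) snapshots."""
    n = len(nodes)
    E_mat = edge_fn(nodes[:, None, :], nodes[None, :, :])        # (n, n) edge weights
    infected = np.zeros(n, dtype=bool)
    seeds = rng.choice(n, size=int(round(init_frac * n)), replace=False)
    infected[seeds] = True
    snapshots = [(0.0, infected.copy())]
    obs_steps = {int(round(t / dt)): float(t) for t in obs_times}
    for k in range(1, int(round(t_end / dt)) + 1):
        rate = E_mat @ infected.astype(float) / n                # empirical integral in (2)
        p = np.clip(rate * dt, 0.0, 1.0)                         # infection probability per step
        infected = infected | ((~infected) & (rng.random(n) < p))
        if k in obs_steps:
            snapshots.append((obs_steps[k], infected.copy()))
    return snapshots
```

Calling `simulate_spread(nodes, edge_example1, 0.3, 0.01, 5.0, [1, 2, 3, 4, 5], rng)` on 200 uniformly drawn nodes produces six snapshots, matching the observation scheme described above.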

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initially infected subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \dots, 10\}$.
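For concreteness, the evaluation lattice and the node pairs on which an edge function would be compared can be built as below. The variable names are hypothetical; the sketch only shows the sizes involved (121 grid points and 121 × 121 ordered pairs).

```python
import numpy as np

ticks = 0.1 * np.arange(11)                                   # 0.0, 0.1, ..., 1.0
grid = np.array([(a, b) for a in ticks for b in ticks])       # 121 lattice points in the unit square

# All ordered pairs of grid points, stored as rows (x_1, x_2, y_1, y_2).
ii, jj = np.meshgrid(np.arange(len(grid)), np.arange(len(grid)), indexing="ij")
grid_pairs = np.concatenate([grid[ii.ravel()], grid[jj.ravel()]], axis=1)
```

An estimated and a true edge function can then both be evaluated row by row on `grid_pairs` and compared pair by pair.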

If all nodes are included in the computation of theNPSML estimator there are in principle a 40000(= 200 times200)-dimensional parameter space for full nonparametricnetwork Example 1 and a 609(= 200times3+3times3)-dimensionalparameter space for block network Example 2 to be searchedwhich are too time consuming As in the introduction ofNPSML estimator by the smoothness of edge functionthe number of nodes actually used to evaluate the edgefunction can be much smaller than the size of the entirenode set So to reduce computation load we generate another1198721 = 20 nodes from the uniform distribution which will beused in Step 3 (Section 33) for simulating the distributionfunction 119903 Accordingly the 1198722 = 400 node pairs willbe selected as the product of the 20 nodes for the fullynonparametric Example 1 then there are 400 parameters tooptimize in Example 1 and the size is quite reasonable formost nonparametric tasks For the block network Example 2as no node pairs are needed for block networks there areonly 69(= 20 times 3 + 3 times 3) parameters to optimize As for theselection of kernel width ℎ1 ℎ2 and ℎ3 we set ℎ1 = 400minus15ℎ2 = 200minus13 and ℎ3 = 20minus13 This is because the kernelsmooth method requires kernel width ℎ to satisfy 119899ℎ119896 997888rarr infinand 119899ℎ119896+2 997888rarr 0 in order to guarantee the consistency andasymptotic normality [28 29 36 52] where 119899 is input samplesize and 119896 is the dimension of the data By a rule of thumbwe select the kernel width as ℎ = 119899minus1(119896+1) For ℎ1 it is onlyused in Example 1 to estimate the edge function where thesample size is1198722 = 400 and the data dimension is two timesof the dimension of node space thus 119896 is 4 For ℎ2 and ℎ3they are used in both examples for estimating the distributionfunction 119903 thus data dimension 119896 is always 2The sample size


Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. Axes: estimated edge weight and true edge weight; legend: Est vs True, y = x.

for $h_2$ is 200 because it is used to turn the observed $r$ on the 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.
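The rule of thumb $h = n^{-1/(k+1)}$ can be checked numerically. The following is a minimal sketch, not the authors' code; the function name `rule_of_thumb_bandwidth` and the dictionary layout are assumptions made for illustration. It computes the three kernel widths from their (sample size, dimension) pairs and evaluates the two quantities $n h^k$ and $n h^{k+2}$ that appear in the consistency conditions.

```python
def rule_of_thumb_bandwidth(n: int, k: int) -> float:
    """Kernel width h = n^(-1/(k+1)) for sample size n and data dimension k."""
    if n <= 0 or k <= 0:
        raise ValueError("n and k must be positive")
    return n ** (-1.0 / (k + 1))


# (sample size n, data dimension k) for the three widths used in the simulation
settings = {"h1": (400, 4), "h2": (200, 2), "h3": (20, 2)}
bandwidths = {name: rule_of_thumb_bandwidth(n, k) for name, (n, k) in settings.items()}

for name, h in bandwidths.items():
    n, k = settings[name]
    # With this rule, n*h^k = n^(1/(k+1)) grows with n and n*h^(k+2) = n^(-1/(k+1)) shrinks.
    print(f"{name}: h={h:.4f}  n*h^k={n * h**k:.3f}  n*h^(k+2)={n * h**(k + 2):.3f}")
```

With this choice, $n h^k = n^{1/(k+1)}$ and $n h^{k+2} = n^{-1/(k+1)}$, so both asymptotic conditions hold as $n$ grows.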

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore the best choice of $Q$ corresponds to the maximal BIC. In practice it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$, $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
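The selection of $Q$ by maximal BIC can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the log-likelihood values and the observation count `n_obs` are made-up numbers, the helper names are hypothetical, and the parameter count $M_1 Q + Q^2$ follows the 20 × 3 + 3 × 3 count given for Example 2. The criterion is written in the "larger is better" form $\log L - \tfrac{1}{2} p \log n$.

```python
import math
from typing import Dict


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Schwarz criterion in 'larger is better' form: log L - (p / 2) * log n."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


def block_param_count(m1: int, q: int) -> int:
    """m1 * q membership weights plus a q-by-q interaction matrix."""
    return m1 * q + q * q


def select_block_number(log_liks: Dict[int, float], m1: int, n_obs: int) -> int:
    """Return the candidate Q whose fitted model has the largest BIC."""
    scores = {q: bic(ll, block_param_count(m1, q), n_obs) for q, ll in log_liks.items()}
    return max(scores, key=scores.get)


# Hypothetical maximized log-likelihoods for Q = 1..5 (illustrative numbers only).
log_liks = {1: -5200.0, 2: -4700.0, 3: -4300.0, 4: -4290.0, 5: -4285.0}
best_q = select_block_number(log_liks, m1=20, n_obs=10_000)
print("selected Q:", best_q)
```

In this toy input the likelihood gain beyond $Q = 3$ is too small to offset the penalty on the extra parameters, which is the behaviour the criterion is meant to capture.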

In Figure 1, we plot the real edge function against the NPSML-estimated edge function on the set $G \times G$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted along the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; a red dot closer to the blue line therefore means better fitting accuracy. For most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in G \times G$ with the null hypothesis $E_{xy} := (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy}/\sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in

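Table 1: Estimation accuracy of $E_2$.

| Entry | Bias | Std | P value |
|---|---|---|---|
| $E_{2,11}$ | 0.021 | 0.032 | 0.468 |
| $E_{2,12}$ | -0.006 | 0.012 | 0.383 |
| $E_{2,13}$ | -0.003 | 0.029 | 0.057 |
| $E_{2,21}$ | -0.001 | 0.029 | 0.028 |
| $E_{2,22}$ | 0.022 | 0.022 | 0.66 |
| $E_{2,23}$ | -0.002 | 0.028 | 0.059 |
| $E_{2,31}$ | 0.005 | 0.024 | 0.165 |
| $E_{2,32}$ | 0.018 | 0.029 | 0.48 |
| $E_{2,33}$ | 0.016 | 0.021 | 0.554 |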

both examples fewer than 10% of the 10,000 evaluation pairs in $G \times G$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.
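The pairwise test described above can be sketched in a few lines. This is a minimal illustration, assuming the statistic $(\hat{E} - E)^2/\sigma^2$ and a $\chi^2$ reference distribution with 1 degree of freedom; the toy weights and standard deviations are invented, and the function names are hypothetical. For 1 degree of freedom the upper-tail probability has the closed form $P(X > s) = \operatorname{erfc}(\sqrt{s/2})$, so only the standard library is needed.

```python
import math


def chi2_df1_pvalue(stat: float) -> float:
    """Upper-tail probability of a chi-square(1) variable: erfc(sqrt(stat / 2))."""
    return math.erfc(math.sqrt(stat / 2.0))


def rejection_ratio(estimated, true, sigma, level=0.90):
    """Share of node pairs whose statistic (E_hat - E)^2 / sigma^2 rejects the null."""
    alpha = 1.0 - level
    rejected = 0
    for e_hat, e, s in zip(estimated, true, sigma):
        stat = (e_hat - e) ** 2 / s ** 2
        if chi2_df1_pvalue(stat) < alpha:
            rejected += 1
    return rejected / len(estimated)


# Toy inputs (illustrative only): four node pairs, one with a large estimation error.
true_w = [0.20, 0.50, 0.80, 0.35]
est_w = [0.21, 0.49, 0.90, 0.35]
sigma_w = [0.03, 0.03, 0.03, 0.03]  # in practice obtained by bootstrap
ratio = rejection_ratio(est_w, true_w, sigma_w)
print("rejection ratio:", ratio)
```

Under a correct null, roughly 10% of pairs would be expected to reject at the 90% level, which gives a benchmark for the reported rejection counts.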

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23): the first column presents the estimation bias, and the second and third columns are the empirical standard deviation and the empirical P-values of the estimates, from which we conclude that the fitting accuracy is high.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in the estimation accuracy as measured by the entry-wise bias between the true and the estimated edge weight, so we omit the plot of this result. As for the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that for a general network there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in


all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the well-known event “Unite the Right rally.” The “Unite the Right rally,” also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug. 11 and 12, 2017. For this event, we collect a time series of user-level information (covering Aug. 11 to Sep. 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message or comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, our collected data do not make it possible to track how every single message is spread from user to user; thus there is no way to directly identify the interaction network among users. It is possible, however, to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he or she has reacted at least once to the messages or comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug. 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
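The conversion from per-user reaction times to day-by-day distribution flows can be sketched as below. This is a minimal illustration, not the authors' pipeline: the user identifiers, the dates, and the assumption that only each user's first reaction date is needed are all hypothetical choices made for the example.

```python
from datetime import date


def distribution_flow(first_reaction, origin, n_days):
    """Return (users, flows); flows[t][i] is 1 once user i has reacted by day t."""
    users = sorted(first_reaction)  # fix an ordering of the N users
    flows = []
    for t in range(n_days):
        flows.append(
            [1 if (first_reaction[u] - origin).days <= t else 0 for u in users]
        )
    return users, flows


origin = date(2017, 8, 11)
# Hypothetical first-reaction dates for three users.
first_reaction = {
    "user_a": date(2017, 8, 11),
    "user_b": date(2017, 8, 13),
    "user_c": date(2017, 8, 12),
}
users, flows = distribution_flow(first_reaction, origin, n_days=3)
for t, vec in enumerate(flows):
    print(f"day {t}: {vec}")
```

Because a user who has joined never leaves, each coordinate is nondecreasing in $t$; the flow records who has joined by each day but not who passed the message to whom.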

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions about other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space with the coordinates representing the latitude and longitude of the account location and the number of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is averaged over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take a much smaller $M_1$ than the number of all users in the record (10000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to the choice of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then taken as the feature vectors. The feature vectors selected in this way for the $M_1$ nodes are distributed within the feature space asymptotically in the same way as the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
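The selection of $M_1$ representative nodes as K-means cluster centres can be sketched with plain Lloyd iterations. This is an illustrative sketch only: the uniform random 5-dimensional points stand in for the real user features, the function name `kmeans_centres` is hypothetical, and the random initialization and iteration cap are assumptions rather than details taken from the paper.

```python
import random


def kmeans_centres(points, m1, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns m1 cluster centres as lists of floats."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, m1)]  # initialize from the data
    for _ in range(n_iter):
        groups = [[] for _ in range(m1)]
        for p in points:
            # Assign each point to its nearest centre (squared Euclidean distance).
            j = min(
                range(m1),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            groups[j].append(p)
        new_centres = []
        for j, g in enumerate(groups):
            if g:
                new_centres.append([sum(col) / len(g) for col in zip(*g)])
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # assignments no longer change the centres
        centres = new_centres
    return centres


# Stand-in data: 300 users with 5 features each, drawn uniformly on [0, 1].
data_rng = random.Random(1)
features = [[data_rng.random() for _ in range(5)] for _ in range(300)]
M1 = 10
centres = kmeans_centres(features, M1)
print(len(centres), "synthetic nodes, each with", len(centres[0]), "features")
```

Each centre is a mean over several real users, so it need not coincide with any single user. With features on very different scales (for example, follower counts versus latitude), rescaling before clustering would likely matter; the text does not say whether this was done.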

The choice of the best block number is again based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we take a block network with 4 blocks as the final model for further analysis.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block (community) of the Twitter network; the number of followers, friends, and history posts is plotted along with the


Table 2: Mean features of 4 communities.

| Community | Followers | Friends | History posts | Lat | Lon |
|---|---|---|---|---|---|
| Big name community | 1474739 | 123835 | 149494 | 30.78 | -89.99 |
| Famous active community | 535641 | 25967 | 137372 | 34.18 | -117.59 |
| Famous inactive community | 500197 | 3519 | 102222 | 40.75 | -82.55 |
| Nobody community | 21658 | 3770 | 113593 | 46.77 | -122.46 |

Figure 2: BIC for different block numbers (vertical axis: BIC; block_dim = 3, 4, 5).

locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10000+ users. Because the clustering is performed on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information lost when the higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and of the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users’ locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and their posting history is not very active either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be “stars” in some fields (large follower group), but they are less likely to interact with others on Twitter and are therefore not active; (4) famous active community: users in this community also have many followers but, unlike the inactive community, their average number of friends and history posts is huge, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) for the number of followers and friends, the spatial distribution is highly uneven within every community, with only one or two synthetic users taking extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest number of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects the important characteristic of social media that every user has the same right to express his or her own opinion, whether or not the user is famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation can be better visualized in Figure 4); future studies are needed to better understand the source of this spatial segregation.

The link strength between different communities is presented in Table 3 (the “From” label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the “To” label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A clear hierarchical structure can be read from the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. Compared to the famous active community, however, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Users in the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their own insiders or from users in the nobody community; this observation


Figure 3: Spatial distribution of features of users within different communities (famous inactive, famous active, big name, nobody). (a) Spatial distribution of the number of followers within different communities. (b) Spatial distribution of the number of friends within different communities. (c) Spatial distribution of history posts within different communities.

Table 3: Link matrix of 4 communities.

| | From big name community | From famous active community | From famous inactive community | From nobody community |
|---|---|---|---|---|
| To big name community | 1 | 1 | 1 | 1 |
| To famous active community | 1 | 1 | 0.701 | 0.637 |
| To famous inactive community | 0.175 | 0.365 | 0 | 0 |
| To nobody community | 0 | 0 | 0 | 0.01 |


Figure 4: Estimate for interaction matrix (communities: famous inactive, famous active, big name, nobody; link colour indicates weight).

reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].
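The reading of the link matrix given above can be reproduced directly from the entries of Table 3. The sketch below is only an illustration of how to read the matrix (rows receive, columns send); the variable names and the use of a row sum as a crude summary of incoming influence are assumptions, not quantities defined in the paper.

```python
communities = ["big name", "famous active", "famous inactive", "nobody"]

# Rows: receiving ("To") community; columns: sending ("From") community (Table 3).
link = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.701, 0.637],
    [0.175, 0.365, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.01],
]


def sources(to_idx):
    """Communities with a nonzero estimated impact on the given receiving community."""
    return [communities[j] for j, w in enumerate(link[to_idx]) if w > 0]


incoming = {communities[i]: sources(i) for i in range(len(communities))}
total_in = {communities[i]: sum(link[i]) for i in range(len(communities))}

for name in communities:
    print(f"{name}: receives from {incoming[name]} (row sum {total_in[name]:.3f})")
```

The row sums decrease from the big name community down to the nobody community, which matches the hierarchical ordering described in the text.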

From the above analysis, quite a few interesting features can be drawn out of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining work on the tweet articles involved in the spreading process, add the extracted information as a covariate to the spreading process, and reestimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for spreading processes that cannot be repeatedly observed the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure; its practical usefulness is illustrated by a real data application in which the spreading process of tweet articles regarding the event “Unite the Right rally” is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy and computation is tricky, and further studies are needed. Second, high-frequency observation may not be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors’ Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang’Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the tensorflow deep neural network to classify mainland china visitor behaviours in hong kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of hurricane sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in atlanta and seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of Twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.

solving the maximum likelihood problem. But when $N$ is large, the computational complexity would be dominated by the matrix multiplication for the $N \times N$ adjacency matrix. Since the approximation technique in Roudi and Hertz [41] still requires the matrix multiplication, its speed-up effect for giant networks may not be that significant. More exploration of the fast reconstruction of giant spatial social networks is needed.

3. Model Setup

3.1. Feature Space Network. We consider a weighted multidimensional spatial social network, where nodes of the network are considered as elements in a $p$-dimensional Euclidean space $\mathbb{R}^p$ and every dimension of $\mathbb{R}^p$ is interpreted as a feature of nodes; thus $\mathbb{R}^p$ is interpretable as a feature space. Edges between nodes are assumed to depend on features of nodes in a smooth way, i.e., the edge set of the graph is equivalent to a smooth function (up to a certain order of derivatives) or an almost-everywhere smooth function (i.e., the function is smooth at all points except those contained in a zero-measure set), denoted as $E : \mathbb{R}^p \times \mathbb{R}^p \to [0, 1]$, where, without loss of generality, the edge weight between two nodes is restricted to the unit interval. Such a specification admits a stochastic-network interpretation of our model: the weight can be thought of as the probability that two nodes share an edge. Since the nodes of the network may not be evenly distributed within the entire space $\mathbb{R}^p$, without loss of generality, we assume the distribution of nodes is characterized by a probability measure $F$ on $\mathbb{R}^p$, and $F$ is supposed to be known from the data. In sum, the $p$-dimensional spatial network can be recorded as $G(\mathbb{R}^p, E, F)$, or shortly $G$ when there is no ambiguity regarding its node space, distribution, and edge function.

There are several advantages to assuming that the spreading process and distribution flows occur within $G$. First, the embedding of the node set into the feature space $\mathbb{R}^p$ allows us to characterize the feature information of nodes that is external to the network structure [21, 22, 27], which is usually as important as the network structure itself in determining the spreading process and distribution flows. Luo et al. [22] argue that including social-economic factors, such as the intensity of population gathering in a set of locations, can significantly increase the capacity to forecast the spreading of illness among residents. Viboud et al. [45] report similar findings. Second, allowing nodes to be unevenly distributed within the feature space lets us include more general networks in the analysis. For instance, by a proper choice of the measure $F$ (e.g., finitely supported), it is even possible to consider a network with only finitely many nodes sitting in the infinite feature space $\mathbb{R}^p$; this allows us to include most networks that we meet in practice. Finally, allowing the edge weight to depend smoothly on features of both the flow-in and flow-out nodes makes it possible to incorporate background information into the interaction mechanism; this is critical when the network itself is only a small component of a larger background system [27]. In addition, a by-product of treating edges as a smooth function is the induced computational efficiency. In fact, when a network consists of a giant number of nodes, even a simple summation operation can take a long time and huge memory, but when edges vary smoothly along with nodes, it becomes possible to do the calculation only on a small set of nodes, and the global features of edges can then be inferred from the result on the relatively small set by the kernel smoothing technique from nonparametric statistics [28, 29]. Based on these advantages, we will concentrate on the spatial network $G(\mathbb{R}^p, E, F)$ instead of a more general concept of network.

3.2. Mean-Field Models. To model spreading processes within a spatial network $G(\mathbb{R}^p, E, F)$, we follow the convention in the rumor spreading literature [10, 17] and adopt the common assumption that a rumor can spread from a node $x$ to another node $y$ if and only if (1) the initial node $x$ has been infected with the rumor, recorded as the event $I(x) = 1$; (2) there is an edge between them, or equivalently $E(x, y) > 0$; and (3) when conditions (1) and (2) hold, whether or not the spreading actually happens is purely random, up to a probability $r$. Different spreading models impose different requirements on the probability $r$. In the current study, we adopt the mean-field model to determine $r$, as suggested in most previous studies. Formally, for every fixed time $t$, the probability of node $x \in \mathbb{R}^p$ being infected is determined by the following mean-field equation:

$$\frac{dr(x, t)}{dt} = (1 - r(x, t)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y, t)\, dF(y). \tag{1}$$

The interpretation of (1) is that, at time $t$, the temporal variation rate of the probability that node $x$ is infected (represented as $dr(x, t)/dt$) is proportional to the probability that node $x$ has not yet been infected by time $t$ (represented as $1 - r(x, t)$), and the proportion is determined through a weighted sum of the probabilities of all other nodes in the network having been infected by $t$. The weight function describes the strength of connection between nodes $x$ and $y$ and thus can be formulated as the edge function $E$. Using the classical result of mean-field equations [46–50], it can be easily verified that the infection probability $r(x, t)$ in (1) is exactly equal to the probability of $I(x, t) = 1$ for a given right-continuous mean-field point process $I$ satisfying the following (with $\mathbb{E}$ denoting expectation, to distinguish it from the edge function $E$):

$$\mathbb{E}\bigl(I(x, t) - I(x, t-) \mid I(x, t-) = 0\bigr) = \int_{\mathbb{R}^p} E(x, y)\, I(y, t-)\, dF(y), \tag{2}$$

where $I(x, t-)$ is the left limit of the process $I(x, \cdot)$. The interpretation of (2) is more straightforward than that of (1): (2) points out that the average rate of node $x$ being infected is contributed by all those nodes that (1) have a connection to $x$ and (2) have been infected by the current time. These two conditions are often imposed in the literature.
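As a quick sanity check of (1), consider a special case that is an illustrative assumption and not a setting used in this study: a constant edge function $E(x, y) \equiv c \in (0, 1]$ and a uniform initial infection probability $r(x, 0) \equiv r_0 \in (0, 1)$. Because $F$ is a probability measure, an $x$-independent $r$ makes the integral in (1) equal to $c\, r(t)$, so (1) reduces to the logistic equation, whose solution increases monotonically to 1:

$$\frac{dr}{dt} = c\, r(t)\,\bigl(1 - r(t)\bigr), \qquad r(t) = \frac{r_0\, e^{ct}}{1 - r_0 + r_0\, e^{ct}} \longrightarrow 1 \quad (t \to \infty).$$

This closed form is useful as a reference when testing any numerical solver for (1).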

Let $r$ be a function satisfying the functional differential equation (1), and denote by $f$ the density or mass function associated with the probability measure $F$; then the event that a given node $x$ is observed at time $t$ and its infectious status is observed to be infected has the probability density

$$p_1(x, t) = f(x)\, r(x, t); \tag{3}$$

in contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given as

$$p_0(x, t) = f(x)\,(1 - r(x, t)). \tag{4}$$

Suppose that, given a time $t$, the infectious status of a set of randomly picked nodes $\mathcal{N} \subset \mathbb{R}^p$ is observable and represented as

$$\mathcal{O}_t = \{ I(x, t) : x \in \mathcal{N} \}, \tag{5}$$

with $I(x, t) = 0$ being not infected and $I(x, t) = 1$ being infected; then the likelihood function of the observations $\mathcal{O}_t$ can be written in the following way by using (3) and (4):

$$L(\mathcal{O}_t, E) = \prod_{x \in \mathcal{N}} \bigl(f(x)\, r(x, t)\bigr)^{I(x, t)} \bigl(f(x)\,(1 - r(x, t))\bigr)^{1 - I(x, t)}, \tag{6}$$

where we add the edge function $E$ into the likelihood because it affects $L$ through determining the functional form of $r$. Maximizing (6) yields the classical maximum likelihood (ML) estimator of $E$.

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing. In the study of spreading processes, only the distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $E$, are not observable and thus need to be estimated. In this section, we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $E$, given the observed distribution flows $\{\mathcal{O}_{t_i} : i = 1, \dots, T,\ t_1 < \cdots < t_T\}$ on a sequence of times. NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. NPSML applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly what we need to handle: because the distribution function $r$ in (6) is the solution to the functional differential equation (1), there is no clean analytic expression available for it.
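Once values of $r$ at the observed nodes are available (from any solver of (1)), the likelihood (6) is simple to evaluate. The following is a minimal sketch of its logarithm, which is numerically safer than the raw product. The function name `log_likelihood` and the clipping constant `eps` are illustrative assumptions, not part of the original method.

```python
import numpy as np

def log_likelihood(f_vals, r_vals, status, eps=1e-12):
    """Logarithm of (6) over the observed nodes.

    f_vals : density f(x) at each observed node
    r_vals : infection probability r(x, t) at each observed node
    status : observed I(x, t), 1 = infected, 0 = not infected
    eps    : clipping constant (an assumption) that keeps the logs finite
    """
    f_vals = np.asarray(f_vals, dtype=float)
    r_vals = np.clip(np.asarray(r_vals, dtype=float), eps, 1.0 - eps)
    status = np.asarray(status, dtype=float)
    # log(f r)^I + log(f (1 - r))^(1 - I) = log f + I log r + (1 - I) log(1 - r)
    terms = (np.log(f_vals)
             + status * np.log(r_vals)
             + (1.0 - status) * np.log1p(-r_vals))
    return float(np.sum(terms))
```

Since $f$ does not depend on $E$, the `np.log(f_vals)` term is a constant in the maximization and could be dropped when only the maximizer is of interest.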

However, our task is different from the situation discussed originally in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), by contrast, the likelihood depends on the nonparametric edge function $E$. In this situation, one extra kernel smoothing step is needed to approximate $E$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted on the level of random variables, while in our case simulation is on the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, due to the nonparametric model setup, the model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions, we always use $K_p$ to denote the $p$-dimensional standard Gaussian kernel function, and $K_{p,h}(x) = K_p(x/h)/h^p$ for some positive constant $h$).

Step 1. Select a constant $dt > 0$ and large positive integers $M_1$ and $M_2$ ($dt$ is the length of every time step used for numerically solving the functional differential equation (1); $M_1$ and $M_2$ are the numbers of random samples that will be drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function).

Step 2. Draw $M_1$ random samples $x_1, \dots, x_{M_1} \in \mathbb{R}^p$ from the distribution $F$ and $M_2$ random samples $w_1, \dots, w_{M_2} \in \mathbb{R}^p \times \mathbb{R}^p$ from the product measure $F \otimes F$.

Step 3. Given $e_1, \dots, e_{M_2} \in [0, 1]$, construct the function $E$ as follows:

$$E(w) = \frac{\sum_{i=1}^{M_2} K_{2p,h_1}(w - w_i) \cdot e_i}{\sum_{j=1}^{M_2} K_{2p,h_1}(w - w_j)}. \tag{7}$$

Step 4. Given $t_i$, let $\mathcal{O}_{t_i} = \{I(y_1, t_i), \dots, I(y_M, t_i)\}$ denote the observation set at time $t_i$, whose cardinality is $M$, and construct the function $r(\cdot, t_i)$ as follows:

$$r(y, t_i) = \frac{\sum_{l=1}^{M} K_{p,h_2}(y - y_l) \cdot I(y_l, t_i)}{\sum_{j=1}^{M} K_{p,h_2}(y - y_j)}. \tag{8}$$

Step 5. Solve the mean-field equation (1) over the interval $[t_i, t_{i+1})$ at the set of sample points $x_1, \dots, x_{M_1}$ drawn in Step 2 by Euler's method with time step $dt$, subject to the initial condition $r(\cdot, t_i)$, as follows:

$$r(x_j, t_i + (k + 1) \cdot dt) = r(x_j, t_i + k \cdot dt) + (1 - r(x_j, t_i + k \cdot dt)) \cdot \frac{dt}{M_1} \cdot \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t_i + k \cdot dt), \tag{9}$$

where $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$ and $\lfloor a \rfloor$ is the greatest integer not exceeding $a$.

Step 6. For the observation set $\mathcal{O}_{t_{i+1}} = \{I(y_1, t_{i+1}), \dots, I(y_{M'}, t_{i+1})\}$ at $t_{i+1}$ with cardinality $M'$, generate the simulated density at the sample nodes $y_l$, $l = 1, \dots, M'$, as follows:

$$r(y_l, t_{i+1}) = \frac{\sum_{j=1}^{M_1} K_{p,h_3}(y_l - x_j) \cdot r(x_j, t_{i+1})}{\sum_{j=1}^{M_1} K_{p,h_3}(y_l - x_j)}, \tag{10}$$

and construct the simulated likelihood function as follows:

$$\hat{L}(\mathcal{O}_{t_{i+1}}, e_1, \dots, e_{M_2}) = \prod_{l=1}^{M'} \bigl(f(y_l)\, r(y_l, t_{i+1})\bigr)^{I(y_l, t_{i+1})} \cdot \bigl(f(y_l)\,(1 - r(y_l, t_{i+1}))\bigr)^{1 - I(y_l, t_{i+1})}. \tag{11}$$
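Equations (7), (8), and (10) all apply the same kernel-weighted average with a Gaussian kernel, only with different centers, values, and bandwidths. A minimal sketch of that shared building block is given below; the function names `gaussian_kernel` and `kernel_smooth` are hypothetical, and no bandwidth selection rule is implied.

```python
import numpy as np

def gaussian_kernel(u, h):
    """K_{p,h}(u) = K_p(u / h) / h**p with K_p the standard Gaussian density.

    u has shape (..., p); the kernel is evaluated along the last axis.
    """
    u = np.asarray(u, dtype=float)
    p = u.shape[-1]
    z = u / h
    return np.exp(-0.5 * np.sum(z * z, axis=-1)) / ((2.0 * np.pi) ** (p / 2.0) * h ** p)

def kernel_smooth(query, centers, values, h):
    """Kernel-weighted average sum_i K(q - c_i) v_i / sum_j K(q - c_j) at each query point.

    query   : (n_query, p) points where the smoothed function is evaluated
    centers : (n_centers, p) sample points (w_i in (7), y_l in (8), x_j in (10))
    values  : (n_centers,) values attached to the centers (e_i, I(y_l, t_i), r(x_j, t_{i+1}))
    Note: if a query point is far from every center relative to h, all weights
    can underflow to zero and the ratio becomes undefined.
    """
    query = np.asarray(query, dtype=float)
    centers = np.asarray(centers, dtype=float)
    values = np.asarray(values, dtype=float)
    diffs = query[:, None, :] - centers[None, :, :]   # (n_query, n_centers, p)
    weights = gaussian_kernel(diffs, h)               # (n_query, n_centers)
    return weights @ values / weights.sum(axis=1)
```

For (7) the points live in $\mathbb{R}^{2p}$ (each $w_i$ is a concatenated pair of nodes), while for (8) and (10) they live in $\mathbb{R}^p$; the same function covers both because the dimension is read from the input.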

The full-information likelihood function for all observation times can be constructed from (11) in the following way:

$$\hat{L}^{*}(\mathcal{O}_{t_i},\ i = 1, \dots, T;\ e_1, \dots, e_{M_2}) = \prod_{i=1}^{T} \hat{L}(\mathcal{O}_{t_i}, e_1, \dots, e_{M_2}). \tag{12}$$

The estimator of the unknown edge function $E$ can be derived by maximizing the simulated full-information likelihood function (12) over $e_1, \dots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e_1^{*}, \dots, e_{M_2}^{*}$ in the way of (7).
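The core numerical work inside each likelihood evaluation is the Euler iteration (9) of Step 5. A minimal sketch follows, assuming the edge weights $E(x_j, x_l)$ have already been evaluated on the $M_1$ sampled nodes and stored in a matrix; the function name `euler_mean_field` and its arguments are illustrative assumptions.

```python
import numpy as np

def euler_mean_field(E_mat, r0, dt, n_steps):
    """Iterate (9): r <- r + (1 - r) * (dt / M1) * (E_mat @ r).

    E_mat   : (M1, M1) array with E_mat[j, l] = E(x_j, x_l)
    r0      : (M1,) initial infection probabilities r(x_j, t_i)
    dt      : Euler time step
    n_steps : number of steps, e.g. floor((t_{i+1} - t_i) / dt)
    """
    r = np.asarray(r0, dtype=float).copy()
    m1 = r.shape[0]
    for _ in range(n_steps):
        # The matrix-vector product is the Monte Carlo version of the integral in (1).
        r = r + (1.0 - r) * (dt / m1) * (E_mat @ r)
    return r
```

Each step costs one $M_1 \times M_1$ matrix-vector product. With a constant edge matrix and a uniform initial condition, the output can be compared against the logistic solution of (1) as a correctness check.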

Compared to NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $r$, one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the $M_1$ samples drawn in Step 2. However, when the system has a giant node set (say, millions), a sample size $M_1 \ll M$ can be applied in order to improve the computational efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time $t$, it will be regarded as safe and missing from the consecutive tracking in the next few observation time points. In this interval-censoring situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimensionality” despite its nonparametric essence, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument, and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated evaluation of the multiplication between an $M_1 \times M_1$ matrix and an $M_1$-dimensional vector; the computational complexity is of the order $M_1^2$. Although $M_1$ can be taken much smaller than the number of nodes in the observations ($M$), it still has to increase as $M$ increases. So when $M$ is a giant number, $M_1$ has to be large as well, and the computational complexity of the entire estimation procedure will be dominated by $M_1^2$. In this section, we propose a fast algorithm that reduces the computational complexity in (9) to be linearly dependent on $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbors. In our setting, there is no strict “neighbor” defined, but it is still possible to randomly pick one agent from the entire population, and the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 in the previous section is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \dots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \dots, x_{M_1}\}$.

Step 5(2). Compute

$$r(x_j, t + dt) = r(x_j, t) + (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\, dt. \tag{13}$$

Step 5(3). Repeat the above two steps for all $t = t_k$, $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$, and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \dots, x_{M_1}$) is replaced with a scalar multiple, so the resulting computational complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
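Substeps 5(1) and 5(2) can be sketched as a single vectorized update. In this sketch the edge function is passed as a callable so that only the $M_1$ sampled pairs are ever evaluated; the names `fast_step` and `edge_fn` are hypothetical, and drawing partners uniformly with replacement from the whole sample (so a node may draw itself) is an assumption that mirrors the sum over all $l$ in (9).

```python
import numpy as np

def fast_step(edge_fn, nodes, r, dt, rng):
    """One update of (13) for all sampled nodes at once.

    edge_fn : callable taking two (m, p) arrays and returning the (m,) array
              of weights E(a_k, b_k) for matching rows (hypothetical interface)
    nodes   : (M1, p) array of sampled nodes x_1, ..., x_{M1}
    r       : (M1,) current infection probabilities r(x_j, t)
    dt      : time step
    rng     : numpy.random.Generator used to draw the partners
    """
    m1 = nodes.shape[0]
    partners = rng.integers(0, m1, size=m1)          # l(j, t) for every j
    pair_weights = edge_fn(nodes, nodes[partners])   # E(x_j, x_{l(j,t)})
    # Only M1 pairwise weights are used, so the cost is linear in M1.
    return r + (1.0 - r) * pair_weights * r[partners] * dt
```

Averaged over the random partner, the increment equals the increment in (9), which is the content of property (i) in the accuracy argument for the fast algorithm.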

For the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i\,(t_{i+1} - t_i)$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following:

(i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13);

(ii) denote by $\Delta(j, t)$ the increment $\Delta(j, t) = (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\operatorname{cov}(\Delta(j, t), \Delta(j', t') \mid r(x_j, t_i)) \le t_{i+1} - t_i$.

Property (i) and the case $j' \neq j$ in (ii) are quite trivial. For $j' = j$ and $t_i < t < t' < t_{i+1}$, $\operatorname{cov}(\Delta(j, t), \Delta(j, t') \mid r(x_j, t_i))$ can be decomposed as the sum of the following two components (with $\mathbb{E}$ denoting expectation and $E$ the edge function):

$$
\begin{aligned}
A &= \operatorname{cov}\bigl(\Delta(j, t),\ (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \operatorname{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) = \operatorname{var}\bigl(r(x_j, t) - r(x_j, t_i) \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{dr(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\[4pt]
B &= \operatorname{cov}\bigl(\Delta(j, t),\ (r(x_j, t') - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \operatorname{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \\
&\le \mathbb{E}\bigl(\lvert r(x_j, t') - r(x_j, t) \rvert \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{dr(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned}
\tag{14}
$$

where $\|\cdot\|_{\infty}$ is the $L^{\infty}$ norm of a bounded function. The above inequalities follow straightforwardly from the fact that $r$ is bounded by 1 and its temporal derivative, given by (1), is also uniformly bounded by 1; statement (ii) then follows immediately.

Using properties (i) and (ii) and the law of large numbers, it is straightforward that the difference between the likelihood function constructed from (9) and that constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two types of calculation of the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, with the fast algorithm, the estimation accuracy is independent of the choice of $dt$, so in practice $dt$ can be selected directly as $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. In many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; then the edge function can be decomposed into two components:

(i) the group weight function $E^1 : \mathbb{R}^p \to [0, 1]^Q$;

(ii) the group-level edge weight $E^2$, which is a $Q \times Q$ matrix with each entry valued in $[0, 1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

$$E(x, y) = E^1(x)^{\top} E^2 E^1(y), \tag{15}$$

where the image of $E^1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ represents the vector transpose. The group weight function is required to satisfy that, for every $x$ with $E^1(x) = (s_1, \dots, s_Q)$, there exists only one $i \in \{1, \dots, Q\}$ with $s_i > 0$, which means every node can only have positive probability of belonging to at most one group; this guarantees the requirement that groups constitute a partition of the node set.
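The decomposition (15) is a bilinear form, so edge weights for a whole set of nodes follow from two matrix products. A minimal sketch, assuming the group weights $E^1(x)$ of the nodes are stacked row-wise into an array (the names `block_edge`, `block_edge_matrix`, `e1_all`, and `e2` are illustrative):

```python
import numpy as np

def block_edge(e1_x, e2, e1_y):
    """E(x, y) = E1(x)^T E2 E1(y) as in (15) for a single pair of nodes.

    e1_x, e1_y : length-Q group weight vectors of the two nodes
    e2         : (Q, Q) group-level edge weight matrix
    """
    return float(np.asarray(e1_x) @ np.asarray(e2) @ np.asarray(e1_y))

def block_edge_matrix(e1_all, e2):
    """All pairwise weights for M nodes at once.

    e1_all : (M, Q) array whose rows are the group weight vectors E1(x_j)
    Returns the (M, M) array with entry [j, l] = E(x_j, x_l).
    """
    e1_all = np.asarray(e1_all, dtype=float)
    return e1_all @ np.asarray(e2, dtype=float) @ e1_all.T
```

Because each row of `e1_all` has a single positive entry under the partition requirement, the product only picks one entry of `e2` per pair of nodes, scaled by the two group weights.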

The estimation of a block network is equivalent to the estimation of (1) the group weight function $E^1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E^2$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable to this case. The only modification is in Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$. Instead, it is applied to generate the estimate of the group weight $E^1$. Then the hidden weight function $E$ is constructed from the kernel-smoothed $E^1$ and the given interaction matrix $E^2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve for is only $M_1 Q + Q^2$, while the number is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with the number $M_1^2$ (although slowly); when the number of nodes in the observations is giant, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly improve the computational efficiency.

In addition, a block network is much easier to identify than general fully nonparametric networks, which will be discussed in the next section. Finally, under a block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is quite trivial and hence omitted).

Proposition 1. Denote by $E^1_i(x)$ the projection of the vector $E^1(x)$ to its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E^1_i(x) > 0\}$, which consists of the set of nodes belonging to group $i$; then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

$$r(x) = \begin{cases} 0, & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i\, (E^2)^n\, r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\ 1, & \text{else}, \end{cases} \tag{16}$$

where $r(y, t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector to its $i$th dimension, and $(E^2)^n$ denotes the $n$th power of the matrix $E^2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distributions with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E^2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$\begin{pmatrix} x & x & x & \cdots & x \\ 0 & x & x & \ddots & \vdots \\ 0 & \cdots & x & \cdots & x \\ \vdots & \ddots & 0 & x & x \\ 0 & \cdots & 0 & 0 & x \end{pmatrix}, \tag{17}$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$r(x) = 1 \ \text{iff}\ x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i \iff r(x, t_0) > 0 \ \text{iff}\ x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i. \tag{18}$$
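The matrix-power condition in (16) can be checked numerically. The sketch below rests on one reading of Proposition 1, stated here as an assumption: the initial distribution enters through a group-level vector `v0` whose $q$th entry is positive exactly when group $q$ carries some initial infection, and a group $i$ ends at $r = 1$ when $\mathcal{P}_i (E^2)^n v_0 > 0$ for some $n > 0$. Since all entries are nonnegative, only the zero pattern of $E^2$ matters, and powers $n = 1, \dots, Q$ suffice. The function name `groups_reaching_one` is hypothetical.

```python
import numpy as np

def groups_reaching_one(e2, v0):
    """Boolean mask of groups i with (E2^n v0)_i > 0 for some n in 1..Q.

    e2 : (Q, Q) nonnegative group-level edge weight matrix
    v0 : (Q,) nonnegative vector, positive where a group is initially infected
         (a group-level summary of r(., t0); this reading is an assumption)
    """
    support = (np.asarray(e2, dtype=float) > 0).astype(float)  # zero pattern of E2
    vec = (np.asarray(v0, dtype=float) > 0).astype(float)
    q = support.shape[0]
    reached = np.zeros(q, dtype=bool)
    for _ in range(q):
        # Zero pattern of E2^n v0, computed on 0/1 values to avoid underflow.
        vec = (support @ vec > 0).astype(float)
        reached |= vec > 0
    return reached
```

For example, with two groups that interact with each other and a third isolated group, an initial infection in the first group marks the first two groups and leaves the third unmarked.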

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E^1, E^2)$ are identifiable. However, in real applications, a spreading process can be observed at most a few times; it is not expected that $m$ can be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable, i.e., there exists $E \neq E'$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set, i.e., the set $\mathcal{S}_{E_0, r(\cdot, t_0)} = \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot, t_0)}$ varies along with the initial infectious status $r(\cdot, t_0)$. Formally, we have that $E \in \mathcal{S}_{E_0, r(\cdot, t_0)}$ if and only if the following holds for all $n = 1, 2, \dots$:

$$\bigl(\mathcal{M}_{1 - r(\cdot, t_0)}\, \mathcal{K}_E\bigr)^n\, r(\cdot, t_0) \equiv \bigl(\mathcal{M}_{1 - r(\cdot, t_0)}\, \mathcal{K}_{E_0}\bigr)^n\, r(\cdot, t_0), \tag{19}$$

where K119864 is a bounded operator over the functionalspace 1198712(R119901 defined through 119864 as (K119864119892)(119909) fl int

R119901119864(119909119910)119892(119910)119889119865(119910) for every 119892 isin 1198712(R119901) with 119865 being the

default node distribution M119891 is the multiplicative operatordetermined by 119891 such that (M119891119892)(119909) = 119891(119909) sdot 119892(119909) the 119899thpower in (19) represents the self-composition of an operatorfor 119899 times (19) implies that the identifiability of the true edgefunction 1198640 is limited by the extent of the ergodicity of thespreading process within the node space R119901 For instancewhen there exists a small open set 119880 sub R119901 such that allnodes 119909 isin 119880 are infected before the initial time 1199050 ie119903(119909 1199050) equiv 1 for all 119909 isin 119880 then it can be verified by (19)

that all functions 119864 that deviate from 1198640 only within the bandset 119880 times R119901 are contained in S1198640 On the other hand if thereexists open 1198801015840 sub R119901 such that (M1minus119903(1199050)K1198640)119899119903(119909 1199050) equiv 0for all 119909 isin 1198801015840 and all 119899 then all functions 119864 that deviatefrom 1198640 only within 1198801015840 times 1198801015840 are contained in S1198640119903(1199050) Inboth of the two cases nodes in 119880 or 1198801015840 are not in the ergodicrange of the spreading process hence the transmission oftheir infectious status is not observable For nodes in119880 theirinfections occur ahead of the observation period hence notobservable after the start of spreading while for nodes in 1198801015840it can be verified that they will never be infected over theentire spreading processTherefore the identifiability of 1198640 isrestricted by the experience of the spreading process whichis reasonable
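The first degenerate case can be checked numerically on a discretized node set. The following is a minimal sketch, assuming the integral operator is approximated by an average over sampled nodes; the function names, the node count, and the edge weights are hypothetical and are not taken from the paper. It perturbs the edge weights only on rows belonging to nodes that are already infected at the initial time and confirms that the iterates compared in (19) do not change.

```python
import random

def apply_operator(E, r0, g):
    """Discretized (M_{1-r0} K_E g)(x_i) = (1 - r0[i]) * mean_j E[i][j] * g[j]."""
    n = len(g)
    return [(1.0 - r0[i]) * sum(E[i][j] * g[j] for j in range(n)) / n
            for i in range(n)]

def iterates(E, r0, n_max):
    """Return [(M_{1-r0} K_E)^n r0 for n = 1..n_max], the quantities in (19)."""
    out, g = [], list(r0)
    for _ in range(n_max):
        g = apply_operator(E, r0, g)
        out.append(g)
    return out

rng = random.Random(0)
N = 8
# Hypothetical "true" edge weights on N sampled nodes.
E0 = [[rng.random() for _ in range(N)] for _ in range(N)]
# Initial infection status; nodes 0 and 1 play the role of the set U (r = 1).
r0 = [1.0, 1.0, 0.0, 0.3, 0.0, 0.6, 0.0, 0.0]

# Perturb E0 only on the rows of U, i.e. on the band U x R^p.
E_alt = [row[:] for row in E0]
for i in (0, 1):
    E_alt[i] = [rng.random() for _ in range(N)]

same = all(
    abs(a - b) < 1e-12
    for ga, gb in zip(iterates(E0, r0, 5), iterates(E_alt, r0, 5))
    for a, b in zip(ga, gb)
)
print(same)
```

The rows of already-infected nodes are multiplied by $1 - r(x, t_0) = 0$, which is why the perturbation is invisible; a perturbation on a row with $r(x, t_0) < 1$ would generally change the first iterate.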

It is still an open question what conditions on $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. In the special case of block networks, however, a simple identifiability condition can be worked out. For block networks, $(E_1^0, E_2^0)$ is identifiable if and only if there does not exist a pair $(E_1, E_2)$ that differs from the true $(E_1^0, E_2^0)$ but leads to the same likelihood function (6) in the limit case, which in turn holds if and only if, for the true $E_2^0$, the vector space spanned by the family of vectors $\{V_t : t \geq t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{V_t : t \geq t_0\}$ has full rank. Here $Q$ is the number of blocks, $V_t = (V_{t1}, \ldots, V_{tQ})^\top$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \ldots, Q$, $V_{tq} = \int_{\mathbb{R}^p} E^0_{1q}(x) r(x, t)\, dF(x)$, where $E^0_{1q}$ is the $q$th entry of $E_1^0(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$\det\left[ V_{t_0},\; \operatorname{diag}(c - V_{t_0}) E_2^0 V_{t_0},\; \ldots,\; \left(\operatorname{diag}(c - V_{t_0}) E_2^0\right)^{Q-1} V_{t_0} \right] \neq 0, \tag{20}$$

where $c$ is another $Q$-dimensional column vector $(c_1, \ldots, c_Q)^\top$ determined by the true $E_1^0$ function such that $c_q = \int_{\mathbb{R}^p} E^0_{1q}(x)\, dF(x)$ for $q = 1, \ldots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with the given vector on its diagonal. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically", in the sense that the set of $E_2$ for which the determinant in (20) is identically 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0, 1]^{Q \times Q}$, and for those $E_2$ for which the determinant is not identically 0, the set of $V_{t_0}$ for which it vanishes is contained in a $(Q - 1)$-dimensional surface within $[0, 1]^Q$. Therefore, (20) holds for almost all $E_2$ and $V_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
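The determinant condition can be evaluated directly once the two vectors and the interaction matrix are given. The sketch below is only an illustration under assumed inputs: the helper names and the numerical values of the interaction matrix, of $c$, and of the initial vector are hypothetical. It builds the matrix whose columns are the initial vector and its successive images under $\operatorname{diag}(c - V_{t_0}) E_2$, and returns its determinant.

```python
def mat_vec(A, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, sign, result = len(M), 1.0, 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            sign = -sign
        result *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n):
                M[r][k] -= factor * M[col][k]
    return sign * result

def identifiability_det(E2, c, v):
    """det[v, A v, ..., A^(Q-1) v] with A = diag(c - v) E2."""
    Q = len(v)
    A = [[(c[i] - v[i]) * E2[i][j] for j in range(Q)] for i in range(Q)]
    cols, col = [], list(v)
    for _ in range(Q):
        cols.append(col)
        col = mat_vec(A, col)
    # cols stores the columns; transpose to rows before taking the determinant.
    return det([list(row) for row in zip(*cols)])

# Illustrative (hypothetical) inputs for Q = 3 blocks.
E2 = [[0.0, 1.0, 0.5], [0.8, 0.0, 0.3], [0.01, 0.0, 0.0]]
c = [0.3, 0.4, 0.3]      # block masses, c_q >= v_q
v = [0.1, 0.05, 0.15]    # initially infected mass per block
d = identifiability_det(E2, c, v)
print(d != 0.0)
```

A degenerate input such as an all-zero interaction matrix makes every column after the first vanish, so the determinant is zero and the condition fails, which matches the intuition that nothing can be learned when no block influences any other.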

The "almost" identifiability for block networks guarantees that, in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimates of $E_1$ and $E_2$ from the NPSML converge asymptotically to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of maximum likelihood estimators. The theoretical validity of the estimators developed in the previous sections is thus established.

Remark 2 (sparsity). In general, complete identifiability is hard to achieve for both the general network and the block network. However, following the idea in the network reconstruction literature (Shen et al. [40]), we can concentrate on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E_2\|_2^2 = \sum_{i,j} (E_{2,ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique: as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space, and by compactness and convexity there always exists a unique $E$ (or $E_2$) whose $L^2$ distance to the origin is minimal.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes drawn uniformly at random from the unit cube $[0, 1)^2$. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ decreases linearly with the standard Euclidean distance between two nodes, i.e.,

$$E(x, y) = 1 - \frac{\sqrt{\langle x - y, x - y \rangle}}{2}. \tag{21}$$

Example 2 (block network). Set $Q = 3$. The block membership function $E_1$ satisfies

$$E_1(x, y) = \begin{cases} (1, 0, 0) & \text{if } \frac{x + y}{2} < \frac{1}{3}, \\ (0, 1, 0) & \text{if } \frac{1}{3} \leq \frac{x + y}{2} < \frac{2}{3}, \\ (0, 0, 1) & \text{else.} \end{cases} \tag{22}$$

The matrix $E_2$ is given as follows:

$$E_2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23}$$

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, with the infected nodes picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows), while a coarse time step ($dt = 0.1$) is used in the estimation procedure (9) in order to test robustness. The process is followed until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
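The simulation design described above can be sketched in a few lines. Equation (2) is not reproduced in this section, so the sketch assumes the mean-field dynamics take the form $\partial r / \partial t = (1 - r)\, \mathcal{K}_E r$ suggested by the operator appearing in Section 3.6; the function names, the distance-decaying edge weight, and the reduced node count are hypothetical choices for illustration. The sketch integrates the mean-field state only and does not sample binary infection statuses.

```python
import math
import random

def simulate_flows(nodes, edge_fn, r0, dt=0.01, horizon=5.0):
    """Forward-Euler integration of dr_i/dt = (1 - r_i) * mean_j E(x_i, x_j) r_j.

    Returns the state at t = 0, 1, ..., horizon (end-of-day snapshots).
    """
    n = len(nodes)
    E = [[edge_fn(x, y) for y in nodes] for x in nodes]
    r = list(r0)
    steps_per_day = round(1.0 / dt)
    snapshots = [list(r)]
    for step in range(1, round(horizon / dt) + 1):
        force = [sum(E[i][j] * r[j] for j in range(n)) / n for i in range(n)]
        r = [r[i] + dt * (1.0 - r[i]) * force[i] for i in range(n)]
        if step % steps_per_day == 0:
            snapshots.append(list(r))
    return snapshots

def edge_fn(x, y):
    # Assumed distance-decaying edge weight on the unit square (illustrative).
    return 1.0 - math.dist(x, y) / 2.0

rng = random.Random(1)
n_nodes = 50                                   # fewer than 200 to keep the run short
nodes = [(rng.random(), rng.random()) for _ in range(n_nodes)]
infected = set(rng.sample(range(n_nodes), 15)) # 30% of nodes infected at t = 0
r0 = [1.0 if i in infected else 0.0 for i in range(n_nodes)]
flows = simulate_flows(nodes, edge_fn, r0)
print(len(flows))  # one snapshot per observation time t = 0, 1, ..., 5
```

Re-running `simulate_flows` with `dt=0.1` gives the coarse discretization used on the estimation side, so the two grids can be compared directly at the six observation times.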

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initially infected subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once, in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

If all nodes are included in the computation of the NPSML estimator, there is in principle a 40,000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computational load, we generate another $M_1 = 20$ nodes from the uniform distribution, which are used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is used only in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space, so $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$, so the data dimension $k$ is always 2. The sample size for $h_2$ is 200, because it is used to turn the observed $r$ on the 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20, because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.

[Figure 1: Fitting accuracy for networks in Examples 1 and 2. Panels: (a) fitting accuracy for fully nonparametric network; (b) fitting accuracy for block network. Each panel plots true edge weight against estimated edge weight ("Est vs True" points with the reference line y = x).]
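The rule of thumb for the kernel widths is easy to reproduce. The short sketch below uses a hypothetical helper name and simply evaluates $h = n^{-1/(k+1)}$ for the three sample-size and dimension pairs stated in the text, together with the two rate quantities $n h^k$ and $n h^{k+2}$ that the consistency conditions refer to.

```python
def rule_of_thumb_bandwidth(n, k):
    """Kernel width h = n^(-1/(k+1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))

def rate_terms(n, k):
    """Return (n * h^k, n * h^(k+2)); the first should grow and the second shrink with n."""
    h = rule_of_thumb_bandwidth(n, k)
    return n * h ** k, n * h ** (k + 2)

h1 = rule_of_thumb_bandwidth(400, 4)  # edge function: 400 node pairs, k = 4
h2 = rule_of_thumb_bandwidth(200, 2)  # observed r on 200 nodes, k = 2
h3 = rule_of_thumb_bandwidth(20, 2)   # estimated r on 20 sampled nodes, k = 2
print(round(h1, 3), round(h2, 3), round(h3, 3))
```

With this choice, $n h^k = n^{1/(k+1)}$ and $n h^{k+2} = n^{-1/(k+1)}$, so both rate conditions hold as $n$ grows.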

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and compute the BIC only on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$ and $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
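The selection of the block number by maximal BIC can be sketched as follows. This is an illustration under stated assumptions: the criterion is written in the "larger is better" form (log-likelihood minus half the parameter count times the log of the number of observations), the parameter count follows the $M_1 \times Q + Q \times Q$ pattern used for the block network, and the log-likelihood values and the observation count are made-up numbers rather than results from the paper.

```python
import math

def bic(loglik, n_params, n_obs):
    """Schwarz-type criterion in 'larger is better' form."""
    return loglik - 0.5 * n_params * math.log(n_obs)

def select_block_number(logliks, m1, n_obs):
    """Pick the Q maximizing BIC; a block model on m1 nodes has m1*Q + Q*Q parameters."""
    scores = {q: bic(ll, m1 * q + q * q, n_obs) for q, ll in logliks.items()}
    return max(scores, key=scores.get), scores

# Hypothetical maximized log-likelihoods for Q = 1..5 (made-up numbers).
logliks = {1: -900.0, 2: -700.0, 3: -450.0, 4: -440.0, 5: -435.0}
best_q, scores = select_block_number(logliks, m1=20, n_obs=1200)
print(best_q)
```

In this toy input the likelihood gain from $Q = 3$ to $Q = 4$ is too small to offset the 27 extra parameters, so the criterion stops at three blocks.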

In Figure 1, we plot the difference between the true edge function and the NPSML-estimated edge function on the set $\mathcal{G} \times \mathcal{G}$ of node pairs for both examples, where the horizontal axis represents the estimated edge weight on every node pair and the vertical axis represents the true weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; a red dot closer to the blue line therefore means better fitting accuracy. For most node pairs the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in \mathcal{G} \times \mathcal{G}$ with the null hypothesis $E_{xy} := (\hat{E}(x, y) - E(x, y))^2 = 0$. By the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples fewer than 10% of all 10,000 evaluation pairs in $\mathcal{G} \times \mathcal{G}$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

Table 1: Estimation accuracy of $E_2$.

Entries       Bias      Std      P value
E_{2,11}      0.021     0.032    0.468
E_{2,12}     -0.006     0.012    0.383
E_{2,13}     -0.003     0.029    0.057
E_{2,21}     -0.001     0.029    0.028
E_{2,22}      0.022     0.022    0.66
E_{2,23}     -0.002     0.028    0.059
E_{2,31}      0.005     0.024    0.165
E_{2,32}      0.018     0.029    0.48
E_{2,33}      0.016     0.021    0.554
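The pairwise test can be written compactly, since the upper tail of a chi-square variable with 1 degree of freedom has the closed form $P(X > s) = \operatorname{erfc}(\sqrt{s/2})$. The sketch below is a hedged illustration: the function names are hypothetical, the variances stand in for bootstrap estimates, and the three input triples are invented for demonstration only.

```python
import math

def chi2_1_pvalue(stat):
    """Upper-tail probability P(X > stat) of a chi-square(1) variable."""
    return math.erfc(math.sqrt(stat / 2.0))

def rejection_rate(est, true, boot_var, level=0.90):
    """Share of entries whose statistic (est - true)^2 / var rejects the null."""
    rejected = 0
    for e, t, v in zip(est, true, boot_var):
        stat = (e - t) ** 2 / v
        if chi2_1_pvalue(stat) < 1.0 - level:
            rejected += 1
    return rejected / len(est)

# Invented estimates, true weights, and bootstrap variances for three node pairs.
est = [0.50, 0.82, 0.31]
true = [0.50, 0.80, 0.30]
boot_var = [0.0009, 0.0009, 0.00001]
rate = rejection_rate(est, true, boot_var)
print(rate)
```

Only the third pair is rejected here: its squared error is small in absolute terms but large relative to its very small variance.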

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23): the first column presents the estimation bias, and the second and third columns are the empirical standard deviation and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is satisfactory.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. The rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that the general network has many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation in estimation accuracy is not significant in any aspect; this agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweets regarding the "Unite the Right rally". The "Unite the Right rally", also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are collected.

As with most rumor spreading data, our collected data do not make it possible to track how every single message spreads from user to user; thus there is no way to directly identify the interaction network among users. It is possible, however, to generate the distribution flows of users who have joined the spreading process. Formally, we define that a user has joined the process at time point $t$ if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users on record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
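The conversion from reaction times to day-by-day distribution flows can be sketched as below. The function name and the four example dates are hypothetical; the only inputs assumed are each user's first reaction date and the origin date of Aug 11, 2017 given in the text.

```python
from datetime import date

def distribution_flows(first_reaction, origin, n_days):
    """Build day-by-day {0,1}-valued vectors.

    Entry i of the vector for day t is 1 iff user i first reacted on or
    before origin + t days.
    """
    offsets = [(d - origin).days for d in first_reaction]
    return [[1 if off <= t else 0 for off in offsets] for t in range(n_days + 1)]

origin = date(2017, 8, 11)
# Hypothetical first-reaction dates of four users.
first_reaction = [
    date(2017, 8, 11),
    date(2017, 8, 13),
    date(2017, 8, 12),
    date(2017, 9, 4),
]
flows = distribution_flows(first_reaction, origin, n_days=24)
print(flows[0], flows[2], flows[24])
```

Because a user who has reacted once stays marked as joined, every coordinate is nondecreasing in $t$, which is the property that makes the sequence a distribution flow rather than a set of trajectories.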

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions about other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and to apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is the average over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take an $M_1$ much smaller than the number of all users on record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to the choice of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. Feature vectors selected in this way are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
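The selection of representative nodes by K-means can be illustrated with a plain implementation of Lloyd's algorithm. The text does not say which K-means implementation was used or whether the five features were rescaled before clustering, so the sketch below is an assumption on both counts: it runs on synthetic 5-dimensional points in the unit cube, and all names are hypothetical.

```python
import random

def kmeans_centres(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centres as tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])),
            )
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group
        # (an empty group keeps its previous centre).
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centres[j]
            for j, g in enumerate(groups)
        ]
        if updated == centres:
            break
        centres = updated
    return centres

rng = random.Random(42)
# Synthetic stand-ins for (lat, lon, friends, followers, history posts).
users = [tuple(rng.random() for _ in range(5)) for _ in range(500)]
centres = kmeans_centres(users, k=10)
print(len(centres))
```

Each centre is an average of real feature vectors rather than a real user, which is why the representative nodes obtained this way are synthetic.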

The choice of the best block number is again based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic, in the sense that their attributes are described by the centre vectors of 100 clusters obtained by applying K-means clustering to the full set of 10,000+ users. Because the clustering is done in a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information loss when the higher-dimensional cluster is projected onto a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of information spreading processes on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat      Lon
Big name community           1474739     123835    149494          30.78    -89.99
Famous active community      535641      25967     137372          34.18    -117.59
Famous inactive community    500197      3519      102222          40.75    -82.55
Nobody community             21658       3770      113593          46.77    -122.46

[Figure 2: BIC for different block numbers. Vertical axis: BIC; horizontal axis: block_dim = 3, 4, 5.]

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and they are not very active in terms of history posts either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community: users in this community also have many followers, but, unlike the inactive community, their average numbers of friends and history posts are large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the spatial distribution of the numbers of followers and friends is highly uneven within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects the important characteristic of social media that every user has the same right to express their own opinion, whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern among the four communities (the segregation is better visualized in Figure 4); future studies are needed to better understand the source of this spatial segregation.

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A significant hierarchical structure can be seen in the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. Compared to the famous active community, however, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read tweets posted by members of the big name and famous active communities and receive nothing from their insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

[Figure 3: Spatial distribution of features of users within different communities. Panels: (a) spatial distribution of follower numbers within different communities; (b) spatial distribution of friend numbers within different communities; (c) spatial distribution of history posts within different communities. Map legends distinguish the four communities (famous inactive, famous active, big name, nobody) and group each feature into five size classes.]

Table 3: Link matrix of 4 communities.

                              From big name   From famous active   From famous inactive   From nobody
                              community       community            community              community
To big name community         1               1                    1                      1
To famous active community    1               1                    0.701                  0.637
To famous inactive community  0.175           0.365                0                      0
To nobody community           0               0                    0                      0.01

[Figure 4: Estimate for interaction matrix. Map legend: communities (famous inactive, famous active, big name, nobody) and link weight classes 0.01 - 0.02, 0.02 - 0.17, 0.17 - 0.36, 0.36 - 0.70, 0.70 - 1.00.]

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to do more text mining on the tweets involved in the spreading process, add the extracted information as covariates to the spreading process, and re-estimate the hidden block network. Doing so requires a semiparametric extension of the network estimators in this paper; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. The approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable; for spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, in which the spreading process of tweets regarding the "Unite the Right rally" is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure, and opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy against computation is tricky, and further studies are needed. Second, high-frequency observation may not be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observation interval. In that case, the observation frequency is strictly controlled by the interval length and is not adjustable at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authorsrsquo Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of Twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.

In contrast, the density for the event that $x$ is observed to be uninfected at $t$ is given as

$$p_0(x, t) = f(x)\,(1 - r(x, t)). \tag{4}$$

Suppose that, given a time $t$, the infectious status of a set of randomly picked nodes $\mathcal{N} \subset \mathbb{R}^p$ is observable and represented as

$$\mathcal{O}_t = \{ I(x, t) : x \in \mathcal{N} \}, \tag{5}$$

with $I(x, t) = 0$ being not infected and $I(x, t) = 1$ being infected. Then the likelihood function of the observations $\mathcal{O}_t$ can be written in the following way by using (3) and (4):

$$L(\mathcal{O}_t, E) = \prod_{x \in \mathcal{N}} \bigl(f(x)\, r(x, t)\bigr)^{I(x, t)} \bigl(f(x)\,(1 - r(x, t))\bigr)^{1 - I(x, t)}, \tag{6}$$

where we add the edge function $E$ into the likelihood because it affects $L$ through determining the functional form of $r$. Maximizing (6) yields the classical maximum likelihood (ML) estimator of $E$.

3.3. Nonparametric Likelihood Estimator and Kernel Smoothing. In the study of spreading processes, only distribution flows of the form (5) are available; the details of the link structure between nodes, represented by the edge function $E$, are not observable and thus need to be estimated. In this section we construct a nonparametric simulated maximum likelihood (NPSML) estimator of the functional form of $E$, given the observed distribution flows $\{\mathcal{O}_{t_i} : i = 1, \dots, T\}$, $t_1 < \cdots < t_T$, on a sequence of times. NPSML is an efficient nonparametric inference technique proposed by Kristensen and Shin [29]. It applies well to the case where an explicit expression of the likelihood function is not achievable, which is exactly the case we need to handle: because the distribution function $r$ in (6) is the solution to the functional differential equation (1), there is no clean analytic expression available for it.
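For reference, taking logarithms in (6) separates the part of the likelihood that depends on $E$ from the part that does not. The decomposition below follows directly from (6); the shorthand $\ell$ is introduced here only for illustration and is not notation from the original derivation.

$$\ell(\mathcal{O}_t, E) = \log L(\mathcal{O}_t, E) = \sum_{x \in \mathcal{N}} \log f(x) + \sum_{x \in \mathcal{N}} \Bigl[ I(x, t) \log r(x, t) + \bigl(1 - I(x, t)\bigr) \log\bigl(1 - r(x, t)\bigr) \Bigr].$$

Since the density $f$ does not depend on $E$, only the second sum matters when maximizing over $E$, and that sum is a Bernoulli log-likelihood with success probability $r(x, t)$.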

However, our task differs from the situation originally discussed in Kristensen and Shin [29]. First, the original NPSML applies nonparametric kernel smoothing to approximate the unknown likelihood function, while the model generating the likelihood function is still parametric; in (6), by contrast, the likelihood depends on the nonparametric edge function $E$. In this situation, one extra kernel smoothing step is needed to approximate $E$. Second, in Kristensen and Shin [29] and Kukacka and Barunik [36], simulation is conducted on the level of random variables, while in our case simulation is on the level of distributions, which is equivalent to numerically solving the mean-field equation (1). Finally, because of the nonparametric model setup, model identifiability has to be checked in order to guarantee the correctness of the resulting estimation.

Due to the first and second differences, we provide the following algorithm to generate the simulated likelihood function (in the following constructions we always use $K^p$ to denote the $p$-dimensional standard Gaussian kernel function, and $K^p_h(x) = K^p(x/h)/h^p$ for some positive constant $h$).

Step 1. Select a constant $dt > 0$ and large positive integers $M_1$ and $M_2$ ($dt$ is the length of every time step used for numerically solving the functional differential equation (1); $M_1$ and $M_2$ are the numbers of random samples that will be drawn to generate the kernel smoothing approximations to the unknown likelihood function and edge weight function).

Step 2. Draw $M_1$ random samples $x_1, \dots, x_{M_1} \in \mathbb{R}^p$ from the distribution $F$ and $M_2$ random samples $w_1, \dots, w_{M_2} \in \mathbb{R}^p \times \mathbb{R}^p$ from the product measure $F \otimes F$.

Step 3. Given $e_1, \dots, e_{M_2} \in [0, 1]$, construct the function $E$ as follows:

$$E(w) = \frac{\sum_{i=1}^{M_2} K^{2p}_{h_1}(w - w_i) \cdot e_i}{\sum_{j=1}^{M_2} K^{2p}_{h_1}(w - w_j)}. \tag{7}$$

Step 4. Given $t_i$, let $\mathcal{O}_{t_i} = \{I(y_1, t_i), \dots, I(y_M, t_i)\}$ denote the observation set at time $t_i$, whose cardinality is $M$, and construct the function $r(\cdot, t_i)$ as follows:

$$r(y, t_i) = \frac{\sum_{l=1}^{M} K^{p}_{h_2}(y - y_l) \cdot I(y_l, t_i)}{\sum_{j=1}^{M} K^{p}_{h_2}(y - y_j)}. \tag{8}$$

Step 5. Solve the mean-field equation (1) over the interval $[t_i, t_{i+1})$ at the set of sample points $x_1, \dots, x_{M_1}$ drawn in Step 2 by Euler's method with time step $dt$, subject to the initial condition $r(\cdot, t_i)$, as follows:

$$r(x_j, t_i + (k+1) \cdot dt) = r(x_j, t_i + k \cdot dt) + \bigl(1 - r(x_j, t_i + k \cdot dt)\bigr) \cdot \frac{dt}{M_1} \cdot \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t_i + k \cdot dt), \tag{9}$$

where $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor$ and $\lfloor a \rfloor$ is the greatest integer not exceeding $a$.

Step 6. For the observation set $\mathcal{O}_{t_{i+1}} = \{I(y_1, t_{i+1}), \dots, I(y_{M'}, t_{i+1})\}$ at $t_{i+1}$ with cardinality $M'$, generate the simulated density at the sample nodes $y_l$, $l = 1, \dots, M'$, as follows:

$$r(y_l, t_{i+1}) = \frac{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j) \cdot r(x_j, t_{i+1})}{\sum_{j=1}^{M_1} K^{p}_{h_3}(y_l - x_j)}, \tag{10}$$

and construct the simulated likelihood function as follows:

$$\hat{L}(\mathcal{O}_{t_{i+1}}; e_1, \dots, e_{M_2}) = \prod_{l=1}^{M'} \bigl(f(y_l)\, r(y_l, t_{i+1})\bigr)^{I(y_l, t_{i+1})} \cdot \bigl(f(y_l)\,(1 - r(y_l, t_{i+1}))\bigr)^{1 - I(y_l, t_{i+1})}. \tag{11}$$

The full-information likelihood function for all observation times can be constructed from (11) in the following way:

$$\hat{L}^{*}(\{\mathcal{O}_{t_i} : i = 1, \dots, T\}; e_1, \dots, e_{M_2}) = \prod_{i=1}^{T} \hat{L}(\mathcal{O}_{t_i}; e_1, \dots, e_{M_2}). \tag{12}$$

The estimator of the unknown edge function $E$ can be derived by maximizing the simulated full-information likelihood function (12) over $e_1, \dots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e^{*}_1, \dots, e^{*}_{M_2}$ in the way of (7).
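The six steps can be sketched in Python for a single pair of observation times. This is a minimal illustration under stated assumptions, not the authors' implementation: all function and argument names are hypothetical, the factor $f(y_l)$ is dropped because it does not depend on $e_1, \dots, e_{M_2}$, the kernel normalization $1/h^p$ is omitted because it cancels in the ratios (7), (8), and (10), and the Euler loop takes $\lfloor (t_{i+1} - t_i)/dt \rfloor$ steps.

```python
import numpy as np

def gaussian_kernel_weights(targets, centers, h):
    """Row-normalised Gaussian kernel weights, shape (len(targets), len(centers))."""
    d2 = ((targets[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logw = -0.5 * d2 / h**2
    logw -= logw.max(axis=1, keepdims=True)  # stabilise before exponentiating
    w = np.exp(logw)
    return w / w.sum(axis=1, keepdims=True)

def smooth(targets, centers, values, h):
    """Kernel-weighted average of `values`, the common form of (7), (8) and (10)."""
    return gaussian_kernel_weights(targets, centers, h) @ values

def euler_mean_field(r0, E_mat, t_span, dt):
    """Euler iteration (9) on the M1 simulation nodes."""
    r = r0.copy()
    for _ in range(int(np.floor(t_span / dt))):
        r = r + (1.0 - r) * dt * (E_mat @ r) / len(r)
    return np.clip(r, 0.0, 1.0)

def simulated_loglik(e, w, x, y0, I0, y1, I1, t_span, dt, h1, h2, h3, eps=1e-12):
    """Log of (11) without the f(y_l) factors, for one interval [t_i, t_{i+1})."""
    m1 = len(x)
    # All ordered pairs (x_j, x_l); row j * m1 + l holds the pair (x_j, x_l).
    pairs = np.concatenate([np.repeat(x, m1, axis=0), np.tile(x, (m1, 1))], axis=1)
    E_mat = smooth(pairs, w, e, h1).reshape(m1, m1)         # Step 3, eq. (7)
    r0 = smooth(x, y0, I0, h2)                              # Step 4, eq. (8)
    r1_x = euler_mean_field(r0, E_mat, t_span, dt)          # Step 5, eq. (9)
    r1_y = np.clip(smooth(y1, x, r1_x, h3), eps, 1 - eps)   # Step 6, eq. (10)
    return float(np.sum(I1 * np.log(r1_y) + (1 - I1) * np.log(1 - r1_y)))
```

Summing `simulated_loglik` over consecutive observation times gives the logarithm of (12) up to a constant; it could then be maximized over `e` within $[0, 1]^{M_2}$ by any box-constrained optimizer.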

Compared to NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $r$: one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the $M_1$ samples drawn in Step 2. However, when the system has a giant node set (say, millions), a sample size $M_1 \ll M$ can be applied in order to improve computational efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time $t$, it will be regarded as safe and missing from the consecutive tracking in the next few observation time points. In this interval-censored situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimensionality” despite its nonparametric essence, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument, and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated evaluation of the product of an $M_1 \times M_1$ matrix and an $M_1$-dimensional vector, so the computational complexity is of order $M_1^2$. Although $M_1$ can be taken much smaller than the number of nodes in the observations ($M$), it still has to increase as $M$ increases. So when $M$ is very large, $M_1$ has to be large as well, and the computational complexity of the entire estimation procedure is dominated by $M_1^2$. In this section we propose a fast algorithm that reduces the computational complexity in (9) to depend linearly on $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbors. In our setting there is no strict “neighbor” defined, but it is still possible to randomly pick one agent from the entire population, and the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 in the previous section is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \dots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \dots, x_{M_1}\}$.

Step 5(2). Compute

$$r(x_j, t + dt) = r(x_j, t) + \bigl(1 - r(x_j, t)\bigr) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\, dt. \tag{13}$$

Step 5(3). Repeat the above two steps for all $t = t_k$ (with $t_k = t_i + k \cdot dt$), $k = 0, 1, \dots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$, and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \dots, x_{M_1}$) is replaced with a scalar multiplication, so the resulting computational complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
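The difference between (9) and (13) can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: `edge_fn` is a hypothetical vectorised edge function that maps two arrays of node features to edge weights in $[0, 1]$, `toy_edge` is a made-up example of such a function, and partners are drawn uniformly with replacement, so a node may pick itself.

```python
import numpy as np

def fast_step(r, x, edge_fn, dt, rng):
    """One random-partner update (13): M1 edge evaluations instead of M1 * M1."""
    m1 = len(r)
    partner = rng.integers(0, m1, size=m1)   # Step 5(1): one partner l(j, t) per node j
    e = edge_fn(x, x[partner])               # E(x_j, x_{l(j,t)}) for every j
    return r + (1.0 - r) * e * r[partner] * dt   # Step 5(2)

def full_step(r, x, edge_fn, dt):
    """One update of the original scheme (9), kept only for comparison."""
    m1 = len(r)
    E_mat = np.stack([edge_fn(np.repeat(x[j:j + 1], m1, axis=0), x) for j in range(m1)])
    return r + (1.0 - r) * dt * (E_mat @ r) / m1

def toy_edge(a, b):
    """Hypothetical edge function with values in (0, 1], used only for illustration."""
    return np.exp(-np.sum((a - b) ** 2, axis=1))
```

Averaging `fast_step` over many independent partner draws from the same state approaches `full_step`, which is the one-step content of the accuracy argument that follows.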

For the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i (t_{i+1} - t_i)$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following:

(i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13);

(ii) denote by $\Delta(j, t)$ the increment $\Delta(j, t) = (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\operatorname{cov}(\Delta(j, t), \Delta(j', t') \mid r(x_j, t_i)) \le t_{i+1} - t_i$.

Property (i) and the inequality for $j' \ne j$ in (ii) are quite trivial. For $t_i < t < t' < t_{i+1}$, $\operatorname{cov}(\Delta(j, t), \Delta(j, t') \mid r(x_j, t_i))$ can be decomposed as the sum of the following two components (here $\mathbb{E}$ denotes expectation, to distinguish it from the edge function $E$):

$$
\begin{aligned}
A &= \operatorname{cov}\bigl(\Delta(j, t),\ (1 - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \operatorname{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{var}\bigl(r(x_j, t) \mid r(x_j, t_i)\bigr) = \operatorname{var}\bigl(r(x_j, t) - r(x_j, t_i) \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{d r(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\
B &= \operatorname{cov}\bigl(\Delta(j, t),\ (r(x_j, t') - r(x_j, t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j, t_i)\bigr) \\
&= \operatorname{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot \mathbb{E}\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{cov}\bigl(1 - r(x_j, t),\ r(x_j, t') - r(x_j, t) \mid r(x_j, t_i)\bigr) \\
&\le \mathbb{E}\bigl(\lvert r(x_j, t') - r(x_j, t) \rvert \mid r(x_j, t_i)\bigr) \\
&\le \left\| \frac{d r(x_j, \cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned}
\tag{14}
$$

where $\| \cdot \|_{\infty}$ is the $L^{\infty}$ norm of a bounded function. The above inequalities follow directly from the fact that $r$ is bounded by 1 and that its temporal derivative, given by (1), is also uniformly bounded by 1; statement (ii) then follows immediately.
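Property (i) can be written out in one line for a single time step. The calculation assumes, as in Step 5(1), that the partner index $l(j, t)$ is drawn uniformly from $\{1, \dots, M_1\}$ and independently of the current state; conditioning on the current values $r(\cdot, t)$ is made explicit here for clarity.

$$\mathbb{E}\bigl[ E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t) \mid r(\cdot, t) \bigr] = \frac{1}{M_1} \sum_{l=1}^{M_1} E(x_j, x_l)\, r(x_l, t).$$

Multiplying both sides by $(1 - r(x_j, t))\, dt$ shows that the conditional mean of the increment in (13) coincides with the increment in (9).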

Using Properties (i) and (ii) and the law of large numbers, it is straightforward that the difference between the likelihood functions constructed from (9) and from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two calculations of the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, with the fast algorithm, the choice of $dt$ is independent of the estimation accuracy, so in practice it can be selected directly as $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. As in many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; then the edge function can be decomposed into two components:

(i) the group weight function $E^1 : \mathbb{R}^p \to [0, 1]^Q$;

(ii) the group-level edge weight $E^2$, which is a $Q \times Q$ matrix with each entry valued in $[0, 1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

$$E(x, y) = E^1(x)^{\top} E^2 E^1(y), \tag{15}$$

where the image of $E^1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ represents vector transpose. The group weight function is required to satisfy that, for every $x$ with $E^1(x) = (s_1, \dots, s_Q)$, there exists only one $i \in \{1, \dots, Q\}$ with $s_i > 0$, which means every node has positive probability of belonging to at most one group; this guarantees the requirement that the groups constitute a partition of the node set.
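The decomposition (15) can be illustrated with a few lines of Python. This is a minimal sketch: the function names and the toy values of $E^1$ and $E^2$ are hypothetical, and each row of a group weight array holds $E^1$ evaluated at one node.

```python
import numpy as np

def check_group_weights(E1_vals):
    """True if every row has at most one strictly positive entry (partition requirement)."""
    E1_vals = np.asarray(E1_vals, dtype=float)
    return bool(np.all((E1_vals > 0).sum(axis=1) <= 1))

def block_edge(E1_x, E2, E1_y):
    """E(x, y) = E1(x)^T E2 E1(y) for all pairs of rows; shape (len(E1_x), len(E1_y))."""
    return np.asarray(E1_x) @ np.asarray(E2) @ np.asarray(E1_y).T

# Toy example with Q = 2 groups and three nodes (values chosen arbitrarily).
E2 = np.array([[0.9, 0.5],
               [0.0, 0.3]])
E1_nodes = np.array([[0.8, 0.0],    # node 0 belongs to group 0
                     [0.0, 1.0],    # node 1 belongs to group 1
                     [0.5, 0.0]])   # node 2 belongs to group 0
E = block_edge(E1_nodes, E2, E1_nodes)
```

Because each node carries a single positive weight, `E[j, l]` reduces to the product of the two node weights and the one entry of `E2` indexed by their groups.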

The estimation of the block network is equivalent to the estimation of (1) the group weight function $E^1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E^2$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable to this case. The only modification is in Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$; instead, it is applied to generate the estimate of the group weight $E^1$. Then the hidden weight function $E$ is constructed from the kernel-smoothed $E^1$ and the given interaction matrix $E^2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve for is only $M_1 Q + Q^2$, while the number is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with the number $M_1^2$ (although slowly); when the number of observed nodes is very large, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly improve computational efficiency.
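As a purely illustrative calculation with hypothetical sizes (not taken from the experiments in this paper), take $M_1 = 1000$ simulation nodes and $Q = 4$ groups:

$$M_1 Q + Q^2 = 1000 \cdot 4 + 4^2 = 4016, \qquad M_1^2 = 10^6.$$

Even if $M_2$ were chosen well below $M_1^2$, say $M_2 = 10^5$, the fully nonparametric model would still carry roughly 25 times as many free parameters as the block model.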

In addition, a block network is much easier to identify than general fully nonparametric networks, which will be discussed in the next section. Finally, under the block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is quite trivial and hence omitted).

Proposition 1. Denote by $E^1_i(x)$ the projection of the vector $E^1(x)$ to its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E^1_i(x) > 0\}$, which consists of the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

$$
r(x) =
\begin{cases}
0 & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E^2)^n r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\
1 & \text{else},
\end{cases}
\tag{16}
$$

where $r(y, t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector to its $i$th dimension, and $(E^2)^n$ denotes the $n$th power of the matrix $E^2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distribution with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E^2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$
\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \ddots & \vdots \\
0 & \cdots & x & \cdots & x \\
\vdots & \ddots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix},
\tag{17}
$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$r(x) = 1 \ \text{iff}\ x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i \iff r(x, t_0) > 0 \ \text{iff}\ x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i. \tag{18}$$

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E_1, E_2)$ are identifiable. However, in real applications, a spreading process can be observed at most a few times, so $m$ cannot be expected to be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable; i.e., there exists $E \neq E'$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set; i.e., the set $\mathcal{S}_{E_0, r(t_0)} := \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(t_0)}$ varies along with the initial infectious status $r(\cdot, t_0)$. Formally, $E \in \mathcal{S}_{E_0, r(t_0)}$ if and only if the following holds for all $n = 1, 2, \ldots$:

$$(\mathcal{M}_{1 - r(t_0)} \mathcal{K}_E)^n r(\cdot, t_0) \equiv (\mathcal{M}_{1 - r(t_0)} \mathcal{K}_{E_0})^n r(\cdot, t_0), \tag{19}$$

where $\mathcal{K}_E$ is a bounded operator over the functional space $L^2(\mathbb{R}^p)$ defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x, y) g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplicative operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the self-composition of an operator $n$ times. Equation (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(t_0)}$. On the other hand, if there exists an open set $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$, it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.

It is still an open question what conditions added to $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. But in the special case of block networks, one simple identifiability condition can be figured out. In fact, for block networks, it is straightforward that $(E_1^0, E_2^0)$ is identifiable if and only if there does not exist a pair $(E_1, E_2)$ that differs from the true $(E_1^0, E_2^0)$ but leads to the same likelihood function (6) in the limit case, which in turn holds if and only if, for the true $E_2^0$, the vector space spanned by the family of vectors $\{v_t : t \geq t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{v_t : t \geq t_0\}$ has full rank. Here $Q$ is the number of blocks, $v_t = (v_{t1}, \ldots, v_{tQ})^\top$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \ldots, Q$, $v_{tq} = \int_{\mathbb{R}^p} E_{1q}^0(x) r(x, t)\, dF(x)$, where $E_{1q}^0$ is the $q$th entry of $E_1^0(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$\det\left[ v_{t_0},\ \operatorname{diag}(c - v_{t_0}) E_2^0 v_{t_0},\ \ldots,\ \left(\operatorname{diag}(c - v_{t_0}) E_2^0\right)^{Q-1} \cdot v_{t_0} \right] \neq 0, \tag{20}$$

where $c$ is another $Q$-dimensional column vector $(c_1, \ldots, c_Q)^\top$ determined by the true $E_1^0$ function such that $c_q = \int_{\mathbb{R}^p} E_{1q}^0(x)\, dF(x)$ for $q = 1, \ldots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically" in the sense that the set of $E_2$'s that forces the determinant in (20) to be constantly equal to 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0, 1]^{Q \times Q}$, and for those $E_2$'s for which the determinant in (20) is not constantly 0, the set of $v_{t_0}$ that forces it to be 0 is only contained in a $(Q - 1)$-dimensional surface within $[0, 1]^Q$. Therefore, (20) holds for almost all $E_2$ and $v_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
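The determinant condition (20) can be checked numerically for a given block network. The sketch below is only an illustration under stated assumptions: the matrix `E2_EXAMPLE` uses the entries of (23) as read here, the vector `C_EXAMPLE` assumes uniformly distributed nodes on the unit square with the membership rule (22), and `V0_EXAMPLE` assumes that 30% of the nodes in every block are infected initially. All names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mat_vec(a: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def det(m: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= factor * a[col][k]
    return result


def identifiability_det(e2: Matrix, c: List[float], v0: List[float]) -> float:
    """Determinant of [v0, A v0, ..., A^(Q-1) v0] with A = diag(c - v0) E_2."""
    q = len(v0)
    # Row i of E_2 is scaled by (c_i - v0_i), which is diag(c - v0) @ E_2.
    a = [[(c[i] - v0[i]) * e2[i][j] for j in range(q)] for i in range(q)]
    columns = [v0[:]]
    for _ in range(q - 1):
        columns.append(mat_vec(a, columns[-1]))
    # det is invariant under transposition, so the columns can be passed as rows.
    return det(columns)


# Assumed inputs (see the lead-in): entries of (23), block masses, 30% infected.
E2_EXAMPLE = [[0.0, 1.0, 0.5], [0.8, 0.0, 0.3], [0.01, 0.0, 0.0]]
C_EXAMPLE = [2.0 / 9.0, 5.0 / 9.0, 2.0 / 9.0]
V0_EXAMPLE = [0.3 * c_q for c_q in C_EXAMPLE]

if __name__ == "__main__":
    print(identifiability_det(E2_EXAMPLE, C_EXAMPLE, V0_EXAMPLE))
```

A nonzero value indicates that the condition holds for these inputs; a degenerate choice such as an all-zero $E_2$ makes the determinant vanish.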

The "almost" identifiability for block networks guarantees that in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $E_1$ and $E_2$ from the NPSML asymptotically converge to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in previous sections is established.

Remark 2 (sparsity). In general, complete identifiability for both the general network and the block network is hard to achieve. However, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E_2\|_2^2 = \sum_{i,j} (E_{2,ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) would then guarantee that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the functional space; by compactness and convexity, there always exists a unique $E$ (or $E_2$) whose $L^2$-distance to the origin reaches the minimum.
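One way to write the penalized problem described in Remark 2 is sketched below. The symbol $\ell$ for the log-likelihood in (6) and the penalty weight $\lambda > 0$ are notational assumptions introduced here; the remark subtracts the norm directly, which corresponds to $\lambda = 1$.

```latex
\hat{E}_{\mathrm{sparse}}
  = \arg\max_{E} \left\{ \ell(E) - \lambda \, \|E\|_2^2 \right\},
\qquad
(\hat{E}_1, \hat{E}_2)_{\mathrm{sparse}}
  = \arg\max_{E_1, E_2} \left\{ \ell(E_1, E_2) - \lambda \, \|E_2\|_2^2 \right\}.
```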

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes, which are drawn purely randomly from the unit cube $[0, 1)^2$; thus these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

$$E(x, y) = 1 - \frac{\sqrt{\langle x - y, x - y \rangle}}{2}. \tag{21}$$

Example 2 (block network). Set $Q = 3$; the block membership function $E_1$ satisfies

$$E_1(x, y) = \begin{cases} (1, 0, 0) & \text{if } \frac{x + y}{2} < \frac{1}{3}, \\ (0, 1, 0) & \text{if } \frac{1}{3} \leq \frac{x + y}{2} < \frac{2}{3}, \\ (0, 0, 1) & \text{else.} \end{cases} \tag{22}$$

The matrix $E_2$ is given as follows:

$$E_2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23}$$

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, and the infected nodes are randomly picked from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The observation of the distribution flows is supposed to be available only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
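The simulation setup above can be sketched as a forward Euler scheme. This is a minimal illustration, not the authors' implementation: it assumes the mean-field dynamics take the form $\partial_t r(x, t) = (1 - r(x, t)) \int E(x, y) r(y, t)\, dF(y)$, which is consistent with the equilibrium condition in Proposition 1, and it assumes the edge function is one minus half the Euclidean distance. The function names and the integral approximation by a sample average are likewise assumptions.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Node = Tuple[float, float]


def edge_weight(x: Node, y: Node) -> float:
    """Assumed edge function: one minus half the Euclidean distance."""
    return 1.0 - math.dist(x, y) / 2.0


def simulate_flow(
    nodes: Sequence[Node],
    r0: Sequence[float],
    edge: Callable[[Node, Node], float],
    dt: float,
    horizon: float,
    observe_at: Sequence[float],
) -> Dict[float, List[float]]:
    """Forward Euler steps of the assumed mean-field dynamics; returns snapshots."""
    n = len(nodes)
    weights = [[edge(x, y) for y in nodes] for x in nodes]
    r = list(r0)
    steps = round(horizon / dt)
    obs_steps = {round(t / dt): t for t in observe_at}
    snapshots: Dict[float, List[float]] = {}
    for step in range(steps + 1):
        if step in obs_steps:
            snapshots[obs_steps[step]] = list(r)
        # Sample-average approximation of the integral of E(x, y) r(y) dF(y).
        pressure = [sum(w * rj for w, rj in zip(row, r)) / n for row in weights]
        r = [ri + dt * (1.0 - ri) * p for ri, p in zip(r, pressure)]
    return snapshots


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(50)]
    seeds = set(random.sample(range(50), 15))  # 30% of the nodes start infected
    start = [1.0 if i in seeds else 0.0 for i in range(50)]
    flows = simulate_flow(pts, start, edge_weight, 0.01, 5.0, (0, 1, 2, 3, 4, 5))
    for day in sorted(flows):
        print(day, round(sum(flows[day]) / 50, 3))
```

Only the six daily snapshots are returned, mirroring the assumption that the flow is observed at the initial time and at the end of every day.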

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance in the situation where no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. Axis labels: estimated edge weight, true edge weight; legend: Est vs. True, y = x.

If all nodes are included in the computation of the NPSML estimator, there is in principle a 40000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $M_1 = 20$ nodes from the uniform distribution, which will be used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, the $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space; thus $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the real observed $r$ on 200 nodes to its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on 20 sampled nodes to its values on the full node set.
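The rule-of-thumb kernel widths quoted above can be reproduced with a few lines. The helper name below is hypothetical; the formula $h = n^{-1/(k+1)}$ and the sample sizes and dimensions are those stated in the text.

```python
def rule_of_thumb_bandwidth(n: int, k: int) -> float:
    """Kernel width h = n ** (-1 / (k + 1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))


# h1: edge function on M_2 = 400 node pairs in dimension 4.
h1 = rule_of_thumb_bandwidth(400, 4)
# h2: smoothing the observed r on 200 nodes in dimension 2.
h2 = rule_of_thumb_bandwidth(200, 2)
# h3: extending the estimated r from 20 sampled nodes in dimension 2.
h3 = rule_of_thumb_bandwidth(20, 2)

if __name__ == "__main__":
    print(h1, h2, h3)
```

With this choice, $n h^k = n^{1/(k+1)}$ grows with $n$ and $n h^{k+2} = n^{-1/(k+1)}$ shrinks, which matches the two rate conditions.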

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$, $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
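The selection rule can be sketched as follows. This is an illustration under assumptions: the criterion is written in the "larger is better" form, log-likelihood minus half the parameter count times the log of the number of observations; the parameter count is taken as $M_1 Q + Q^2$; and the log-likelihood values and the observation count in the usage example are made up for demonstration.

```python
import math
from typing import Dict


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Criterion in the 'larger is better' convention (assumed form)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


def select_block_number(log_likelihoods: Dict[int, float], m1: int, n_obs: int) -> int:
    """Return the Q with maximal criterion; parameter count assumed to be M_1 Q + Q^2."""
    scores = {q: bic(ll, m1 * q + q * q, n_obs) for q, ll in log_likelihoods.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods for Q = 1, ..., 5.
    fitted = {1: -900.0, 2: -700.0, 3: -500.0, 4: -495.0, 5: -492.0}
    print(select_block_number(fitted, m1=20, n_obs=1200))
```

The penalty grows with $Q$ through the parameter count, so a small gain in likelihood from an extra block does not change the selected model.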

Table 1: Estimation accuracy of $E_2$.
Entries      Bias     Std     P value
$E_{2,11}$   0.021    0.032   0.468
$E_{2,12}$   -0.006   0.012   0.383
$E_{2,13}$   -0.003   0.029   0.057
$E_{2,21}$   -0.001   0.029   0.028
$E_{2,22}$   0.022    0.022   0.66
$E_{2,23}$   -0.002   0.028   0.059
$E_{2,31}$   0.005    0.024   0.165
$E_{2,32}$   0.018    0.029   0.48
$E_{2,33}$   0.016    0.021   0.554

In Figure 1, we plot the difference between the real edge function and the NPSML estimated edge function on the set $\mathcal{G} \times \mathcal{G}$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs of (estimated weight, true weight); the blue line sketches the identity function $y = x$; therefore, a red dot being closer to the blue line means better fitting accuracy. Apparently, for most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in \mathcal{G} \times \mathcal{G}$ with the null hypothesis $E_{xy} = (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples fewer than 10% of all 10000 evaluation pairs in $\mathcal{G} \times \mathcal{G}$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.
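The pairwise test described above can be sketched with the standard library alone, using the identity that the upper-tail probability of a chi-square variable with 1 degree of freedom is $\operatorname{erfc}(\sqrt{s/2})$. The function names are hypothetical, and the variances are assumed to be supplied by a separate bootstrap step that is not shown.

```python
import math
from typing import Sequence


def chi2_1_pvalue(stat: float) -> float:
    """Upper-tail probability P(X > stat) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def rejects_null(estimate: float, truth: float, variance: float, level: float = 0.90) -> bool:
    """True if the squared deviation is significant at the given confidence level."""
    stat = (estimate - truth) ** 2 / variance
    return chi2_1_pvalue(stat) < 1.0 - level


def rejection_ratio(
    estimates: Sequence[float],
    truths: Sequence[float],
    variances: Sequence[float],
    level: float = 0.90,
) -> float:
    """Share of node pairs whose null hypothesis is rejected."""
    flags = [rejects_null(e, t, v, level) for e, t, v in zip(estimates, truths, variances)]
    return sum(flags) / len(flags)
```

Under the null hypothesis, roughly 10% of the pairs would be expected to be rejected at the 90% level, which is the benchmark the reported ratio is compared against.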

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23). The first column presents the estimation bias; the second and third columns are the empirical standard deviations and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is quite good.

For a robustness check, we also consider synthetic data generated with different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of the NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. As for the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered a bit for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that, for the general network, there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the famous event "Unite the Right rally". The "Unite the Right rally", also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it were immediately spread through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, it is not possible to track how every single message is spread from user to user with our collected data; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users on record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
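The conversion from reaction times to day-by-day distribution flows can be sketched as follows. The input layout (a mapping from a user identifier to the date of that user's first reaction) and all names are assumptions made for illustration; the actual storage format of the collected data is not described here.

```python
from datetime import date
from typing import Dict, List, Sequence


def distribution_flows(
    first_reaction: Dict[str, date],
    users: Sequence[str],
    origin: date,
    n_days: int,
) -> List[List[int]]:
    """Return one {0, 1}-valued vector per day t = 0, ..., n_days.

    Coordinate i on day t is 1 if and only if user i has reacted at least once by day t.
    """
    flows = []
    for day in range(n_days + 1):
        snapshot = [
            1 if user in first_reaction and (first_reaction[user] - origin).days <= day else 0
            for user in users
        ]
        flows.append(snapshot)
    return flows


if __name__ == "__main__":
    # Hypothetical users and reaction dates.
    reactions = {"u1": date(2017, 8, 11), "u2": date(2017, 8, 13)}
    for vector in distribution_flows(reactions, ["u1", "u2", "u3"], date(2017, 8, 11), 3):
        print(vector)
```

Because a user who has joined never leaves, every coordinate is nondecreasing in $t$.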

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record, and it is likely that many users belong to the same latent group so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computation burden, we adopt the bootstrap method: we randomly pick 10000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is the average over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As the block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computation burden, we take a much smaller $M_1$ than the number of all users on record (10000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. The feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to different choices of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
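The selection of the $M_1$ evaluation nodes as cluster centres can be sketched with a plain Lloyd-style K-means. This is a minimal standard-library illustration, not the authors' code: the initialization by random sampling, the iteration cap, and the absence of any feature scaling are all assumptions, and the function name is hypothetical.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def kmeans_centres(points: Sequence[Point], m1: int, n_iter: int = 50, seed: int = 0) -> List[List[float]]:
    """Return m1 cluster centres obtained by Lloyd's algorithm."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(list(points), m1)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre.
        clusters: List[List[Point]] = [[] for _ in range(m1)]
        for p in points:
            nearest = min(range(m1), key=lambda j: math.dist(p, centres[j]))
            clusters[nearest].append(p)
        # Update step: move every centre to the mean of its members.
        new_centres = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centres.append([sum(p[d] for p in members) / len(members) for d in range(dim)])
            else:
                new_centres.append(centres[j])  # an empty cluster keeps its centre
        if new_centres == centres:
            break
        centres = new_centres
    return centres


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(kmeans_centres(data, 2)))
```

In the setting above, `points` would hold the 5-dimensional user feature vectors and `m1` would be 100, so that each returned centre plays the role of one synthetic user.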

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.

Figure 2: BIC for different block numbers. Vertical axis: BIC (ticks from -10350 to -10000); horizontal axis: block_dim = 3, block_dim = 4, block_dim = 5.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10000+ users. Because the clustering is performed on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it does reflect the information loss when the higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a); its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.
Community                   Followers   Friends   History posts   Lat     Lon
Big name community          1474739     123835    149494          30.78   -89.99
Famous active community     535641      25967     137372          34.18   -117.59
Famous inactive community   500197      3519      102222          40.75   -82.55
Nobody community            21658       3770      113593          46.77   -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and meanwhile they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and their posting history is not very active either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community: users in this community also have many followers, but, unlike the inactive community, their average numbers of friends and history posts are large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find that (1) for the numbers of followers and friends, the spatial distribution is highly uneven within every community: there are only one or two synthetic users with extremely large values. This uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media, namely, that every user has the same right to express their own opinion, whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern among the four communities (the segregation can be better visualized in Figure 4); to better understand the source of the spatial segregation, future studies are needed.

Figure 3: Spatial distribution of features of users within different communities. (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities. Legends: Community (famous inactive, famous active, big name, nobody); Followers, Friends, and Post history (five value classes each); scale bar 0 to 2000 km.

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. Apparently, a significant hierarchical structure can be concluded from the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Users in the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Table 3: Link matrix of 4 communities.
                               From big name   From famous active   From famous inactive   From nobody
                               community       community            community              community
To big name community          1               1                    1                      1
To famous active community     1               1                    0.701                  0.637
To famous inactive community   0.175           0.365                0                      0
To nobody community            0               0                    0                      0.01

Figure 4: Estimate for interaction matrix. Legends: Community (famous inactive, famous active, big name, nobody); Weight classes 0.01 - 0.02, 0.02 - 0.17, 0.17 - 0.36, 0.36 - 0.70, 0.70 - 1.00; scale bar 0 to 2000 km.

From the above analysis, quite a few interesting features can be drawn from the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining work on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and re-estimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes, namely, that the specific spreading trajectory does not exist and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for those spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effec-tiveness of our estimation procedure its practical usefulnessis illustrated by a real data application where the spreadingprocess of tweet articles regarding the event ldquoUnite theRight rallyrdquo is studied and a block network is fitted Thefitting result shows that Twitter users involved in the spread-ing process can be divided into four communities whichcorrespond to big name users famous active and inactiveusers and nobody users Connections among these fourcommunities display a remarkable hierarchical structureopinion discrimination exists as expected among differentcommunities


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in lifting the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy against computation is tricky, and further studies are needed. Second, high-frequency observation may not always be possible. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observational interval. In that case the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint leading to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions would be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors’ Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was done by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang’Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the tensorflow deep neural network to classify mainland china visitor behaviours in hong kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of hurricane sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in atlanta and seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International center for financial asset management and engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.


The full-information likelihood function for all observation times can be constructed from (11) in the following way:

$$\mathcal{L}^{*}\left(O_{t_i},\ i = 1, \ldots, T;\ e_1, \ldots, e_{M_2}\right) = \prod_{i=1}^{T} \mathcal{L}\left(O_{t_i};\ e_1, \ldots, e_{M_2}\right). \tag{12}$$

The estimator of the unknown edge function $E$ is obtained by maximizing the simulated full-information likelihood function (12) over $e_1, \ldots, e_{M_2}$; the final estimator $E^{*}$ is constructed from the optimal $e_1^{*}, \ldots, e_{M_2}^{*}$ in the way of (7).
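As a minimal numerical sketch of (12), the product over observation times can be evaluated on the log scale, which leaves the maximizer unchanged and avoids floating-point underflow when $T$ is large. The function name `full_log_likelihood` and its input (a list of per-time simulated likelihood values, assumed to be already computed from (11)) are illustrative assumptions and not part of the original algorithm.

```python
import numpy as np


def full_log_likelihood(per_time_likelihoods):
    """Log of the product in (12), computed as a sum of logs.

    per_time_likelihoods: hypothetical sequence holding one simulated
    likelihood value per observation time t_1, ..., t_T.
    """
    values = np.asarray(per_time_likelihoods, dtype=float)
    if np.any(values <= 0.0):
        # A zero factor makes the whole product zero.
        return -np.inf
    return float(np.sum(np.log(values)))


# The raw product of many small factors underflows to zero,
# while the log-scale sum stays finite.
tiny = [1e-200] * 10
raw_product = float(np.prod(tiny))
log_value = full_log_likelihood(tiny)
```

Maximizing `full_log_likelihood` over the edge parameters would then be equivalent to maximizing (12) itself, since the logarithm is strictly increasing.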

Compared to NPSML in Kristensen and Shin [29], the algorithm in our study includes one extra sampling step to draw $M_2$ random points from $\mathbb{R}^p \times \mathbb{R}^p$, which are used for approximating the unknown $E$. In addition, there are two kernel smoothing steps (Steps 4 and 6) regarding the density function $r$, one for the initial density at the starting time $t_i$ and the other for the end-time density at $t_{i+1}$. The two kernel smoothing steps are not required when the total number of nodes is small (a few hundred or a few thousand), in which case the whole set of nodes is directly used as the $M_1$ samples drawn in Step 2. However, when the system has a giant node set (say, millions), a sample size $M_1 \ll M$ can be applied in order to lift the computation efficiency. Moreover, the node sets observed at different observation times may not always be identical; it is more often the case that when a node is tracked to be uninfected at some time $t$, it will be regarded as safe and missing from the consecutive tracking in the next few observation time points. In this interval-censored situation, the $M_1$ sampled nodes and the two kernel smoothing steps are needed to avoid the noise induced by censoring.

As documented in Kristensen and Shin [29] and Kukacka and Barunik [36], the NPSML estimator does not suffer from the “curse of dimensionality” despite its nonparametric essence, because the number of simulation samples is independent of the number of observation samples. When the latter is large, the inefficiency induced by kernel smoothing vanishes during the aggregation involved in the likelihood function. By the same argument, and the fact that in most real-world applications the number of observed nodes is giant, our modified NPSML estimator is free from the curse of dimensionality as well.

3.4. A Fast Algorithm. As shown in (9), the estimation procedure requires repeated evaluation of the multiplication between an $M_1 \times M_1$ matrix and an $M_1$-dimensional vector; the computation complexity is of the order $M_1^2$. Although $M_1$ can be taken as much smaller than the number of nodes in observations ($M$), it still has to increase as $M$ increases. So when $M$ is giant, $M_1$ has to be large as well, and the computation complexity of the entire estimation procedure will be dominated by $M_1^2$. In this section, we propose a fast algorithm that reduces the computation complexity in (9) to be linearly dependent on $M_1$, which is reasonable and implementable in practice.

The idea of the fast algorithm comes from the technique of agent-based simulation (ABS). In every iteration of ABS, every agent in the network is only required to interact with another agent randomly picked from its neighbors. In our setting there is no strict “neighbor” defined, but it is still possible to randomly pick one agent from the entire population, and the interaction is only counted between the given agent and its randomly picked partner. Formally, Step 5 in the previous section is split into three substeps.

Step 5(1). For fixed $t$ and fixed $x_j \in \{x_1, \ldots, x_{M_1}\}$, randomly pick one $x_{l(j,t)}$ from $\{x_1, \ldots, x_{M_1}\}$.

Step 5(2). Compute

$$r(x_j, t + dt) = r(x_j, t) + \left(1 - r(x_j, t)\right) \cdot E\left(x_j, x_{l(j,t)}\right) r\left(x_{l(j,t)}, t\right) dt. \tag{13}$$

Step 5(3). Repeat the above two steps for all $t = t_k$, $k = 0, 1, \ldots, \lfloor (t_{i+1} - t_i)/dt \rfloor - 1$, and for all $t_i$'s.

Comparing (9) and (13), the main difference is that the inner product of vectors (i.e., the sum over $x_1, \ldots, x_{M_1}$) is replaced with a scalar multiple, so the resulting computation complexity for all $M_1$ nodes depends linearly on $M_1$, which is significantly faster than the original algorithm.
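The contrast between the two updates can be sketched in Python as follows. The sketch assumes that (9) averages the influence of all $M_1$ sampled nodes with equal weights, and it stores the edge function as a precomputed array `E[j, l]` $= E(x_j, x_l)$ purely for illustration; the function names `dense_step` and `fast_step` are hypothetical.

```python
import numpy as np


def dense_step(r, E, dt):
    """One Euler step in the assumed form of (9): every node interacts
    with all M1 sampled nodes, costing O(M1^2) per step."""
    force = E @ r / len(r)  # average influence received by each node
    return r + (1.0 - r) * force * dt


def fast_step(r, E, dt, rng):
    """One step of (13): every node interacts with a single randomly
    picked partner, costing O(M1) per step."""
    m1 = len(r)
    partner = rng.integers(0, m1, size=m1)  # l(j, t), drawn uniformly
    idx = np.arange(m1)
    return r + (1.0 - r) * E[idx, partner] * r[partner] * dt


rng = np.random.default_rng(0)
m1 = 200
r_init = rng.uniform(0.0, 0.2, m1)      # toy initial infection levels
E_toy = rng.uniform(0.0, 1.0, (m1, m1))  # toy edge weights in [0, 1]
r_dense = dense_step(r_init, E_toy, dt=0.1)
r_fast = fast_step(r_init, E_toy, dt=0.1, rng=rng)
```

In an actual implementation of the fast step, only the $M_1$ values $E(x_j, x_{l(j,t)})$ would need to be evaluated at each step, which is where the linear cost comes from; the full array is kept here only so that both updates can be compared on the same input.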

For the accuracy of the fast algorithm, we claim that, compared to the original algorithm, the accuracy loss induced by the speed-up is controlled by a constant multiple of $\Delta t = \max_i \{t_{i+1} - t_i\}$. In fact, due to the randomness of the $x_{l(j,t)}$'s, it is easy to verify the following:

(i) the expectation of the left-hand side of (9) is identical to the expectation of the left-hand side of (13);

(ii) denote by $\Delta(j,t)$ the increment $\Delta(j,t) = (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)$; then for $t_i \le t, t' \le t_{i+1}$, $1 \le j, j' \le M_1$, and all $t_i$'s, $\operatorname{cov}(\Delta(j,t), \Delta(j',t') \mid r(x_j,t_i)) \le t_{i+1} - t_i$.

Property (i) and the case $j' \neq j$ in (ii) are quite trivial. For $j' = j$ and $t_i < t < t' < t_{i+1}$, $\operatorname{cov}(\Delta(j,t), \Delta(j,t') \mid r(x_j,t_i))$ can be decomposed as the sum of the following two components:

$$\begin{aligned}
A &= \operatorname{cov}\left(\Delta(j,t),\ (1 - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\right) \\
&= \operatorname{var}\left(r(x_j,t) \mid r(x_j,t_i)\right) \cdot \mathbb{E}\left(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\right) \cdot \mathbb{E}\left(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\right) \\
&\le \operatorname{var}\left(r(x_j,t) \mid r(x_j,t_i)\right) = \operatorname{var}\left(r(x_j,t) - r(x_j,t_i) \mid r(x_j,t_i)\right) \\
&\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i)^2 \le (t_{i+1} - t_i)^2, \\
B &= \operatorname{cov}\left(\Delta(j,t),\ (r(x_j,t') - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\right) \\
&= \operatorname{cov}\left(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\right) \cdot \mathbb{E}\left(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\right) \cdot \mathbb{E}\left(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\right) \\
&\le \operatorname{cov}\left(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\right) \\
&\le \mathbb{E}\left(\left| r(x_j,t') - r(x_j,t) \right| \mid r(x_j,t_i)\right) \\
&\le \left\| \frac{dr(x_j,\cdot)}{dt} \right\|_{\infty} (t - t_i) \le (t_{i+1} - t_i),
\end{aligned} \tag{14}$$

where $\| \cdot \|_{\infty}$ is the $L^{\infty}$ norm of a bounded function and $\mathbb{E}$ denotes expectation (written differently from the edge function $E$ to avoid confusion). The above inequalities hold straightforwardly from the fact that $r$ is bounded by 1 and that its temporal derivative is given by (1), which is also uniformly bounded by 1; statement (ii) then follows immediately.

Using Properties (i) and (ii) and the law of large numbers, it is straightforward that the difference between the likelihood function constructed from (9) and that constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two ways of calculating the likelihood function are asymptotically identical, which leads to the same estimator of the hidden network.
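Property (i) can be checked numerically with a small Monte Carlo experiment. The sketch below assumes, as an illustration only, that (9) weights all $M_1$ sampled nodes equally and that partners are drawn uniformly; the helper names and the toy inputs are hypothetical.

```python
import numpy as np


def dense_increment(r, E):
    """Per-unit-time increment under the assumed form of (9)."""
    return (1.0 - r) * (E @ r) / len(r)


def mean_fast_increment(r, E, n_draws, rng):
    """Monte Carlo average of the increment in (13) over random partners."""
    m1 = len(r)
    partners = rng.integers(0, m1, size=(n_draws, m1))  # one row per draw
    idx = np.arange(m1)
    # Broadcasting pairs node j with its partner in every draw.
    increments = (1.0 - r) * E[idx, partners] * r[partners]
    return increments.mean(axis=0)


rng = np.random.default_rng(1)
m1 = 50
r_toy = rng.uniform(0.0, 1.0, m1)
E_toy = rng.uniform(0.0, 1.0, (m1, m1))

gap = np.max(np.abs(mean_fast_increment(r_toy, E_toy, 20000, rng)
                    - dense_increment(r_toy, E_toy)))
```

The quantity `gap` is the largest discrepancy across nodes between the averaged fast increment and the dense increment; it shrinks at the usual Monte Carlo rate as the number of draws grows.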

Also notice that, with the fast algorithm, the choice of $dt$ is independent of the estimation accuracy, so in practice it can be selected directly as $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. As in many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; then the edge function can be decomposed into two components:

(i) the group weight function $E_1 : \mathbb{R}^p \to [0, 1]^Q$;

(ii) the group-level edge weight $E_2$, which is a $Q \times Q$ matrix with each entry valued in $[0, 1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

$$E(x, y) = E_1(x)^{\top} E_2 E_1(y), \tag{15}$$

where the image of $E_1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ represents the vector transpose. The group weight function is required to satisfy that, for every $x$ with $E_1(x) = (s_1, \ldots, s_Q)$, there exists only one $i \in \{1, \ldots, Q\}$ with $s_i > 0$. This means every node has positive probability of belonging to at most one group, which guarantees the requirement that the groups constitute a partition of the node set.
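A small numerical instance of (15) may help fix ideas. In the sketch below the helper `group_weight` is hypothetical: it takes the group index of a node directly instead of evaluating $E_1$ on node features, and all numbers are made up for illustration.

```python
import numpy as np


def group_weight(group, strength, n_groups):
    """Hypothetical stand-in for E1(x): a Q-vector with a single
    positive entry, as required by the partition constraint."""
    vec = np.zeros(n_groups)
    vec[group] = strength
    return vec


def block_edge(e1_x, E2, e1_y):
    """Edge weight E(x, y) = E1(x)^T E2 E1(y) from (15)."""
    return float(e1_x @ E2 @ e1_y)


Q = 3
E2_toy = np.array([[0.9, 0.2, 0.6],
                   [0.0, 0.7, 0.1],
                   [0.0, 0.0, 0.5]])
e1_x = group_weight(0, 0.8, Q)  # node x sits in group 0 with weight 0.8
e1_y = group_weight(2, 0.5, Q)  # node y sits in group 2 with weight 0.5
weight_xy = block_edge(e1_x, E2_toy, e1_y)  # 0.8 * 0.6 * 0.5
```

Because each weight vector has one positive entry, the bilinear form collapses to the product of the two membership weights and the single matrix entry linking the two groups.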

The estimation of the block network is equivalent to the estimation of (1) the group weight function $E_1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E_2$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable in that case. The only modification is to Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$. Instead, it is applied to generate the estimate of the group weight $E_1$. Then the hidden weight function $E$ is constructed from the kernel-smoothed $E_1$ and the given interaction matrix $E_2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve is only $M_1 Q + Q^2$, while the number is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with $M_1^2$ (although slowly); when the number of observed nodes is giant, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly lift the computation efficiency.

In addition, a block network is much easier to identify than general fully nonparametric networks, which will be discussed in the next section. Finally, under a block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is quite trivial and hence omitted).

Proposition 1. Denote by $E_{1,i}(x)$ the projection of the vector $E_1(x)$ to its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E_{1,i}(x) > 0\}$, which consists of the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x, y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

$$r(x) = \begin{cases} 0, & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E_2)^n r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\ 1, & \text{else}, \end{cases} \tag{16}$$

where $r(y, t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector to its $i$th dimension, and $(E_2)^n$ denotes the $n$th power of the matrix $E_2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distributions with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E_2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17) (where $x$ marks a strictly positive entry),

$$\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \ddots & \vdots \\
0 & \cdots & x & \cdots & x \\
\vdots & \ddots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix}, \tag{17}$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$r(x) = 1 \ \text{ iff } \ x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i \iff r(x, t_0) > 0 \ \text{ iff } \ x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i. \tag{18}$$
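The role of the triangular structure can be illustrated with a small group-level simulation. The sketch below rests on several simplifying assumptions that are not in the text: every node belongs to its group with weight one, all nodes of a group share the same infection level $\rho_a$, each group carries the same probability mass $w_a$, and entry $(a, b)$ of $E_2$ is read as the influence of group $b$ on group $a$, in line with (13) and (15). Under these assumptions the mean-field dynamics reduce to $d\rho_a/dt = (1 - \rho_a) \sum_b (E_2)_{ab}\, w_b\, \rho_b$. The function name is hypothetical.

```python
import numpy as np


def simulate_group_infection(E2, weights, rho0, dt=0.05, n_steps=4000):
    """Euler iteration of the assumed group-level mean-field dynamics."""
    rho = np.array(rho0, dtype=float)
    for _ in range(n_steps):
        force = E2 @ (weights * rho)  # influence received by each group
        rho = rho + (1.0 - rho) * force * dt
    return rho


Q = 4
E2_upper = np.triu(np.ones((Q, Q)))       # positive on and above the diagonal
weights = np.full(Q, 1.0 / Q)             # equal mass per group (assumption)
rho_start = np.array([0.0, 0.1, 0.0, 0.0])  # seed only in the second group
rho_end = simulate_group_infection(E2_upper, weights, rho_start)
```

With the seed placed in the second group, the first two groups converge to full infection while the last two stay at zero, because no positive entry of $E_2$ lets an infected group act on them. This matches the qualitative message of Proposition 1 that the support of the equilibrium is dictated by the powers of $E_2$ applied to the initial distribution.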

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E_1, E_2)$ are identifiable. However, in real applications a spreading process can be observed at most a few times, and $m$ cannot be expected to be very large. In that case the fully nonparametric edge function $E$ is no longer fully identifiable, i.e., there exists $E \neq E'$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set, i.e., the set $\mathcal{S}_{E_0, r(\cdot, t_0)} := \{E : L(O_t, E) = L(O_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot, t_0)}$ varies along with the initial infectious status $r(\cdot, t_0)$. Formally, $E \in \mathcal{S}_{E_0, r(\cdot, t_0)}$ if and only if the following holds for all $n = 1, 2, \ldots$:

$$\left(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_E\right)^n r(\cdot, t_0) \equiv \left(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0}\right)^n r(\cdot, t_0), \tag{19}$$

where $\mathcal{K}_E$ is a bounded operator over the function space $L^2(\mathbb{R}^p)$ defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x, y)\, g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplication operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the self-composition of an operator $n$ times.

Condition (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. On the other hand, if there exists an open set $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$, it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.
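The first non-identifiability example can be reproduced on a finite node sample, where the operators in (19) become matrices. The sketch below is an illustration under assumptions: the node distribution $F$ is replaced by equal weights on $m$ sampled nodes, the set $U$ is taken to be the first five nodes, and all names and numbers are hypothetical.

```python
import numpy as np


def iterate_operator(E, r0, F, n_max):
    """Return (M_{1-r0} K_E)^n r0 for n = 1, ..., n_max on a finite sample.

    K_E acts as g -> sum_l E[j, l] * g[l] * F[l]; M_{1-r0} multiplies
    row j by (1 - r0[j]).
    """
    MK = (1.0 - r0)[:, None] * (E * F[None, :])
    iterates, g = [], r0.copy()
    for _ in range(n_max):
        g = MK @ g
        iterates.append(g)
    return iterates


rng = np.random.default_rng(2)
m = 30
F_toy = np.full(m, 1.0 / m)            # equal node weights (assumption)
E0_toy = rng.uniform(0.0, 1.0, (m, m))  # plays the role of the true E_0
r0_toy = rng.uniform(0.1, 0.5, m)
r0_toy[:5] = 1.0                       # nodes in U are infected before t_0

# Perturb E only on the band U x R^p (rows of nodes in U).
E_band = E0_toy.copy()
E_band[:5, :] = rng.uniform(0.0, 1.0, (5, m))

# Control: perturb a row of a node outside U.
E_other = E0_toy.copy()
E_other[10, :] = 0.5 * E0_toy[10, :]

seq_true = iterate_operator(E0_toy, r0_toy, F_toy, 6)
seq_band = iterate_operator(E_band, r0_toy, F_toy, 6)
seq_other = iterate_operator(E_other, r0_toy, F_toy, 6)

band_is_invisible = all(np.allclose(a, b) for a, b in zip(seq_true, seq_band))
other_is_visible = not np.allclose(seq_true[0], seq_other[0])
```

The perturbation on the band leaves every iterate unchanged because the factor $1 - r(x, t_0)$ vanishes on $U$, whereas changing the row of a node that is not yet fully infected alters the first iterate already.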

It is still an open question what conditions on $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. But in the special case of block networks, one simple identifiability condition can be worked out. In fact, for block networks, $(E_{10}, E_{20})$ is identifiable if and only if there does not exist a pair $(E_1, E_2)$ that differs from the true $(E_{10}, E_{20})$ but leads to the same likelihood function (6) in the limit case. This holds if and only if, for the true $E_{20}$, the vector space spanned by the family of vectors $\{v_t : t \ge t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{v_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $v_t = (v_{t,1}, \ldots, v_{t,Q})^\top$ is a $Q$-dimensional column vector for every $t$, and, for each $q = 1, \ldots, Q$, $v_{t,q} = \int_{\mathbb{R}^p} E_{10,q}(x) r(x, t)\, dF(x)$, where $E_{10,q}$ is the $q$th entry of $E_{10}(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$\det\left[\, v_{t_0},\ \operatorname{diag}(c - v_{t_0})\, E_{20}\, v_{t_0},\ \ldots,\ \left(\operatorname{diag}(c - v_{t_0})\, E_{20}\right)^{Q-1} v_{t_0} \,\right] \neq 0, \tag{20}$$

where $c$ is another $Q$-dimensional column vector $(c_1, \ldots, c_Q)^\top$ determined by the true $E_{10}$ function such that $c_q = \int_{\mathbb{R}^p} E_{10,q}(x)\, dF(x)$ for $q = 1, \ldots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically", in the sense that the set of $E_2$'s that forces the determinant in (20) to be constantly equal to 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0, 1]^{Q \times Q}$, and for those $E_2$'s for which the determinant is not constantly 0, the set of $v_{t_0}$ that forces it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0, 1]^Q$. Therefore, (20) holds for almost all $E_2$ and $v_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
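Condition (20) can be checked numerically by stacking the vectors $v_{t_0}, A v_{t_0}, \ldots, A^{Q-1} v_{t_0}$ with $A = \operatorname{diag}(c - v_{t_0}) E_{20}$ into a matrix and computing its determinant. The following is a minimal sketch; the values chosen for the interaction matrix, $c$, and $v_{t_0}$ are hypothetical, and the helper names are not from the paper.

```python
def mat_vec(a, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            return 0.0
        if p != k:
            a[k], a[p] = a[p], a[k]
            d = -d
        d *= a[k][k]
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return d

def krylov_det(E2, c, v):
    """det[v, A v, ..., A^(Q-1) v] with A = diag(c - v) E2, as in (20)."""
    q = len(v)
    A = [[(c[i] - v[i]) * E2[i][j] for j in range(q)] for i in range(q)]
    cols, x = [], list(v)
    for _ in range(q):
        cols.append(x)
        x = mat_vec(A, x)
    # Transpose so that the stacked vectors become the columns of the matrix.
    return det([[cols[j][i] for j in range(q)] for i in range(q)])

E2 = [[0.0, 1.0, 0.5], [0.8, 0.0, 0.3], [0.01, 0.0, 0.0]]  # hypothetical values
c = [0.40, 0.35, 0.25]                                      # hypothetical c
v = [0.12, 0.10, 0.08]                                      # hypothetical v_{t_0}

generic = krylov_det(E2, c, v)
degenerate = krylov_det([[0.0] * 3 for _ in range(3)], c, v)
print(generic != 0.0, degenerate == 0.0)
```

The all-zero interaction matrix is one of the measure-zero cases in which the determinant vanishes, whereas an arbitrary choice of entries typically gives a nonzero value.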

The "almost" identifiability for block networks guarantees that in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $E_1$ and $E_2$ from the NPSML asymptotically converge to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in previous sections is established.

Remark 2 (sparsity). Although complete identifiability is in general hard to achieve for both the general network and the block network, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2\, dF(x)\, dF(y)$ for the general network and/or the entry-wise square sum $\|E_2\|_2^2 = \sum_{i,j} (E_{2,ij})^2$ for the block network (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique: as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space, and by compactness and convexity there always exists a unique $E$ (or $E_2$) whose $L^2$-distance to the origin reaches the minimum.
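In symbols, the penalized criterion of Remark 2 can be written as below. This is a sketch under two assumptions that are not in the text: the penalty weight $\lambda > 0$ is introduced here for illustration (the remark corresponds to $\lambda = 1$), and $L$ denotes the log-likelihood (6).

```latex
\hat{E}_{\lambda} \;=\; \arg\max_{E}\ \Bigl\{ L(\mathcal{O}_t, E) \;-\; \lambda\, \|E\|_2^2 \Bigr\},
\qquad
\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} \bigl(E(x, y)\bigr)^2 \, dF(x)\, dF(y).

% On the set of networks sharing the maximal likelihood, L is constant,
% so the penalized problem restricted to that set reduces to
\min_{E \in \mathcal{S}_{E_0, r(\cdot, t_0)}} \ \|E\|_2^2 .
```

Because $\|\cdot\|_2^2$ is strictly convex, two distinct minimizers $E_a \neq E_b$ in the convex set would give $\|(E_a + E_b)/2\|_2^2 < \tfrac{1}{2}\|E_a\|_2^2 + \tfrac{1}{2}\|E_b\|_2^2$, contradicting minimality; this is the uniqueness argument invoked in the remark.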

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in the previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes drawn purely at random from the unit cube $[0, 1)^2$, so these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

$$E(x, y) = 1 - \frac{\sqrt{\langle x - y, x - y \rangle}}{2}. \tag{21}$$

Example 2 (block network). Set $Q = 3$; the block membership function $E_1$ satisfies

$$E_1(x, y) = \begin{cases} (1, 0, 0) & \text{if } \frac{x + y}{2} < \frac{1}{3}, \\ (0, 1, 0) & \text{if } \frac{1}{3} \le \frac{x + y}{2} < \frac{2}{3}, \\ (0, 0, 1) & \text{else}. \end{cases} \tag{22}$$

The matrix $E_2$ is given as follows:

$$E_2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23}$$

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, with the infected nodes picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed until day 5, i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day, i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
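A simulation of this kind might be set up as in the sketch below. It rests on assumptions that should be read as such: the mean-field equation (2) is taken to have the form $\partial r / \partial t = (1 - r(x, t)) \int E(x, y) r(y, t)\, dF(y)$, which is suggested by the operator $\mathcal{M}_{1-r} \mathcal{K}_E$ in (19) but is not restated in this section; the edge function of Example 1 is read as one minus half the Euclidean distance; and the function names and the reduced node count are illustrative, not the authors' implementation. The sketch tracks the infection probability $r$ rather than sampling $\{0, 1\}$ states.

```python
import math
import random

def edge_weight(x, y):
    # Assumed reading of (21): one minus half the Euclidean distance.
    return 1.0 - math.hypot(x[0] - y[0], x[1] - y[1]) / 2.0

def simulate_flows(nodes, r0, dt=0.01, horizon=5.0, observe_every=1.0):
    """Euler steps of the assumed mean-field equation; returns daily snapshots of r."""
    n = len(nodes)
    E = [[edge_weight(a, b) for b in nodes] for a in nodes]
    r = list(r0)
    snapshots = [list(r)]                       # observation at t = 0
    steps_per_obs = round(observe_every / dt)
    for step in range(1, round(horizon / dt) + 1):
        # Empirical version of the integral of E(x, y) r(y, t) dF(y).
        pressure = [sum(E[i][j] * r[j] for j in range(n)) / n for i in range(n)]
        r = [r[i] + dt * (1.0 - r[i]) * pressure[i] for i in range(n)]
        if step % steps_per_obs == 0:
            snapshots.append(list(r))           # end-of-day observation
    return snapshots

random.seed(1)
n_nodes = 50                                    # fewer than the 200 nodes in the text, for speed
nodes = [(random.random(), random.random()) for _ in range(n_nodes)]
infected = set(random.sample(range(n_nodes), n_nodes * 3 // 10))  # 30% initially infected
r0 = [1.0 if i in infected else 0.0 for i in range(n_nodes)]

snapshots = simulate_flows(nodes, r0)
print([round(sum(s) / n_nodes, 3) for s in snapshots])
```

With $dt = 0.01$ and a 5-day horizon the loop takes 500 Euler steps and stores six snapshots, matching the six observation times $t = 0, 1, \ldots, 5$.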

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once, in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $G = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. In both panels the horizontal axis is the estimated edge weight, the vertical axis is the true edge weight, the points are labeled "Est vs True", and the reference line is $y = x$.

If all nodes are included in the computation of the NPSML estimator, there is in principle a 40,000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $M_1 = 20$ nodes from the uniform distribution, which are used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, the $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space, so $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$, so the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the real observed $r$ on 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on 20 sampled nodes into its values on the full node set.
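The rule of thumb $h = n^{-1/(k+1)}$ and the two rate conditions can be reproduced in a few lines. This is a minimal sketch; the function name `kernel_width` is illustrative and not from the paper.

```python
def kernel_width(n, k):
    """Rule-of-thumb kernel width h = n^(-1/(k+1)) for sample size n, data dimension k."""
    return n ** (-1.0 / (k + 1))

h1 = kernel_width(400, 4)   # edge function: 400 node pairs, dimension 2 * 2
h2 = kernel_width(200, 2)   # smoothing the observed r on 200 nodes
h3 = kernel_width(20, 2)    # extending the estimated r from 20 sampled nodes

def rate_terms(n, k):
    """Return (n * h^k, n * h^(k+2)); the first should grow and the second shrink with n."""
    h = kernel_width(n, k)
    return n * h ** k, n * h ** (k + 2)

print(round(h1, 4), round(h2, 4), round(h3, 4))
print(rate_terms(200, 2), rate_terms(20000, 2))
```

With this choice, $n h^k = n^{1/(k+1)}$ and $n h^{k+2} = n^{-1/(k+1)}$, so both conditions hold as $n$ grows.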

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$, $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
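The selection rule can be sketched as follows, assuming the "larger is better" form $\log L - (k/2) \log n$ of the Schwarz criterion and a parameter count of $M_1 Q + Q \cdot Q$ for a block network with $Q$ blocks (the count behind 20 × 3 + 3 × 3 = 69). The log-likelihood values and the effective sample size `n_obs` are made-up placeholders, not results from the paper.

```python
import math

def bic(log_lik, n_params, n_obs):
    """Schwarz criterion in its maximization form: log L - (k / 2) * log n."""
    return log_lik - 0.5 * n_params * math.log(n_obs)

def select_block_number(log_liks, m1, n_obs):
    """Pick the Q with maximal BIC; log_liks maps each candidate Q to a fitted log-likelihood."""
    scores = {q: bic(ll, m1 * q + q * q, n_obs) for q, ll in log_liks.items()}
    best_q = max(scores, key=scores.get)
    return best_q, scores

# Hypothetical fitted log-likelihoods for Q = 1, ..., 5.
log_liks = {1: -900.0, 2: -700.0, 3: -520.0, 4: -515.0, 5: -512.0}
best_q, scores = select_block_number(log_liks, m1=20, n_obs=1200)
print(best_q)
```

In this made-up example the likelihood gain beyond three blocks is too small to offset the penalty for the extra parameters, so the maximal BIC is attained at three blocks.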

In Figure 1, we plot the NPSML estimated edge function against the real edge function on the set $G \times G$ of node pairs for both examples, where the horizontal axis represents the estimated edge weight on every node pair and the vertical axis represents the true weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; therefore, a red dot closer to the blue line means better fitting accuracy. Apparently, for most node pairs the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in G \times G$ with the null hypothesis $E_{xy} := (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples fewer than 10% of all 10,000 evaluation pairs in $G \times G$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

Table 1: Estimation accuracy of $E_2$.

Entries       Bias      Std      P value
E_{2,11}      0.021     0.032    0.468
E_{2,12}     -0.006     0.012    0.383
E_{2,13}     -0.003     0.029    0.057
E_{2,21}     -0.001     0.029    0.028
E_{2,22}      0.022     0.022    0.66
E_{2,23}     -0.002     0.028    0.059
E_{2,31}      0.005     0.024    0.165
E_{2,32}      0.018     0.029    0.48
E_{2,33}      0.016     0.021    0.554

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23). The first column presents the estimation bias, and the second and third columns are the empirical standard deviation and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is high.
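The per-pair $\chi^2$ test used for the node pairs in $G \times G$ can be sketched with the standard library alone, using the identity $P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})$. The function names and the numeric inputs below are illustrative; in particular, the standard deviation `sigma` stands in for a bootstrap estimate that is not computed here.

```python
import math

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def rejects_equality(e_hat, e_true, sigma, level=0.90):
    """True if the squared error, scaled by the variance sigma**2, rejects the null at `level`."""
    stat = (e_hat - e_true) ** 2 / sigma ** 2
    return chi2_1_pvalue(stat) < 1.0 - level

def rejection_ratio(estimates, truths, sigmas, level=0.90):
    """Share of node pairs whose estimate is significantly different from the truth."""
    flags = [rejects_equality(e, t, s, level)
             for e, t, s in zip(estimates, truths, sigmas)]
    return sum(flags) / len(flags)

# Hypothetical estimates on four node pairs, each with bootstrap standard deviation 0.02.
ratio = rejection_ratio([0.51, 0.55, 0.30, 0.71], [0.50, 0.50, 0.31, 0.70], [0.02] * 4)
print(ratio)
```

The rejection ratios quoted in the text (below 10% and below 6%) are shares of this kind, taken over all evaluation pairs.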

As a robustness check, we also consider synthetic data generated with different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of the NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. As for the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered a bit for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that for the general network there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the well-known "Unite the Right rally". The "Unite the Right rally", also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread at least once any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, our collected data do not allow us to track how every single message spreads from user to user; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
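The conversion from reaction times to day-by-day distribution flows can be sketched as follows. The input layout (one first-reaction day per user, counted from the origin, with `None` for users who never react within the window) and the function name are assumptions made for illustration, not the format of the collected data set.

```python
def distribution_flows(first_reaction_day, n_days):
    """Build the {0, 1}-valued flow vectors for days 0, 1, ..., n_days.

    first_reaction_day[i] is the day index (0 = origin) on which user i first
    reacted, or None if user i did not react within the window.
    """
    return [[1 if d is not None and d <= t else 0 for d in first_reaction_day]
            for t in range(n_days + 1)]

# Four hypothetical users observed over three days after the origin.
flows = distribution_flows([0, 2, None, 1], n_days=3)
for t, vec in enumerate(flows):
    print(t, vec)
```

Each vector is non-decreasing in $t$ coordinate by coordinate, since a user who has joined the process by day $t$ has also joined by every later day.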

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computation burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is the average over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

As a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computation burden, we take an $M_1$ much smaller than the number of all users in the record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to the choice of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then used as the feature vectors. The feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.

The choice of the best block number is again based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.

Figure 2: BIC for different block numbers. The vertical axis is the BIC (ticks from −10350 to −10000); the horizontal axis shows block_dim = 3, block_dim = 4, and block_dim = 5.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic, in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10,000+ users. Because the clustering is carried out on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information lost when the higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a); its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat     Lon
Big name community           14747.39    1238.35   1494.94         30.78   -89.99
Famous active community      5356.41     259.67    1373.72         34.18   -117.59
Famous inactive community    5001.97     35.19     1022.22         40.75   -82.55
Nobody community             216.58      37.70     1135.93         46.77   -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) the big name community, within which users are more likely to have a giant group of followers and friends and are also highly active on Twitter; (2) the nobody community, within which users have a fairly small number of followers and friends compared to the other three communities and are not particularly active in terms of history posts either; (3) the famous inactive community, in which users have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group) who are less likely to interact with others on Twitter and are therefore not active; (4) the famous active community, in which users also have many followers but, unlike the inactive community, the average numbers of friends and history posts are large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the numbers of followers and friends are distributed highly unevenly within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, in which the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects the important characteristic of social media that every user has the same right to express their own opinion, whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern among the four communities (the segregation can be better visualized in Figure 4); future studies are needed to better understand the source of this spatial segregation.

Figure 3: Spatial distribution of features of users within different communities. (a) Spatial distribution of the number of followers within different communities. (b) Spatial distribution of the number of friends within different communities. (c) Spatial distribution of history posts within different communities. Each panel is a map (scale bar 0 to 2000 km) with a community legend (famous inactive, famous active, big name, nobody) and a five-class size legend for the plotted feature.

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. Apparently, a significant hierarchical structure can be inferred from the link matrix: the big name community dominates all the other communities in terms of their sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities, and receive nothing from its insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Table 3: Link matrix of 4 communities.

                               From big name   From famous active   From famous inactive   From nobody
                               community       community            community              community
To big name community          1               1                    1                      1
To famous active community     1               1                    0.701                  0.637
To famous inactive community   0.175           0.365                0                      0
To nobody community            0               0                    0                      0.01

Figure 4: Estimate for interaction matrix. Map (scale bar 0 to 2000 km) with a community legend (famous inactive, famous active, big name, nobody) and a link weight legend with classes 0.01 to 0.02, 0.02 to 0.17, 0.17 to 0.36, 0.36 to 0.70, and 0.70 to 1.00.

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to do more text mining work on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and reestimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes, namely, that the specific spreading trajectory does not exist and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for spreading processes that cannot be repeatedly observed the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, the consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure; its practical usefulness is illustrated by a real data application, in which the spreading process of tweet articles regarding the "Unite the Right rally" event is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.

There are some limitations of the current study. First, we only show that the fast algorithm is efficient in lifting the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance the estimation accuracy and the computation cost is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and is not adjustable at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions would be very meaningful in both theoretical and practical respects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the tensorflow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems. Application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of Twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.



$$
\begin{aligned}
B &= \operatorname{cov}\bigl(\Delta(j,t),\ (r(x_j,t') - r(x_j,t)) \cdot E(x_j, x_{l(j,t')}) \cdot r(x_{l(j,t')}, t') \mid r(x_j,t_i)\bigr) \\
&= \operatorname{cov}\bigl(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\bigr) \cdot E\bigl(E(x_j, x_{l(j,t)})\, r(x_{l(j,t)}, t)\bigr) \cdot E\bigl(E(x_j, x_{l(j,t')})\, r(x_{l(j,t')}, t')\bigr) \\
&\le \operatorname{cov}\bigl(1 - r(x_j,t),\ r(x_j,t') - r(x_j,t) \mid r(x_j,t_i)\bigr) \\
&\le E\bigl(\lvert r(x_j,t') - r(x_j,t) \rvert \mid r(x_j,t_i)\bigr) \\
&\le \left\lVert \frac{dr(x_j,\cdot)}{dt} \right\rVert_\infty (t - t_i) \le (t_{i+1} - t_i),
\end{aligned}
\tag{14}
$$

where $\lVert \cdot \rVert_\infty$ is the $L^\infty$ norm of a bounded function. The above inequality follows directly from the facts that $r$ is bounded by 1 and that its temporal derivative, given by (1), is also uniformly bounded by 1; statement (ii) then follows immediately.

Using Properties (i) and (ii) and the law of large numbers, it is straightforward that the difference between the likelihood function constructed from (9) and that constructed from (13) is bounded by a constant multiple of $\Delta t$ as the number of nodes $M \to \infty$. If we further require $\Delta t \to 0$ along with $M \to \infty$, the two calculations of the likelihood function would be asymptotically identical, which leads to the same estimator of the hidden network.

Also notice that, by the fast algorithm, the choice of $dt$ is independent of the estimation accuracy, so in practice it can be selected directly as $t_{i+1} - t_i$ to increase the speed.

3.5. Block Network. The NPSML algorithm constructed in the previous section can be further extended to make inference for the block network model. As in many applications [33, 38, 39], the existence of a connection between two agents is only relevant to the groups they belong to, and the features of agents only affect which group they are assigned to. Without loss of generality, the set of $Q$ groups can be considered as a partition of the set of all nodes; then the edge function can be decomposed into two components:

(i) the group weight function $E_1 : \mathbb{R}^p \to [0,1]^Q$;

(ii) the group-level edge weight $E_2$, which is a $Q \times Q$ matrix with each entry valued in $[0,1]$.

The edge function $E$ for the block network model can be recovered from (i) and (ii) as follows:

$$E(x,y) = E_1(x)^\top E_2 E_1(y), \tag{15}$$

where the image of $E_1$ is viewed as a $Q$-dimensional column vector and the superscript $\top$ represents vector transpose. The group weight function is required to satisfy that, for every $x$ with $E_1(x) = (s_1, \ldots, s_Q)$, there exists only one $i \in \{1, \ldots, Q\}$ with $s_i > 0$, which means that every node can have positive probability of belonging to at most one group; this guarantees the requirement that the groups constitute a partition of the node set.
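As a minimal sketch of the decomposition in (15), the snippet below evaluates a block edge weight from a group weight function and a group-level matrix. The hard-assignment rule in `group_weight`, the matrix `E2_DEMO`, and all function names are illustrative assumptions, not the authors' code.

```python
def group_weight(node):
    """Hypothetical E1: assign a 2-D node to exactly one of Q = 3 groups."""
    mean_coord = (node[0] + node[1]) / 2.0
    if mean_coord < 1.0 / 3.0:
        return [1.0, 0.0, 0.0]
    if mean_coord < 2.0 / 3.0:
        return [0.0, 1.0, 0.0]
    return [0.0, 0.0, 1.0]


def block_edge(x, y, e1, e2):
    """Evaluate E(x, y) = E1(x)^T E2 E1(y) for a group-level matrix e2."""
    sx, sy = e1(x), e1(y)
    return sum(sx[a] * e2[a][b] * sy[b]
               for a in range(len(sx)) for b in range(len(sy)))


# Illustrative 3 x 3 group-level weights (made-up values in [0, 1]).
E2_DEMO = [[0.0, 1.0, 0.5],
           [0.8, 0.0, 0.3],
           [0.1, 0.0, 0.0]]

# The first node falls in group 0 and the second in group 1,
# so the edge weight is the (0, 1) entry of E2_DEMO.
weight = block_edge((0.1, 0.2), (0.5, 0.4), group_weight, E2_DEMO)
print(weight)
```

Because each node has exactly one positive group weight here, the double sum collapses to a single entry of the group-level matrix, which is the partition property required above.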

The estimation of a block network is equivalent to the estimation of (1) the group weight function $E_1$, which is unknown and constitutes the fully nonparametric component of the network, and (2) the interaction matrix $E_2$, which is the parametric component of the network. So the estimation is essentially semiparametric. The six-step algorithm discussed in Section 3.3 and the fast algorithm in Section 3.4 are still applicable to that case. The only modification is to Step 3, where the kernel smoothing method is no longer applied to the unknown edge weight $E$. Instead, it is applied to generate the estimate of the group weight $E_1$. Then the hidden weight function $E$ is constructed from the kernel-smoothed $E_1$ and the given interaction matrix $E_2$ in the way of (15).

The block network model has many advantages. For instance, when the number of groups involved is small and does not depend on the number of nodes, the number of parameters to solve is only $M_1 Q + Q^2$, while the number is $M_2$ when there is no block structure at all. To generate a good approximation to the true edge function, $M_2$ has to increase along with $M_1^2$ (although slowly); when the number of observed nodes is giant, $M_1$ has to be large as well, and then $M_2 \gg M_1 Q + Q^2$. Through the block network, we can sharply reduce the dimension of the parameter space when solving the maximum likelihood problem, which can significantly lift the computation efficiency.
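The dimension reduction can be illustrated with a short calculation. The sketch below assumes that the node pairs of the fully nonparametric network are the product of the sampled nodes with themselves, so that the pair count equals the square of the sampled node count; the function names are hypothetical.

```python
def n_params_block(m1, q):
    """Block network: M1 * Q group weights plus Q * Q matrix entries."""
    return m1 * q + q * q


def n_params_full(m1):
    """Fully nonparametric network, assuming M2 = M1 ** 2 node pairs."""
    return m1 * m1


# Compare the two parameter counts for a few sampled-node sizes with Q = 3.
for m1 in (20, 200, 2000):
    print(m1, n_params_block(m1, q=3), n_params_full(m1))
```

The block count grows linearly in the number of sampled nodes, while the fully nonparametric count grows quadratically under this assumption.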

In addition, a block network is much easier to identify than a general fully nonparametric network, as will be discussed in the next section. Finally, under a block network, the equilibrium infectious distribution of the spreading process has a clear analytic expression, as stated in the following proposition (the proof of Proposition 1 is quite trivial and hence omitted).

Proposition 1. Denote $E_{1,i}(x)$ as the projection of the vector $E_1(x)$ to its $i$th coordinate. Define $\mathcal{G}_i = \{x \in \mathbb{R}^p : E_{1,i}(x) > 0\}$, which consists of the set of nodes belonging to group $i$. Then, within a mean-field model of the form (2) with edge function $E$ given by (15), every equilibrium infection distribution $r(x)$ (i.e., satisfying $(1 - r(x)) \cdot \int_{\mathbb{R}^p} E(x,y)\, r(y)\, dF(y) \equiv 0$) must have the following form:

$$
r(x) =
\begin{cases}
0 & \text{if } x \in \mathcal{G}_i,\ \mathcal{P}_i (E_2)^n r(y, t_0) \equiv 0 \text{ for all } y,\ n > 0, \\
1 & \text{else},
\end{cases}
\tag{16}
$$

where $r(y, t_0)$ is the prescribed initial distribution of infectious status, $\mathcal{P}_i$ is the projection of a vector to its $i$th dimension, and $(E_2)^n$ denotes the $n$th power of the matrix $E_2$.

Proposition 1 is meaningful in the sense that it links the types of equilibrium infectious distribution with matrix algebra, facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E_2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$
\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \ddots & \vdots \\
0 & \cdots & x & \cdots & x \\
\vdots & \ddots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix}
\tag{17}
$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$
r(x) = 1 \text{ iff } x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i
\iff
r(x, t_0) > 0 \text{ iff } x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i.
\tag{18}
$$

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E_1, E_2)$ are identifiable. However, in real applications a spreading process can be observed at most a few times, and $m$ cannot be expected to be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable; i.e., there exist $E \neq E'$ that lead to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set; i.e., the set $\mathcal{S}_{E_0, r(\cdot, t_0)} := \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot, t_0)}$ varies along with the initial infectious status $r(\cdot, t_0)$. Formally, we have that $E \in \mathcal{S}_{E_0, r(\cdot, t_0)}$ if and only if the following holds for all $n = 1, 2, \ldots$:

$$
\bigl(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_E\bigr)^n r(\cdot, t_0) \equiv \bigl(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0}\bigr)^n r(\cdot, t_0),
\tag{19}
$$

where $\mathcal{K}_E$ is a bounded operator over the function space $L^2(\mathbb{R}^p)$ defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x,y)\, g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplicative operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the self-composition of an operator $n$ times.

Equation (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, then it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. On the other hand, if there exists an open set $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$, it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.
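The first non-identifiability case can be checked numerically on a discretized version of (19). The sketch below replaces the integral over the node distribution by an average over sampled nodes, which is an assumption of this illustration; the node count, the random edge weights, and all names are made up.

```python
import random


def apply_mk(edge, r0, g):
    """Apply the discretized operator M_{1 - r0} K_E to a vector g.

    K_E is approximated by an average over the N sampled nodes
    (empirical node distribution), an assumption of this sketch.
    """
    n = len(g)
    return [(1.0 - r0[i]) * sum(edge[i][j] * g[j] for j in range(n)) / n
            for i in range(n)]


def iterates(edge, r0, n_max):
    """Return (M K)^n r0 for n = 1..n_max, the quantities compared in (19)."""
    out, g = [], list(r0)
    for _ in range(n_max):
        g = apply_mk(edge, r0, g)
        out.append(g)
    return out


random.seed(0)
N = 30
# Nodes 0..4 play the role of U: they are fully infected at the initial time.
r0 = [1.0 if i < 5 else random.random() * 0.5 for i in range(N)]
E0 = [[random.random() for _ in range(N)] for _ in range(N)]
# Perturb E0 only on the band U x R^p, i.e. on the rows of the nodes in U.
E_alt = [[random.random() if i < 5 else E0[i][j] for j in range(N)]
         for i in range(N)]

same = iterates(E0, r0, 4) == iterates(E_alt, r0, 4)
print(same)
```

Rows belonging to fully infected nodes are multiplied by zero, so the two edge matrices generate identical iterates and cannot be told apart from the flows.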

It is still an open question what conditions added to $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. But in the special case of block networks, one simple identifiability condition can be figured out. In fact, for block networks, it is straightforward that $(E_{10}, E_{20})$ is identifiable (i.e., there does not exist a pair $(E_1, E_2)$ that differs from the true $(E_{10}, E_{20})$ but leads to the same likelihood function (6) in the limit case) if and only if, for the true $E_{20}$, the vector space spanned by the family of vectors $\{\mathbf{v}_t : t \ge t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{\mathbf{v}_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $\mathbf{v}_t = (v_{t,1}, \ldots, v_{t,Q})^\top$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \ldots, Q$, $v_{t,q} = \int_{\mathbb{R}^p} E_{10,q}(x)\, r(x,t)\, dF(x)$, where $E_{10,q}$ is the $q$th entry of $E_{10}(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$
\det\Bigl[\mathbf{v}_{t_0},\ \operatorname{diag}(c - \mathbf{v}_{t_0})\, E_{20}\, \mathbf{v}_{t_0},\ \ldots,\ \bigl(\operatorname{diag}(c - \mathbf{v}_{t_0})\, E_{20}\bigr)^{Q-1} \mathbf{v}_{t_0}\Bigr] \neq 0,
\tag{20}
$$

where $c$ is another $Q$-dimensional column vector $(c_1, \ldots, c_Q)^\top$ determined by the true $E_{10}$ function such that $c_q = \int_{\mathbb{R}^p} E_{10,q}(x)\, dF(x)$ for $q = 1, \ldots, Q$, and diag is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds "generically", in the sense that the set of $E_2$'s that forces the determinant in (20) to be constantly equal to 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0,1]^{Q \times Q}$, and, for those $E_2$'s for which the determinant is not constantly 0, the set of $\mathbf{v}_{t_0}$ that forces it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0,1]^Q$. Therefore, (20) holds for almost all $E_2$ and $\mathbf{v}_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
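A hedged sketch of how condition (20) could be checked numerically for given inputs is shown below. It builds the matrix whose columns are the initial group vector and its images under repeated application of the diagonal-scaled group matrix, then evaluates the determinant. The helper names and the demo values of `v0`, `c`, and `e2` are hypothetical and chosen only for illustration.

```python
def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a[i][j] * v[j] for j in range(len(v))) for i in range(len(a))]


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in m]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[p][c]) < 1e-14:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n):
                m[r][k] -= f * m[c][k]
    return d


def identifiability_det(v0, c, e2):
    """det[v0, A v0, ..., A^(Q-1) v0] with A = diag(c - v0) E2, as in (20)."""
    q = len(v0)
    a = [[(c[i] - v0[i]) * e2[i][j] for j in range(q)] for i in range(q)]
    cols, v = [], list(v0)
    for _ in range(q):
        cols.append(v)
        v = mat_vec(a, v)
    # cols holds the columns; transpose them into the rows of a Q x Q matrix.
    return det([[cols[j][i] for j in range(q)] for i in range(q)])


# Made-up inputs for Q = 3 groups.
v0 = [0.1, 0.2, 0.05]
c = [0.3, 0.4, 0.3]
e2 = [[0.0, 1.0, 0.5], [0.8, 0.0, 0.3], [0.1, 0.0, 0.0]]
print(identifiability_det(v0, c, e2) != 0.0)
```

In practice the determinant would be compared against a tolerance rather than exact zero, since the inputs are themselves estimates.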

The "almost" identifiability for block networks guarantees that, in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $E_1$ and $E_2$ from the NPSML asymptotically converge to their true values and pointwise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in previous sections is established.

Remark 2 (sparsity). In general, complete identifiability for both the general network and the block network is hard to achieve. However, following the idea in the network reconstruction literature (Shen et al. [40]), we can concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\lVert E \rVert_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x,y))^2\, dF(x)\, dF(y)$, for the general network and/or the entry-wise square sum for the block network, $\lVert E_2 \rVert_2^2 = \sum_{i,j} (E_{2,ij})^2$ (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) would then guarantee that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the function space; by compactness and convexity, there always exists a unique $E$ (or $E_2$) whose $L^2$-distance to the origin reaches the minimum.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated from simulation to test the effectiveness of the NPSML estimator designed in previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes, which are drawn purely randomly from the unit cube $[0,1)^2$; thus these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ decreases linearly with the standard Euclidean distance between two nodes, i.e.,

$$E(x,y) = 1 - \frac{\sqrt{\langle x - y,\ x - y \rangle}}{2}. \tag{21}$$

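Example 2 (block network). Set $Q = 3$. The block membership function $E_1$ satisfies

$$
E_1(x, y) =
\begin{cases}
(1, 0, 0) & \text{if } \dfrac{x + y}{2} < \dfrac{1}{3}, \\[4pt]
(0, 1, 0) & \text{if } \dfrac{1}{3} \le \dfrac{x + y}{2} < \dfrac{2}{3}, \\[4pt]
(0, 0, 1) & \text{else},
\end{cases}
\tag{22}
$$

where $(x, y)$ here denotes the two coordinates of a single node in $[0,1)^2$. The matrix $E_2$ is given as follows:

$$
E_2 =
\begin{pmatrix}
0 & 1 & 0.5 \\
0.8 & 0 & 0.3 \\
0.01 & 0 & 0
\end{pmatrix}.
\tag{23}
$$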

For both examples, the spreading process is initialized such that 30% of all nodes are infected at the very beginning, with the infected nodes randomly picked from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The observation of the distribution flows is supposed to be available only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.
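The data-generating step can be sketched as a simple Euler scheme. This is an illustration under stated assumptions rather than the authors' procedure: the mean-field form (rate of change equal to one minus the infection level times the average of edge weight times infection level) is inferred from the equilibrium condition stated in Proposition 1, the update is deterministic, the integral is replaced by an average over sampled nodes, and the edge function is one reading of (21). A smaller node set than the 200 nodes of the experiment is used to keep the demo fast.

```python
import math
import random


def edge_example(x, y):
    """Distance-decaying edge weight on the unit square (one reading of (21))."""
    return 1.0 - math.dist(x, y) / 2.0


def simulate(nodes, r0, edge, dt=0.01, horizon=5.0, observe_every=1.0):
    """Euler steps of dr/dt = (1 - r(x)) * mean_y[E(x, y) r(y)]; returns snapshots."""
    n = len(nodes)
    w = [[edge(nodes[i], nodes[j]) for j in range(n)] for i in range(n)]
    r = list(r0)
    steps = round(horizon / dt)
    stride = round(observe_every / dt)
    snapshots = [list(r)]
    for s in range(1, steps + 1):
        force = [sum(w[i][j] * r[j] for j in range(n)) / n for i in range(n)]
        r = [min(1.0, r[i] + dt * (1.0 - r[i]) * force[i]) for i in range(n)]
        if s % stride == 0:
            snapshots.append(list(r))
    return snapshots


random.seed(1)
nodes = [(random.random(), random.random()) for _ in range(40)]
# Roughly 30% of the nodes are infected at the initial time.
infected = set(random.sample(range(len(nodes)), k=round(0.3 * len(nodes))))
r0 = [1.0 if i in infected else 0.0 for i in range(len(nodes))]
flows = simulate(nodes, r0, edge_example, dt=0.01)
print(len(flows))  # one snapshot per observation time t = 0, 1, ..., 5
```

Only the daily snapshots are kept, mirroring the setting in which the fine-step process is unobserved between observation times.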

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

If all nodes are included in the computation of the NPSML estimator, there is in principle a 40000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network in Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network in Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $M_1 = 20$ nodes from the uniform distribution, which will be used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, the $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network in Example 2, as no node pairs are needed, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space; thus $k$ is 4. $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the observed $r$ on the 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.

Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. Axis labels: estimated edge weight, true edge weight; legend: Est. vs. True, y = x.
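The rule-of-thumb widths can be reproduced with a few lines. With this choice of width, the first consistency quantity equals the sample size raised to the power 1/(k+1), which diverges, and the second equals the sample size raised to the power -1/(k+1), which vanishes, so both requirements hold. The function name below is hypothetical.

```python
def rule_of_thumb_width(n, k):
    """Kernel width h = n ** (-1 / (k + 1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))


h1 = rule_of_thumb_width(400, 4)   # edge function: 400 node pairs, dimension 2 * 2
h2 = rule_of_thumb_width(200, 2)   # observed r on the 200 nodes
h3 = rule_of_thumb_width(20, 2)    # estimated r on the 20 sampled nodes
print(round(h1, 3), round(h2, 3), round(h3, 3))
```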

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$ and $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
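The selection of the block number can be sketched as follows, using Schwarz's criterion in its larger-is-better form (log-likelihood minus half the parameter count times the log of the sample size). The log-likelihood values, the effective sample size of 1200 (200 nodes times 6 observation times), and the function names are assumptions made for illustration; they are not results from the paper.

```python
import math


def bic_schwarz(log_lik, n_params, n_obs):
    """Schwarz's criterion, larger is better: log L - (k / 2) * log n."""
    return log_lik - 0.5 * n_params * math.log(n_obs)


def select_q(log_liks, m1, n_obs):
    """Pick the block number Q with maximal BIC.

    log_liks maps each candidate Q to its maximized log-likelihood;
    the parameter count of a block network is taken as M1 * Q + Q * Q.
    """
    scores = {q: bic_schwarz(ll, m1 * q + q * q, n_obs)
              for q, ll in log_liks.items()}
    return max(scores, key=scores.get), scores


# Made-up log-likelihood values, for illustration only.
demo_log_liks = {1: -900.0, 2: -700.0, 3: -450.0, 4: -440.0, 5: -435.0}
best_q, scores = select_q(demo_log_liks, m1=20, n_obs=1200)
print(best_q)
```

The penalty grows with the number of blocks, so a larger block number is chosen only when the likelihood gain outweighs the extra parameters.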

Table 1: Estimation accuracy of $E_2$.

Entry           Bias      Std      P value
$(E_2)_{11}$     0.021    0.032    0.468
$(E_2)_{12}$    -0.006    0.012    0.383
$(E_2)_{13}$    -0.003    0.029    0.057
$(E_2)_{21}$    -0.001    0.029    0.028
$(E_2)_{22}$     0.022    0.022    0.66
$(E_2)_{23}$    -0.002    0.028    0.059
$(E_2)_{31}$     0.005    0.024    0.165
$(E_2)_{32}$     0.018    0.029    0.48
$(E_2)_{33}$     0.016    0.021    0.554

In Figure 1, we plot the difference between the real edge function and the NPSML-estimated edge function on the set $G \times G$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; therefore, a red dot closer to the blue line means better fitting accuracy. For most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in G \times G$ with the null hypothesis $E_{xy} = (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that, in both examples, fewer than 10% of the 10,000 evaluation pairs in $G \times G$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.
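The pair-wise test can be illustrated with a short sketch. It assumes that the true weights, the estimated weights, and bootstrap variances are already available as equal-length sequences; the function names are hypothetical. The survival function of a $\chi^2$ variable with 1 degree of freedom is computed from the identity $P(\chi^2_1 > s) = \operatorname{erfc}(\sqrt{s/2})$.

```python
import math


def chi2_1_sf(stat: float) -> float:
    """P(X > stat) for X ~ chi-square with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def rejection_ratio(true_w, est_w, variances, level: float = 0.90) -> float:
    """Share of node pairs whose squared error is significant at the given level."""
    alpha = 1.0 - level
    rejected = 0
    for e, e_hat, var in zip(true_w, est_w, variances):
        stat = (e_hat - e) ** 2 / var  # test statistic E_xy / sigma_xy^2
        if chi2_1_sf(stat) < alpha:
            rejected += 1
    return rejected / len(true_w)
```

If the estimator is well calibrated, the rejection ratio at the 90% level should stay near or below 10%, which is the benchmark used in the text.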

For the block network of Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23). The first column presents the estimation bias, and the second and third columns give the empirical standard deviation and the empirical P-values of the estimates, from which we conclude that the fitting accuracy is high.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. The rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that for the general network there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in any aspect; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweets regarding the well-known "Unite the Right rally." The "Unite the Right rally," also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (covering Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the numbers of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, our collected data do not make it possible to track how every single message spreads from user to user; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
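The conversion from user-level reaction records to day-by-day distribution flows can be sketched as follows. The input layout (a mapping from user identifier to the date of the first reaction) and the function name are assumptions made for illustration; the actual data format is not described in the paper.

```python
from datetime import date


def build_distribution_flows(first_reaction: dict, users: list, origin: date, n_days: int) -> list:
    """Return one {0,1}-valued vector per day t = 0, ..., n_days - 1.

    Entry i of the vector for day t is 1 if and only if users[i] reacted
    at least once on or before day t (counted from the origin date).
    """
    flows = []
    for t in range(n_days):
        row = []
        for user in users:
            first = first_reaction.get(user)  # None if the user never reacted
            joined = first is not None and (first - origin).days <= t
            row.append(1 if joined else 0)
        flows.append(row)
    return flows


# Hypothetical records: user "a" reacts on the first day, "b" two days later, "c" never.
example = build_distribution_flows(
    {"a": date(2017, 8, 11), "b": date(2017, 8, 13)},
    users=["a", "b", "c"],
    origin=date(2017, 8, 11),
    n_days=3,
)
```

Because joining is defined cumulatively, every coordinate is nondecreasing in $t$: once a user has joined, the corresponding entry stays at 1.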

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for predicting other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is the average over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take a much smaller $M_1$ than the number of all users in the record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. Feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to different choices of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
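The selection of the $M_1$ representative nodes as K-means cluster centres can be sketched with a plain implementation of Lloyd's algorithm. This is an illustrative stand-in rather than the authors' code: the function names, the random initialisation, and the absence of feature scaling are all assumptions (in a 5-dimensional space mixing coordinates and follower counts, rescaling the features first may matter).

```python
import random


def squared_distance(a, b) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_centres(points, m1: int, n_iter: int = 50, seed: int = 0) -> list:
    """Return m1 cluster centres of `points` computed with Lloyd's algorithm."""
    rng = random.Random(seed)
    # Initialise the centres with m1 distinct data points chosen at random.
    centres = [list(p) for p in rng.sample(points, m1)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(m1)]
        for p in points:
            nearest = min(range(m1), key=lambda c: squared_distance(p, centres[c]))
            clusters[nearest].append(p)
        # Update step: move every centre to the mean of its members.
        new_centres = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centres.append(
                    [sum(p[d] for p in members) / len(members) for d in range(dim)]
                )
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break  # assignments no longer change
        centres = new_centres
    return centres
```

The returned centres play the role of the synthetic evaluation nodes: they need not coincide with any real user, which is the property discussed in connection with Figure 3.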

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we take a block network with 4 blocks as the final model for further analysis.

[Figure 2: BIC for different block numbers (block_dim = 3, 4, 5).]

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10,000+ users. Because the clustering is carried out on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information lost when the higher-dimensional cluster is projected onto a low-dimensional space, and this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities in our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of information spreading processes on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers    Friends    History posts    Lat      Lon
Big name community            14747.39    1238.35          1494.94    30.78     -89.99
Famous active community        5356.41     259.67          1373.72    34.18    -117.59
Famous inactive community      5001.97      35.19          1022.22    40.75     -82.55
Nobody community                216.58      37.70          1135.93    46.77    -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) the big name community, within which the users are more likely to have a giant group of followers and friends and, meanwhile, are highly active on Twitter; (2) the nobody community, within which users have a fairly small number of followers and friends compared to the other three communities and whose history posts do not indicate much activity either; (3) the famous inactive community, whose users have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group) who are less likely to interact with others on Twitter and therefore are not active; (4) the famous active community, whose users do have many followers but, unlike the inactive community, have a large average number of friends and history posts, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find that (1) for the numbers of followers and friends, the spatial distribution is highly uneven within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects the important characteristic of social media that every user has the same right to express an opinion, whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation is better visualized in Figure 4); future studies are needed to better understand the source of this spatial segregation.

[Figure 3: Spatial distribution of features of users within different communities. (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities.]

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A significant hierarchical structure can be read from the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Table 3: Link matrix of 4 communities.

                                From big name    From famous active    From famous inactive    From nobody
                                community        community             community               community
To big name community           1                1                     1                       1
To famous active community      1                1                     0.701                   0.637
To famous inactive community    0.175            0.365                 0                       0
To nobody community             0                0                     0                       0.01

[Figure 4: Estimate for interaction matrix.]

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining on the tweets involved in the spreading process, add the extracted information as a covariate to the spreading process, and re-estimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for those spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, where the spreading process of tweets regarding the event "Unite the Right rally" is studied and a block network is fitted. The fitting result shows that the Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.

There are some limitations of the current study. First, we only show that the fast algorithm is efficient in improving the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy and computation is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was developed by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems: application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of Twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.

algebra facilitating the qualitative analysis of the equilibrium distribution. For instance, when $E_2$ is an upper triangular matrix with all its lower off-diagonal entries being zero and all diagonal and upper off-diagonal entries being strictly positive, such as in (17),

$$
\begin{pmatrix}
x & x & x & \cdots & x \\
0 & x & x & \ddots & \vdots \\
0 & \cdots & x & \cdots & x \\
\vdots & \ddots & 0 & x & x \\
0 & \cdots & 0 & 0 & x
\end{pmatrix}
\tag{17}
$$

then the equilibrium distribution $r$ and the initial distribution $r(\cdot, t_0)$ satisfy the relation

$$
r(x) = 1 \ \text{iff}\ x \in \bigcup_{i=1}^{Q'} \mathcal{G}_i
\iff
r(x, t_0) > 0 \ \text{iff}\ x \in \bigcup_{i=Q'+1}^{Q} \mathcal{G}_i.
\tag{18}
$$

3.6. Validity of NPSML. Due to the nonparametric nature of the edge function $E$, its identifiability is tricky. When the spreading process can be observed multiple times ($m$ times) with random initializations and $m$ is large, as assumed in Roudi and Hertz [41] and Shen et al. [40], both the fully nonparametric network $E$ and the block network $(E_1, E_2)$ are identifiable. However, in real applications a spreading process can be observed at most a few times, so $m$ cannot be expected to be very large. In that case, the fully nonparametric edge function $E$ is no longer fully identifiable, i.e., there exists $E \neq E'$ that leads to the same likelihood function (6) in the limit case. However, it can be shown that $E$ is identifiable up to a compact convex set, i.e., the set $\mathcal{S}_{E_0, r(\cdot, t_0)} := \{E : L(\mathcal{O}_t, E) = L(\mathcal{O}_t, E_0)\}$ is a compact convex set within the function space $L^2(\mathbb{R}^p \times \mathbb{R}^p)$, where $E_0$ stands for the true value of the edge function. It can also be proved that the set $\mathcal{S}_{E_0, r(\cdot, t_0)}$ varies along with the initial infectious status $r(\cdot, t_0)$. Formally, we have $E \in \mathcal{S}_{E_0, r(\cdot, t_0)}$ if and only if the following holds for all $n = 1, 2, \ldots$:

$$
\left(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_E\right)^n r(\cdot, t_0) \equiv \left(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0}\right)^n r(\cdot, t_0),
\tag{19}
$$

where $\mathcal{K}_E$ is a bounded operator over the functional space $L^2(\mathbb{R}^p)$, defined through $E$ as $(\mathcal{K}_E g)(x) := \int_{\mathbb{R}^p} E(x, y) g(y)\, dF(y)$ for every $g \in L^2(\mathbb{R}^p)$, with $F$ being the default node distribution; $\mathcal{M}_f$ is the multiplicative operator determined by $f$ such that $(\mathcal{M}_f g)(x) = f(x) \cdot g(x)$; and the $n$th power in (19) represents the self-composition of an operator $n$ times. Condition (19) implies that the identifiability of the true edge function $E_0$ is limited by the extent of the ergodicity of the spreading process within the node space $\mathbb{R}^p$. For instance, when there exists a small open set $U \subset \mathbb{R}^p$ such that all nodes $x \in U$ are infected before the initial time $t_0$, i.e., $r(x, t_0) \equiv 1$ for all $x \in U$, it can be verified by (19) that all functions $E$ that deviate from $E_0$ only within the band set $U \times \mathbb{R}^p$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. On the other hand, if there exists an open $U' \subset \mathbb{R}^p$ such that $(\mathcal{M}_{1 - r(\cdot, t_0)} \mathcal{K}_{E_0})^n r(x, t_0) \equiv 0$ for all $x \in U'$ and all $n$, then all functions $E$ that deviate from $E_0$ only within $U' \times U'$ are contained in $\mathcal{S}_{E_0, r(\cdot, t_0)}$. In both cases, nodes in $U$ or $U'$ are not in the ergodic range of the spreading process; hence the transmission of their infectious status is not observable. For nodes in $U$, their infections occur ahead of the observation period and hence are not observable after the start of spreading, while for nodes in $U'$ it can be verified that they will never be infected over the entire spreading process. Therefore, the identifiability of $E_0$ is restricted by the experience of the spreading process, which is reasonable.
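The first non-identifiability case can be checked numerically on a finite node sample. The following is a minimal sketch, not the authors' implementation: it assumes that $\mathcal{K}_E$ is approximated by the matrix $E/N$ (empirical node distribution) and $\mathcal{M}_{1-r(\cdot,t_0)}$ by a diagonal matrix; the names `flow_iterates`, `E0`, `E_alt`, and `U` are hypothetical.

```python
import numpy as np

def flow_iterates(E, r0, n_max):
    """Return [(M_{1-r0} K_E)^n r0 for n = 1..n_max] on a finite node sample.

    K_E is approximated by E / N (empirical node distribution F) and
    M_{1-r0} by the diagonal matrix diag(1 - r0). Both are assumptions
    made for this illustration.
    """
    N = len(r0)
    op = np.diag(1.0 - r0) @ (E / N)
    out, g = [], r0.copy()
    for _ in range(n_max):
        g = op @ g
        out.append(g)
    return out

rng = np.random.default_rng(0)
N = 50
E0 = rng.uniform(size=(N, N))        # hypothetical "true" edge weights
r0 = rng.uniform(0.0, 0.5, size=N)   # initial infection probabilities
U = np.arange(10)                    # nodes already infected at t0
r0[U] = 1.0

E_alt = E0.copy()
E_alt[U, :] = rng.uniform(size=(len(U), N))  # deviate only on the band U x R^p

# Rows indexed by U are multiplied by 1 - r0 = 0, so both edge functions
# generate the same iterates in the discrete analogue of (19).
same = all(
    np.allclose(a, b)
    for a, b in zip(flow_iterates(E0, r0, 5), flow_iterates(E_alt, r0, 5))
)
print(same)
```

In this discretization the two edge matrices differ, yet every iterate coincides, which is the finite-sample counterpart of $E$ and $E_0$ lying in the same set $\mathcal{S}_{E_0, r(\cdot, t_0)}$.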

It is still an open question what conditions added to $E_0$ and/or $r(\cdot, t_0)$ can guarantee the identifiability of the fully nonparametric $E_0$. But in the special case of block networks, one simple identifiability condition can be worked out. In fact, for block networks, it is straightforward that $(E_{10}, E_{20})$ is identifiable if and only if there does not exist an $(E_1, E_2)$ pair that differs from the true $(E_{10}, E_{20})$ but leads to the same likelihood function (6) in the limit case, which holds if and only if, for the true $E_{20}$, the vector space spanned by the family of vectors $\{\mathbf{v}_t : t \ge t_0\}$ is the entire feature space $\mathbb{R}^Q$, i.e., $\{\mathbf{v}_t : t \ge t_0\}$ has full rank. Here $Q$ is the number of blocks, $\mathbf{v}_t = (v_{t,1}, \ldots, v_{t,Q})^\top$ is a $Q$-dimensional column vector for every $t$, and for each $q = 1, \ldots, Q$, $v_{t,q} = \int_{\mathbb{R}^p} E_{10,q}(x)\, r(x, t)\, dF(x)$, where $E_{10,q}$ is the $q$th entry of $E_{10}(x)$. To reach the full rank condition, the well-known Wronskian determinant [51] can be applied, leading to the following clean-form identifiability condition:

$$
\det\left[\mathbf{v}_{t_0},\ \operatorname{diag}(c - \mathbf{v}_{t_0}) E_{20} \mathbf{v}_{t_0},\ \ldots,\ \left(\operatorname{diag}(c - \mathbf{v}_{t_0}) E_{20}\right)^{Q-1} \mathbf{v}_{t_0}\right] \neq 0,
\tag{20}
$$

where $c$ is the other $Q$-dimensional column vector $(c_1, \ldots, c_Q)^\top$ determined by the true $E_{10}$ function such that $c_q = \int_{\mathbb{R}^p} E_{10,q}(x)\, dF(x)$ for $q = 1, \ldots, Q$, and $\operatorname{diag}$ is the operation that converts a $Q$-dimensional vector to a $Q \times Q$ matrix with its diagonal elements being the given vector. By the polynomial nature of the determinant function, it can be verified that (20) holds “generically” in the sense that the set of $E_2$'s that forces the determinant in (20) to be identically 0 is contained in a $(Q \times Q - 1)$-dimensional surface within $[0, 1]^{Q \times Q}$, and for those $E_2$'s for which the determinant is not identically 0, the set of $\mathbf{v}_{t_0}$ that forces it to be 0 is contained in a $(Q - 1)$-dimensional surface within $[0, 1]^Q$. Therefore, (20) holds for almost all $E_2$ and $\mathbf{v}_{t_0}$, except for some extreme cases that have measure 0 under the standard Lebesgue measure.
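Condition (20) is easy to evaluate once $\mathbf{v}_{t_0}$, $c$, and $E_2$ are given. The sketch below is illustrative only: the function name `identifiability_det` and the two-block inputs are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np

def identifiability_det(v0, c, E2):
    """Determinant of [v0, A v0, ..., A^(Q-1) v0] with A = diag(c - v0) @ E2."""
    v0 = np.asarray(v0, dtype=float)
    A = np.diag(np.asarray(c, dtype=float) - v0) @ np.asarray(E2, dtype=float)
    cols, col = [], v0
    for _ in range(len(v0)):
        cols.append(col)
        col = A @ col  # next column: one more application of A
    return float(np.linalg.det(np.column_stack(cols)))

# Hypothetical two-block inputs.
c = [1.0, 1.0]
v0 = [0.5, 0.5]
det_generic = identifiability_det(v0, c, [[0.0, 1.0], [0.0, 0.0]])
det_degenerate = identifiability_det(v0, c, np.eye(2))
print(det_generic, det_degenerate)
```

With the identity matrix as $E_2$ and equal entries in $c - \mathbf{v}_{t_0}$, the vector $\mathbf{v}_{t_0}$ is an eigenvector of $\operatorname{diag}(c - \mathbf{v}_{t_0}) E_2$, so the columns are collinear and the determinant vanishes; this is one of the measure-zero cases excluded by (20).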

The “almost” identifiability for block networks guarantees that in most cases, when the number of observed nodes is large and the distribution of observation times is dense, the estimated $E_1$ and $E_2$ from the NPSML asymptotically converge to their true values and point-wise follow multivariate normal distributions. This asymptotic result follows straightforwardly from Kristensen and Shin [29], Kukacka and Barunik [36], and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in previous sections is established.

Remark 2 (sparsity). Although in general the complete identifiability for both the general network and the block network is hard to achieve, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case where the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2\, dF(x)\, dF(y)$, for the general network and/or the entry-wise square sum of the block network, $\|E_2\|_2^2 = \sum_{i,j} (E_{2,ij})^2$ (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can consider the $L^2$ norm function as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) would then guarantee that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique because, as discussed in the previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the functional space; by compactness and convexity, there always exists a unique $E$ (or $E_2$) whose $L^2$-distance to the origin reaches the minimum.

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated from simulation to test the effectiveness of the NPSML estimator designed in previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set $\mathcal{N}$ consists of 200 nodes, which are drawn purely randomly from the unit cube $[0, 1)^2$; thus these nodes follow the uniform distribution. Consider the following model setup.

Example 1 (fully nonparametric network). Edge function $E$ is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

$$
E(x, y) = 1 - \frac{\sqrt{\langle x - y, x - y \rangle}}{2}.
\tag{21}
$$

Example 2 (block network). Set $Q = 3$; block membership function $E_1$ satisfies

$$
E_1(x, y) =
\begin{cases}
(1, 0, 0) & \text{if } \dfrac{x + y}{2} < \dfrac{1}{3}, \\
(0, 1, 0) & \text{if } \dfrac{1}{3} \le \dfrac{x + y}{2} < \dfrac{2}{3}, \\
(0, 0, 1) & \text{else}.
\end{cases}
\tag{22}
$$

Matrix $E_2$ is given as follows:

$$
E_2 =
\begin{pmatrix}
0 & 1 & 0.5 \\
0.8 & 0 & 0.3 \\
0.01 & 0 & 0
\end{pmatrix}.
\tag{23}
$$

For both examples, the spreading process is initialized so that 30% of all nodes are infected at the very beginning, and the infected nodes are randomly picked from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5, i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The observation of the distribution flows is supposed to be available only at the initial time and at the end of every day, i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initial infectious subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance under the situation that no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grids for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $\mathcal{G} = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

If all nodes are included in the computation of the NPSML estimator, there are in principle a 40000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $M_1 = 20$ nodes from the uniform distribution, which will be used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, the $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the selection of the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. The width $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is two times the dimension of the node space; thus $k$ is 4. The widths $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size for $h_2$ is 200 because it is used to turn the real observed $r$ on 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on 20 sampled nodes into its values on the full node set.

[Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. Axes: true edge weight and estimated edge weight; legend: Est vs True, y = x.]
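The rule-of-thumb kernel widths used in the synthetic experiments can be reproduced in a few lines. This is a small sketch assuming only the formula $h = n^{-1/(k+1)}$ stated in the text; the function name `rule_of_thumb_bandwidth` is hypothetical.

```python
def rule_of_thumb_bandwidth(n, k):
    """Kernel width h = n^(-1/(k+1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))

h1 = rule_of_thumb_bandwidth(400, 4)  # edge function: M2 = 400 pairs, k = 4
h2 = rule_of_thumb_bandwidth(200, 2)  # observed r on 200 nodes, k = 2
h3 = rule_of_thumb_bandwidth(20, 2)   # estimated r on M1 = 20 nodes, k = 2

for n, k in [(400, 4), (200, 2), (20, 2)]:
    h = rule_of_thumb_bandwidth(n, k)
    # n * h^k equals n^(1/(k+1)) and grows with n;
    # n * h^(k+2) equals n^(-1/(k+1)) and shrinks with n.
    print(n, k, round(h, 4), n * h**k, n * h ** (k + 2))
```

Under this choice, $n h^k = n^{1/(k+1)}$ and $n h^{k+2} = n^{-1/(k+1)}$, so both rate requirements are met as $n$ grows.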

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$, $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
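The selection of $Q$ by maximal BIC can be sketched as follows. The parameter count follows the $M_1 \times Q + Q \times Q$ pattern used above (e.g., 69 = 20 × 3 + 3 × 3); the BIC is written in the “larger is better” form, log-likelihood minus $(k/2)\log n$. The log-likelihood values, the observation count `n_obs`, and all function names are hypothetical and serve only as an illustration.

```python
import math

def block_param_count(m1, q):
    """Free parameters of a block network: E1 at m1 nodes (m1*q) plus E2 (q*q)."""
    return m1 * q + q * q

def bic(log_lik, n_params, n_obs):
    """BIC in the 'larger is better' form: log-likelihood minus (k/2) log n."""
    return log_lik - 0.5 * n_params * math.log(n_obs)

def select_q(log_liks, m1, n_obs):
    """Pick the Q with maximal BIC; log_liks maps Q to a maximized log-likelihood."""
    scores = {
        q: bic(ll, block_param_count(m1, q), n_obs) for q, ll in log_liks.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical log-likelihood values, for illustration only.
best_q, scores = select_q({2: -900.0, 3: -700.0, 4: -695.0}, m1=20, n_obs=1200)
print(best_q, scores)
```

In this toy input, moving from 3 to 4 blocks improves the log-likelihood only slightly, so the penalty on the 27 additional parameters dominates and the smaller model is retained.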

In Figure 1, we plot the difference between the real edge function and the NPSML estimated edge function on the set $\mathcal{G} \times \mathcal{G}$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs of (estimated weight, true weight), and the blue line sketches the identity function $y = x$; therefore, a red dot closer to the blue line means better fitting accuracy. Apparently, for most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in \mathcal{G} \times \mathcal{G}$ with the null hypothesis $E_{xy} := (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples fewer than 10% of all 10000 evaluation pairs in $\mathcal{G} \times \mathcal{G}$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

Table 1: Estimation accuracy of $E_2$.

Entries       Bias      Std      P value
$E_{2,11}$    0.021     0.032    0.468
$E_{2,12}$    -0.006    0.012    0.383
$E_{2,13}$    -0.003    0.029    0.057
$E_{2,21}$    -0.001    0.029    0.028
$E_{2,22}$    0.022     0.022    0.66
$E_{2,23}$    -0.002    0.028    0.059
$E_{2,31}$    0.005     0.024    0.165
$E_{2,32}$    0.018     0.029    0.48
$E_{2,33}$    0.016     0.021    0.554
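The pairwise $\chi^2$ test described above can be sketched with the standard library alone, using the fact that the upper critical value of a $\chi^2$ distribution with 1 degree of freedom is the square of a two-sided standard normal quantile. The estimates, true weights, and variances below are hypothetical, as are the function names.

```python
from statistics import NormalDist

def chi2_1_critical(level):
    """Critical value of chi-square(1): square of the two-sided normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return z * z

def rejection_ratio(est, true, var, level=0.90):
    """Share of node pairs whose statistic (est - true)^2 / var exceeds the critical value."""
    crit = chi2_1_critical(level)
    stats = [(e - t) ** 2 / v for e, t, v in zip(est, true, var)]
    return sum(s > crit for s in stats) / len(stats)

# Hypothetical estimates, true weights, and bootstrap variances.
est = [0.52, 0.80, 0.31, 0.10]
true = [0.50, 0.70, 0.30, 0.10]
var = [0.03 ** 2] * 4
crit = chi2_1_critical(0.90)
ratio = rejection_ratio(est, true, var)
print(crit, ratio)
```

Only the second pair, whose estimate is more than three standard deviations away from the true weight, is rejected in this toy input.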

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23): the first column presents the estimation bias, and the second and third columns are the empirical standard deviation and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is satisfactory.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weight, so we omit the plot of this result. As for the rejection ratio at the 90% confidence level of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that, for the general network, there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the famous event “Unite the Right rally.” The “Unite the Right rally,” also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug. 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug. 11 to Sep. 4, 2017) that records all Twitter user accounts in 20+ cities that spread at least once any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

Similar to most rumor spreading data, our collected data do not make it possible to track how every single message is spread from user to user; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we can define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug. 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
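The conversion from per-user reaction times to day-by-day distribution flows can be sketched as below. This is a minimal illustration under the assumption that each user's first reaction is stored as a day index relative to the origin; the function name `distribution_flows` and the toy input are hypothetical.

```python
import numpy as np

def distribution_flows(first_reaction_day, n_days):
    """Return an (n_days + 1, N) 0/1 array; entry [t, i] is 1 iff user i reacted by day t.

    first_reaction_day[i] is the day index (0 = origin) of user i's first
    reaction, or np.inf if the user never reacted in the collection window.
    """
    first = np.asarray(first_reaction_day, dtype=float)
    days = np.arange(n_days + 1)[:, None]
    return (first[None, :] <= days).astype(int)

# Four hypothetical users: reacted on day 0, day 2, never, and day 1.
flows = distribution_flows([0, 2, np.inf, 1], n_days=3)
print(flows)
```

Each row is one $\{0, 1\}$-valued vector of the flow, and every column is non-decreasing in $t$ because a user who has joined the process stays in it.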

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computation burden, we adopt the bootstrap method: we randomly pick 10000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is averaged over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computation burden, we take a much smaller $M_1$ than the number of all users in record (10000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. Feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to different choices of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the amounts of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of 100 clusters yielded from applying K-means clustering to the full set of 10000+ users. Because the clustering is taken on a 5-dimensional feature space, the location of every synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems to be anomalous, it does reflect the information loss when the higher-dimensional cluster is projected to a low-dimensional space; this part of lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user have to have the property that they are quite far away from each other geographically but highly analogous in the other dimensions of features, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the amount of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat     Lon
Big name community           1474739     123835    149494          30.78   -89.99
Famous active community      535641      25967     137372          34.18   -117.59
Famous inactive community    500197      3519      102222          40.75   -82.55
Nobody community             21658       3770      113593          46.77   -122.46

[Figure 2: BIC for different block numbers. Horizontal axis: block_dim = 3, 4, 5; vertical axis: BIC.]

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and meanwhile they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and their history posts are not very active either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small amount of history posts, so this group of users might be “stars” in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community: users in this community do have many followers, but, unlike the inactive community, their average number of friends and history posts is huge, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, it is found that (1) the spatial distribution of the amount of followers and friends is highly uneven within every community: there are only one or two synthetic users with extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest amount of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the amount of history posts is much more evenly distributed within all four communities, which reflects the important characteristic of social media that every user on it has the same right to express his or her own opinion, no matter whether or not he or she is famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation can be better visualized in Figure 4); to better understand the source of the spatial segregation, future studies are needed.

The link strength between different communities is pre-sented in Table 3 (the ldquoFromrdquo label in the column headerindicates that values in each column representing the impactstrength from the community in the column header to theother communities the ldquoTordquo label in the row name indicatesthat values in each row representing the impact strengthfrom the other communities to the community in the rowlabel) and visualized in Figure 4 Apparently a significanthierarchical structure can be concluded from the link matrixbig name community dominates all the other communitiesin terms of their sensitivity to social opinions followed bythe famous active community But compared to the famousactive community the big name community is more likelyto accept arguments sourced from the nobody and famousinactive community For famous inactive community theyonly read the tweets posted by members in the big nameand famous active communities and receive nothing from itsinsiders and users from nobody community this observation

Complexity 13

Communityfamous inactivefamous activebig nameno body

Followers788 - 140169140170 - 934467934468 - 46994374699438 - 1566563315665634 - 33245518

0 250 500 1000 1500 2000km

(a) Spatial distribution of followers number within different com-munities

Communityfamous inactivefamous activebig nameno body

Friends242 - 48184818 - 1243512435 - 2807228072 - 719499719499 - 3105962

0 250 500 1000 1500 2000km

(b) Spatial distribution of friend numbers within different commu-nities

2344 - 4935549355 - 133141133141 - 274841274841 - 514302514302 - 1006932

0 250 500 1000 1500 2000km

Communityfamous inactivefamous activebig nameno body

Post history

(c) Spatial distribution of history post within different communities

Figure 3 Spatial distribution of features of users within different communities

Table 3 Link matrix of 4 communities

From big namecommunity

From famousactive

community

From famousinactive

community

From nobodycommunity

To big namecommunity 1 1 1 1

To famous activecommunity 1 1 0701 0637

To famous inactivecommunity 0175 0365 0 0

To nobodycommunity 0 0 0 001

[Figure 4: Estimate for interaction matrix. Map of the links among the four communities (famous inactive, famous active, big name, nobody), with link weight binned as 0.01-0.02, 0.02-0.17, 0.17-0.36, 0.36-0.70, and 0.70-1.00; scale bar 0 to 2000 km.]

reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to do more text mining on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and re-estimate the hidden block network. Doing so requires a semiparametric extension of the network estimators in this paper; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. The approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable. For spreading processes that cannot be observed repeatedly, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for the block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, in which the spreading process of tweet articles regarding the "Unite the Right rally" event is studied and a block network is fitted. The fitting result shows that the Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active users, famous inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.

The current study has some limitations. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, balancing the estimation accuracy against the computation cost is tricky, and further studies are needed. Second, high-frequency observation may not be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observation interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint for obtaining identifiability, it may not always be reasonable. Therefore, further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the tensorflow deep neural network to classify mainland china visitor behaviours in hong kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of hurricane sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in atlanta and seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic block models for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME, International center for financial asset management and engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems: application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.

and Barunik [36] and the general properties of the maximum likelihood estimator. So the theoretical validity of the estimators developed in previous sections is established.

Remark 2 (sparsity). Although complete identifiability for both the general network and the block network is in general hard to achieve, we can follow the idea in the network reconstruction literature (Shen et al. [40]) and concentrate only on the case in which the hidden network is as sparse as possible, in the sense that the $L^2$ norm of the edge weight function, $\|E\|_2^2 = \int_{\mathbb{R}^p \times \mathbb{R}^p} (E(x, y))^2 \, dF(x) \, dF(y)$, for the general network and/or the entry-wise square sum of the block network, $\|E_2\|_2^2 = \sum_{i,j} (E_{2,ij})^2$ (this is the $L^2$ norm on the discrete set with cardinality $Q^2$), is as small as possible. To automate the selection of the sparsest network, we can treat the $L^2$ norm as a penalty and subtract it from the log-likelihood function (6); optimizing the penalized version of (6) then guarantees that the solution converges to the sparsest network. It is easily verified that such a sparse solution is always asymptotically unique: as discussed in previous paragraphs, all networks that lead to exactly the same log-likelihood function form a compact convex set in the functional space, and by compactness and convexity there always exists a unique $E$ (or $E_2$) whose $L^2$-distance to the origin is minimal.
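The penalty in Remark 2 can be illustrated for the block network with a minimal sketch. The function names and the weight `lam` are assumptions introduced here for illustration (the remark subtracts the norm directly, which corresponds to `lam = 1`), and the log-likelihood value is simply taken as given rather than computed from (6).

```python
def squared_l2_norm(E2):
    """Entry-wise square sum of a Q x Q block matrix given as nested lists."""
    return sum(w * w for row in E2 for w in row)


def penalized_loglik(loglik, E2, lam=1.0):
    """Log-likelihood minus the squared L2 penalty.

    `lam` is a hypothetical penalty weight, not a quantity from the paper.
    """
    return loglik - lam * squared_l2_norm(E2)


# Two hypothetical candidates that reach the same log-likelihood:
# the one with the smaller norm gets the larger penalized objective.
candidate_a = [[0.6, 0.4], [0.0, 0.0]]
candidate_b = [[1.0, 0.4], [0.0, 0.0]]
score_a = penalized_loglik(-10.0, candidate_a)
score_b = penalized_loglik(-10.0, candidate_b)
```

Among networks that the likelihood cannot distinguish, the penalized objective therefore prefers the one closest to the origin, which is the selection rule the remark describes.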

4. Numerical Experiment with Synthetic Data

Two synthetic data sets are generated by simulation to test the effectiveness of the NPSML estimator designed in previous sections, one for the fully nonparametric network and the other for the block network. For both examples, the node set N consists of 200 nodes drawn uniformly at random from the unit cube $[0, 1)^2$. Consider the following model setup.

Example 1 (fully nonparametric network). The edge function $E$ is negatively proportional to the standard Euclidean distance between two nodes, i.e.,

\[ E(x, y) = 1 - \frac{\sqrt{\langle x - y, x - y \rangle}}{2}. \tag{21} \]

Example 2 (block network). Set $Q = 3$; the block membership function $E_1$ satisfies

\[ E_1(x, y) = \begin{cases} (1, 0, 0) & \text{if } \frac{x + y}{2} < \frac{1}{3}, \\ (0, 1, 0) & \text{if } \frac{1}{3} \le \frac{x + y}{2} < \frac{2}{3}, \\ (0, 0, 1) & \text{else}. \end{cases} \tag{22} \]

Matrix $E_2$ is given as follows:

\[ E_2 = \begin{pmatrix} 0 & 1 & 0.5 \\ 0.8 & 0 & 0.3 \\ 0.01 & 0 & 0 \end{pmatrix}. \tag{23} \]
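The two synthetic setups can be sketched in a few lines. This is only an illustration under stated assumptions: (21) is read as one minus half the Euclidean distance, the weight between two nodes in the block network is taken to be the entry of $E_2$ indexed by (block of the first node, block of the second node), and all function names are hypothetical.

```python
import math
import random


def edge_full(x, y):
    """Example 1 edge weight, assuming (21) reads 1 - ||x - y|| / 2."""
    return 1.0 - math.dist(x, y) / 2.0


# Interaction matrix (23).
E2 = [[0.0, 1.0, 0.5],
      [0.8, 0.0, 0.3],
      [0.01, 0.0, 0.0]]


def membership(node):
    """Example 2 membership (22): one-hot block indicator of a node."""
    mean_coord = (node[0] + node[1]) / 2.0
    if mean_coord < 1.0 / 3.0:
        return (1, 0, 0)
    if mean_coord < 2.0 / 3.0:
        return (0, 1, 0)
    return (0, 0, 1)


def edge_block(x, y):
    """Block edge weight; the (row, column) indexing convention is assumed."""
    return E2[membership(x).index(1)][membership(y).index(1)]


# 200 nodes drawn uniformly from [0, 1)^2 with a fixed seed.
rng = random.Random(0)
nodes = [(rng.random(), rng.random()) for _ in range(200)]
```

Under this reading the fully nonparametric weight decreases smoothly with distance, while the block weight is piecewise constant over the three bands of the unit square.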

For both examples, the spreading process is initialized with 30% of all nodes infected at the very beginning, and the infected nodes are picked at random from the node set. The full spreading process is generated from a discrete version of (2) with a sufficiently small time step (e.g., $dt = 0.01$, which makes the resulting distribution flows a first-order approximation to the true flows); a coarse time step ($dt = 0.1$) is used for the estimation procedure (9) in order to test the robustness. The process is followed up until day 5; i.e., the time horizon in this simulation study is $[0, t)$ with $t = 5$. The distribution flows are assumed to be observable only at the initial time and at the end of every day; i.e., there are 6 chances to observe the distribution of infections, at $t = 0, 1, 2, 3, 4, 5$.

For the fully nonparametric Example 1, the spreading process is regenerated 100 times with 100 random initializations; this is necessary to address the identification issues pointed out in Section 3.6. For the 100 trials, both the node set and the initially infected subset are regenerated, although their distributions are held constant. For the block network Example 2, the spreading process is generated only once in order to evaluate the fitting performance when no repeated observation of the spreading process is available. For both examples, the estimated edge function is evaluated on a fixed set of grid points for easy comparison, where the grid set forms a lattice of the unit cube, i.e., $G = \{(0.1k, 0.1l) : k, l = 0, 1, \ldots, 10\}$.

If all nodes are included in the computation of the NPSML estimator, there are in principle a 40000 (= 200 × 200)-dimensional parameter space for the fully nonparametric network in Example 1 and a 609 (= 200 × 3 + 3 × 3)-dimensional parameter space for the block network in Example 2 to be searched, which is too time consuming. As noted in the introduction of the NPSML estimator, by the smoothness of the edge function, the number of nodes actually used to evaluate the edge function can be much smaller than the size of the entire node set. So, to reduce the computation load, we generate another $M_1 = 20$ nodes from the uniform distribution, which will be used in Step 3 (Section 3.3) for simulating the distribution function $r$. Accordingly, $M_2 = 400$ node pairs are selected as the product of the 20 nodes for the fully nonparametric Example 1; there are then 400 parameters to optimize in Example 1, a size that is quite reasonable for most nonparametric tasks. For the block network Example 2, as no node pairs are needed for block networks, there are only 69 (= 20 × 3 + 3 × 3) parameters to optimize. As for the kernel widths $h_1$, $h_2$, and $h_3$, we set $h_1 = 400^{-1/5}$, $h_2 = 200^{-1/3}$, and $h_3 = 20^{-1/3}$. This is because the kernel smoothing method requires the kernel width $h$ to satisfy $n h^k \to \infty$ and $n h^{k+2} \to 0$ in order to guarantee consistency and asymptotic normality [28, 29, 36, 52], where $n$ is the input sample size and $k$ is the dimension of the data. By a rule of thumb, we select the kernel width as $h = n^{-1/(k+1)}$. $h_1$ is only used in Example 1 to estimate the edge function, where the sample size is $M_2 = 400$ and the data dimension is twice the dimension of the node space; thus $k$ is 4. $h_2$ and $h_3$ are used in both examples for estimating the distribution function $r$; thus the data dimension $k$ is always 2. The sample size

[Figure 1: Fitting accuracy for networks in Examples 1 and 2. Panels: (a) fitting accuracy for fully nonparametric network; (b) fitting accuracy for block network. Axes: estimated edge weight and true edge weight; legend: "Est vs True" points and the line y = x.]

for $h_2$ is 200 because it is used to turn the observed $r$ on the 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20 because it turns the estimated $r$ on the 20 sampled nodes into its values on the full node set.
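The rule of thumb for the kernel width is easy to reproduce. The sketch below, with a hypothetical helper name, evaluates $h = n^{-1/(k+1)}$ for the three sample sizes and dimensions quoted above; it does not implement the kernel smoother itself.

```python
def rule_of_thumb_bandwidth(n, k):
    """Kernel width h = n ** (-1 / (k + 1)) for sample size n and data dimension k."""
    return n ** (-1.0 / (k + 1))


h1 = rule_of_thumb_bandwidth(400, 4)  # edge function: 400 node pairs, dimension 4
h2 = rule_of_thumb_bandwidth(200, 2)  # observed r on the 200 nodes, dimension 2
h3 = rule_of_thumb_bandwidth(20, 2)   # estimated r on the 20 sampled nodes, dimension 2

# With this choice, n * h**k equals n ** (1 / (k + 1)), which grows with n,
# and n * h**(k + 2) equals n ** (-1 / (k + 1)), which shrinks with n.
```

The closing comment spells out why the rule satisfies both rate conditions required for consistency and asymptotic normality.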

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$, $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.
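The selection of $Q$ by maximal BIC can be sketched as follows. The criterion is written in the "larger is better" form $\log L - (d/2) \log n$; the parameter count $M_1 Q + Q^2$ follows the 69 (= 20 × 3 + 3 × 3) count used for Example 2. The log-likelihood values and the sample size `n_obs` below are made up for illustration and are not results from the paper.

```python
import math


def bic(loglik, n_params, n_obs):
    """Schwarz criterion in its larger-is-better form: log L - (d / 2) * log n."""
    return loglik - 0.5 * n_params * math.log(n_obs)


def block_param_count(m1, q):
    """Free parameters of a block network evaluated on m1 nodes with q blocks."""
    return m1 * q + q * q


def select_q(logliks, m1, n_obs):
    """Return the Q with maximal BIC; `logliks` maps Q to a maximized log-likelihood."""
    return max(
        logliks,
        key=lambda q: bic(logliks[q], block_param_count(m1, q), n_obs),
    )


# Hypothetical maximized log-likelihoods for Q = 1, ..., 5 (illustration only).
logliks = {1: -900.0, 2: -700.0, 3: -500.0, 4: -495.0, 5: -492.0}
best_q = select_q(logliks, m1=20, n_obs=1200)
```

In this toy input the likelihood gain flattens after $Q = 3$, so the dimension penalty makes the larger models lose.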

In Figure 1, we compare the real edge function with the NPSML-estimated edge function on the set $G \times G$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted along the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; therefore, a red dot closer to the blue line means better fitting accuracy. For most node pairs, the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in G \times G$ with the null hypothesis $E_{xy} = (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy} / \sigma_{xy}^2$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma_{xy}^2$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in

Table 1: Estimation accuracy of $E_2$.

Entries       Bias      Std      P value
$E_{2,11}$    0.021     0.032    0.468
$E_{2,12}$    -0.006    0.012    0.383
$E_{2,13}$    -0.003    0.029    0.057
$E_{2,21}$    -0.001    0.029    0.028
$E_{2,22}$    0.022     0.022    0.66
$E_{2,23}$    -0.002    0.028    0.059
$E_{2,31}$    0.005     0.024    0.165
$E_{2,32}$    0.018     0.029    0.48
$E_{2,33}$    0.016     0.021    0.554

both examples fewer than 10% of all 10000 evaluation pairs in $G \times G$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.
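The pairwise test can be sketched with the standard library alone, using the fact that the upper tail of a $\chi^2$ variable with 1 degree of freedom is $\mathrm{erfc}(\sqrt{s/2})$. The function names and the toy inputs are assumptions for illustration; in particular, the variances are taken as given here, whereas the text obtains them by bootstrap.

```python
import math


def chi2_df1_pvalue(stat):
    """Upper-tail probability P(X > stat) for a chi-square variable with 1 d.o.f."""
    return math.erfc(math.sqrt(stat / 2.0))


def rejection_ratio(estimates, truths, variances, level=0.90):
    """Share of node pairs whose squared error is significant at the given level."""
    rejected = 0
    for est, true, var in zip(estimates, truths, variances):
        stat = (est - true) ** 2 / var
        if chi2_df1_pvalue(stat) < 1.0 - level:
            rejected += 1
    return rejected / len(estimates)


# Toy input: the first pair is estimated exactly, the second is far off.
ratio = rejection_ratio([0.5, 0.9], [0.5, 0.5], [0.01, 0.01])
```

A well-calibrated estimator should give a ratio close to one minus the chosen level, which is the benchmark against which the reported rejection counts are read.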

For the block network Example 2, Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23): the first column presents the estimation bias, and the second and third columns give the empirical standard deviation and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is high.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10000, respectively, no significant difference can be detected in the estimation accuracy as measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. The rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical is lowered slightly for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that for the general network there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the well-known "Unite the Right rally" event. The "Unite the Right rally," also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred against the backdrop of the controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (covering Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are also collected.

As with most rumor spreading data, our collected data do not allow us to track how every single message spreads from user to user; thus there is no way to directly identify the interaction network among users. But it is possible to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes value 1 if and only if the $i$th user has reacted to the rally message at least once by $t$.
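The conversion from reaction times to day-by-day distribution flows can be sketched as below. The input layout (one first-reaction date per user, `None` for users who never reacted) and the function name are assumptions made for illustration; the actual data format is not described here.

```python
from datetime import date


def distribution_flow(first_reaction, origin, n_days):
    """Build the {0, 1}-valued flow vectors for days 0, 1, ..., n_days.

    first_reaction[i] is the date of user i's first reaction, or None if the
    user never reacted (hypothetical input layout).
    """
    flows = []
    for day in range(n_days + 1):
        flows.append([
            1 if d is not None and (d - origin).days <= day else 0
            for d in first_reaction
        ])
    return flows


origin = date(2017, 8, 11)
first_reaction = [date(2017, 8, 11), date(2017, 8, 13), None]
flows = distribution_flow(first_reaction, origin, n_days=3)
# flows[0] == [1, 0, 0] and flows[2] == [1, 1, 0]
```

Each coordinate is nondecreasing in time by construction, since a user who has reacted once stays counted on every later day.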

For such a distribution flow data set, we are interested in making inference about features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100000+ users in our record, and it is likely that many users belong to the same latent group so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relation, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computation burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is averaged over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.
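The subsampling and the entry-wise averaging of the interaction matrix can be sketched as follows. This is only an illustration under stated assumptions: the helper names `draw_subsamples` and `average_matrices` are hypothetical, sampling is taken without replacement within each subsample (the text does not specify this), the per-subsample estimates are made-up numbers, and the blocks are assumed to be labelled consistently across subsamples.

```python
import random

def draw_subsamples(n_users, size, repeats, seed=0):
    """Draw `repeats` index sets of `size` distinct users each."""
    rng = random.Random(seed)
    return [rng.sample(range(n_users), size) for _ in range(repeats)]

def average_matrices(mats):
    """Entry-wise mean of equally shaped matrices (lists of rows)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]

# Small hypothetical sizes for illustration (the study uses 10,000 users, 10 times).
subsamples = draw_subsamples(n_users=50, size=10, repeats=3)

# Made-up per-subsample estimates of a 2 x 2 interaction matrix.
e2_estimates = [[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]]
e2_aggregated = average_matrices(e2_estimates)
```

The maximum a posteriori aggregation of $E_1$ is not sketched here, since its details are not given in this section.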

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computation burden, we take an $M_1$ much smaller than the number of all users in record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then taken as the feature vectors. Feature vectors selected in this way are distributed asymptotically in the same way within the feature space as the full sample of nodes. The preliminary result shows that the estimators are not sensitive to different choices of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
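The selection of representative nodes by K-means can be illustrated by a minimal sketch of plain Lloyd iterations. The function `kmeans_centres`, the random initialization, and the 2-dimensional toy data are assumptions made for illustration; the study clusters in the 5-dimensional feature space with $M_1$ clusters, and any feature scaling it may use is not described here.

```python
import random

def kmeans_centres(points, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns k cluster centres as lists of floats."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its group.
        new_centres = []
        for j, g in enumerate(groups):
            if g:
                new_centres.append([sum(col) / len(g) for col in zip(*g)])
            else:
                new_centres.append(centres[j])  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return centres

# Toy feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centres = kmeans_centres(points, k=2)
```

The returned centres play the role of the synthetic users: each one summarizes the real users assigned to its cluster.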

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.
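The selection rule can be sketched as follows, assuming the criterion is used in Schwarz's "larger is better" form, $\log L - (k/2)\log n$. The log-likelihoods, parameter counts, and sample size below are invented for illustration and are not the fitted values behind Figure 2; the function names are hypothetical.

```python
import math

def schwarz_bic(log_likelihood, n_params, n_obs):
    """Schwarz's criterion, larger is better: log L - (k / 2) * log n."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

def best_block_number(fits, n_obs):
    """fits maps a block number Q to (maximized log-likelihood, free parameters)."""
    scores = {q: schwarz_bic(ll, k, n_obs) for q, (ll, k) in fits.items()}
    return max(scores, key=scores.get), scores

# Hypothetical fits for Q = 3, 4, 5 (illustrative numbers only).
fits = {3: (-120.0, 12), 4: (-100.0, 20), 5: (-98.0, 30)}
best_q, scores = best_block_number(fits, n_obs=100)
```

In this toy case the penalty term outweighs the small likelihood gain from the fifth block, so the candidate with four blocks is selected.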

Figure 2: BIC for different block numbers. (Vertical axis: BIC; horizontal axis: block_dim = 3, 4, 5.)

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10,000+ users. Because the clustering is carried out in a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information loss that occurs when a higher-dimensional cluster is projected onto a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and of the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users’ locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.

Community                   Followers   Friends   History posts   Lat     Lon
Big name community          14747.39    1238.35   1494.94         30.78   -89.99
Famous active community     5356.41     259.67    1373.72         34.18   -117.59
Famous inactive community   5001.97     35.19     1022.22         40.75   -82.55
Nobody community            216.58      37.70     1135.93         46.77   -122.46

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community: users in this community are more likely to have a giant group of followers and friends, and they are highly active on Twitter; (2) nobody community: users in this community have a fairly small number of followers and friends compared to the other three communities, and their posting history is not very active either; (3) famous inactive community: users in this community have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be “stars” in some fields (large follower group), but they are less likely to interact with others on Twitter and are therefore not active; (4) famous active community: users in this community also have many followers, but, unlike in the famous inactive community, the average numbers of friends and history posts are large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following. (1) The spatial distribution of the numbers of followers and friends is highly uneven within every community: there are only one or two synthetic users with extremely large values. This uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities. (2) The number of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media: every user has the same right to express their own opinion, no matter whether or not they are famous or influential in real life. (3) Although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation can be better visualized in Figure 4); to better understand the source of the spatial segregation, future studies are needed.

The link strength between different communities is presented in Table 3 (the “From” label in the column header indicates that the values in each column represent the impact strength from the community in the column header to the other communities; the “To” label in the row name indicates that the values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A clear hierarchical structure can be read from the link matrix: the big name community dominates all the other communities in terms of its sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Figure 3: Spatial distribution of features of users within different communities. (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities. (Maps; legend: community (famous inactive, famous active, big name, nobody) and five size classes of followers, friends, and post history, respectively; scale bar 0 to 2000 km.)

Table 3: Link matrix of 4 communities.

                               From big name   From famous active   From famous inactive   From nobody
                               community       community            community              community
To big name community          1               1                    1                      1
To famous active community     1               1                    0.701                  0.637
To famous inactive community   0.175           0.365                0                      0
To nobody community            0               0                    0                      0.01

Figure 4: Estimate for interaction matrix. (Map; legend: community (famous inactive, famous active, big name, nobody) and five weight classes between 0.01 and 1.00; scale bar 0 to 2000 km.)

From the above analysis, quite a few interesting features can be drawn out of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining work on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and reestimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for those spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure; its practical usefulness is illustrated by a real data application, where the spreading process of tweet articles regarding the event “Unite the Right rally” is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in lifting the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy and computation is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed length of observational interval. In that case, the observation frequency is strictly controlled by the interval length and not stretchable at all; how to develop a reasonable algorithm for this case is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author through xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors’ Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang’Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 6, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and Its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and A. F. M. Smith, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 8, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.


[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.



Figure 1: Fitting accuracy for networks in Examples 1 and 2. (a) Fitting accuracy for fully nonparametric network. (b) Fitting accuracy for block network. (Axis labels: estimated edge weight and true edge weight; legend: Est vs True, $y = x$.)

for $h_2$ is 200, because it is used to turn the real observed $r$ on 200 nodes into its kernel-smoothed version, and the sample size for $h_3$ is 20, because it turns the estimated $r$ on 20 sampled nodes into its values on the full node set.

For the inference of the block network, the number of blocks $Q$ is usually not known a priori, so it is also a parameter to estimate. As $Q$ determines the model dimension, we adopt the classical Bayesian information criterion (BIC) introduced in Schwarz [53] to detect the correct model dimension. As defined in Schwarz [53], a greater BIC for a fitted model implies better explanatory power; therefore, the best choice of $Q$ corresponds to the maximal BIC. In practice, it is not possible to calculate the BIC value for all positive $Q$, so we follow the convention and only compute the BIC on a small set, $Q \in \{1, 2, 3, 4, 5\}$. The $Q$ associated with the maximal BIC and the corresponding estimates of $E_1$ and $E_2$ are selected as the final estimators and reported in the following. In our example, the correct $Q = 3$ is always achieved, so we omit this trivial result.

In Figure 1, we plot the difference between the real edge function and the NPSML estimated edge function on the set $\mathcal{G} \times \mathcal{G}$ of node pairs for both examples, where the horizontal axis represents the true value of the edge weight on every node pair and the vertical axis represents the estimated weight on the same node pair. To facilitate visualization, Figure 1 is sorted according to the horizontal axis in ascending order. The red dots represent the pairs (estimated weight, true weight), and the blue line sketches the identity function $y = x$; therefore, a red dot closer to the blue line means better fitting accuracy. Apparently, for most node pairs the difference is negligible. To further verify this visual judgement, a $\chi^2$ test is carried out for every node pair $(x, y) \in \mathcal{G} \times \mathcal{G}$ with the null hypothesis $E_{xy} = (\hat{E}(x, y) - E(x, y))^2 = 0$. Following the asymptotic normality of the NPSML estimator $\hat{E}$ at every $(x, y)$, the distribution of the test statistic $E_{xy}/\sigma^2_{xy}$ under the null hypothesis should be a $\chi^2$ distribution with 1 degree of freedom, where $\sigma^2_{xy}$ is the asymptotic variance of the estimator $\hat{E}(x, y)$, which can be calculated by the bootstrap method. We count the number of node pairs that fail to support the null hypothesis at the 90% confidence level; the result shows that in both examples less than 10% of all 10,000 evaluation pairs in $\mathcal{G} \times \mathcal{G}$ fail to support the null hypothesis. So our estimation accuracy is quite satisfactory, which agrees with the visualization in Figure 1.

Table 1: Estimation accuracy of $E_2$.

Entries      Bias     Std     P value
$E_{2,11}$   0.021    0.032   0.468
$E_{2,12}$   -0.006   0.012   0.383
$E_{2,13}$   -0.003   0.029   0.057
$E_{2,21}$   -0.001   0.029   0.028
$E_{2,22}$   0.022    0.022   0.66
$E_{2,23}$   -0.002   0.028   0.059
$E_{2,31}$   0.005    0.024   0.165
$E_{2,32}$   0.018    0.029   0.48
$E_{2,33}$   0.016    0.021   0.554
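The pair-wise $\chi^2$ test described above can be sketched as follows. The sketch assumes that the estimated weights, the true weights, and bootstrap variances are already available as flat lists over node pairs; the function names and the toy numbers are hypothetical. For 1 degree of freedom, the upper-tail probability of the $\chi^2$ distribution equals $\mathrm{erfc}(\sqrt{s/2})$, so only the standard library is needed.

```python
import math

def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def rejection_ratio(est, true, var, level=0.90):
    """Share of node pairs whose statistic (est - true)^2 / var rejects the null."""
    rejected = 0
    for e_hat, e, v in zip(est, true, var):
        stat = (e_hat - e) ** 2 / v
        if chi2_1_pvalue(stat) < 1.0 - level:
            rejected += 1
    return rejected / len(est)

# Toy example with two node pairs: a perfect fit and a clear miss.
ratio = rejection_ratio(est=[0.5, 0.9], true=[0.5, 0.5], var=[0.01, 0.01])
```

Under the null hypothesis, a test at the 90% level is expected to reject roughly 10% of the pairs by chance, which is the benchmark against which the reported rejection ratios can be read.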

For the block network (Example 2), Table 1 presents the entry-wise accuracy of the estimated $E_2$ relative to (23): the first column presents the estimation bias, and the second and third columns are the empirical standard deviation and the empirical P-values of the estimates, from which we can conclude that the fitting accuracy is good.

For a robustness check, we also consider synthetic data generated for different $dt \in \{0.01, 0.05, 0.1, 0.15, 0.2\}$ and the implementation of NPSML estimation on node samples with different sizes $M_1$ and $M_2$. When $M_1$ and $M_2$ are increased to 100 and 10,000, respectively, no significant difference can be detected in terms of the estimation accuracy measured by the entry-wise bias between the true and the estimated edge weights, so we omit the plot of this result. As for the rejection ratio, at the 90% confidence level, of the null hypothesis that the true and estimated edge weights are identical, this ratio is lowered a bit for the block network, to less than 6%, but no significant decrease can be detected for the general network example. This observation might be caused by the fact that for a general network there are many more free parameters to estimate, which reduces the convergence speed. As for the different $dt$, the variation of estimation accuracy is not significant in all aspects; this fact agrees with the discussion at the end of Section 3.4.

5. Experiment with Rumor Spreading on Twitter

To demonstrate the usefulness of the NPSML method in real-world applications, we carry out an experiment with the distribution flow data of a real rumor spreading process on Twitter. We collect a data set of tweet articles regarding the famous event “Unite the Right rally”. The “Unite the Right rally”, also known as the Charlottesville rally or Charlottesville riots, was a white supremacist rally that occurred in Charlottesville, Virginia, from August 11 to 12, 2017. The rally occurred amidst the backdrop of controversy generated by the removal of Confederate monuments throughout the country in response to the Charleston church shooting in 2015. The event turned violent after protesters clashed with counter-protesters, leaving over 30 injured. The rally also attracted wide attention on Twitter. Twitter users led vigilante campaigns on the platform to personally identify and denounce individual marchers in the rally; following the start of the campaign, many of the marchers were shamed and vilified by the social media community, with several of the rally attendees being dismissed from their jobs as a result of the campaign.

Although the rally originally occurred in Charlottesville, messages and/or comments related to it spread immediately through Twitter to users in many other places, including all major cities in the US, which inspired subsequent vigils and demonstrations in a number of cities across the country in the days following Aug 11 and 12, 2017. For this event, we collect a time series of user-level information (from Aug 11 to Sep 4, 2017) that records all Twitter user accounts in 20+ cities that spread, at least once, any message/comment related to the rally during the collection period. We also collect the reaction time of every user to relevant messages and user-specific information, such as the number of followers and friends that a user has and how many tweets the user has published in the past (history posts). In addition, the registration location of the Twitter account and its corresponding latitude and longitude are collected.

As with most rumor spreading data, our collected data do not make it possible to track how every single message spreads from user to user; thus, there is no way to directly identify the interaction network among users. It is possible, however, to generate the distribution flows of users who have joined the spreading process. Formally, we define at each time point $t$ that a user has joined the process if and only if, by $t$, he/she has reacted at least once to the messages/comments related to the rally. The data set can then be easily converted to day-by-day distribution flows, where at every time (day) $t$ since the origin (Aug 11, 2017) we have an $N$-dimensional $\{0, 1\}$-valued vector, with $N$ being the number of all users in the record. The $i$th coordinate takes the value 1 if and only if the $i$th user has reacted to the rally messages at least once by $t$.
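The conversion from per-user reaction times to a distribution flow can be sketched in a few lines of Python. This is only an illustration of the definition above, not the code used in the study: the function name `build_distribution_flow`, the user ids, and the dates are hypothetical, and we assume that only the date of each user's first rally-related reaction is needed.

```python
from datetime import date


def build_distribution_flow(first_reaction, origin, n_days):
    """Return the user order and one 0/1 vector per day t = 0, ..., n_days - 1.

    first_reaction maps a user id to the date of that user's first
    rally-related reaction (hypothetical input layout).
    """
    users = sorted(first_reaction)
    flow = []
    for t in range(n_days):
        # Coordinate i is 1 iff user i has reacted at least once by day t.
        snapshot = [
            1 if (first_reaction[u] - origin).days <= t else 0 for u in users
        ]
        flow.append(snapshot)
    return users, flow


# Toy input: made-up users and dates, for illustration only.
first_reaction = {
    "user_a": date(2017, 8, 11),
    "user_b": date(2017, 8, 13),
    "user_c": date(2017, 8, 12),
}
users, flow = build_distribution_flow(first_reaction, date(2017, 8, 11), 3)
for t, snapshot in enumerate(flow):
    print(t, snapshot)
```

Because a user who has joined never leaves, every coordinate of the flow is nondecreasing in $t$.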

For such a distribution flow data set, we are interested in making inference on the features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record, and it is likely that many users belong to the same latent group so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relations, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the numbers of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is the average over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.
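The subsample-and-aggregate step can be sketched as follows. This is a minimal illustration under several assumptions that are ours rather than the paper's: subsamples are drawn without replacement, block labels are already aligned across subsamples, and the maximum a posteriori rule for $E_1$ is read as "average the membership weights at an evaluation node, then pick the most probable block". All function names and toy numbers are hypothetical; the NPSML estimation itself is not shown.

```python
import random


def draw_subsamples(user_ids, size, repeats, seed=0):
    """Draw `repeats` random subsamples of `size` users (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(user_ids, size) for _ in range(repeats)]


def average_matrices(matrices):
    """Entry-wise mean of equally shaped matrices (aggregated E_2)."""
    k = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / k for j in range(cols)]
        for i in range(rows)
    ]


def map_block(weight_vectors):
    """Most probable block after averaging membership weights (assumed MAP rule)."""
    n_blocks = len(weight_vectors[0])
    mean = [
        sum(w[b] for w in weight_vectors) / len(weight_vectors)
        for b in range(n_blocks)
    ]
    return max(range(n_blocks), key=lambda b: mean[b])


# Toy numbers only: 10 subsamples of 100 users out of 1000 hypothetical ids.
subsamples = draw_subsamples(list(range(1000)), size=100, repeats=10)
# Two made-up 2x2 subsample estimates of the interaction matrix.
e2_hat = average_matrices([[[1.0, 0.6], [0.2, 0.0]], [[1.0, 0.8], [0.0, 0.0]]])
# Three made-up membership weight vectors for one evaluation node.
block = map_block([[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]])
```

In practice, block labels from different subsamples may be permuted, so a label-matching step would be needed before averaging; that step is omitted here.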

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only the $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take an $M_1$ much smaller than the number of all users in the record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to the choice of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. The feature vectors selected in this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
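The selection of representative nodes by K-means can be illustrated with a plain Lloyd iteration. The sketch below is not the implementation used in the study: the function name `kmeans_centres`, the random initialization, the fixed iteration count, and the toy 5-dimensional points are all assumptions made for illustration.

```python
import random


def kmeans_centres(points, k, iters=20, seed=0):
    """Plain Lloyd iteration; returns k cluster centres as lists of floats."""
    rng = random.Random(seed)
    # Initialise the centres with k distinct data points.
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centre (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return centres


# Toy 5-dimensional feature vectors forming two well-separated groups.
points = [[float(v)] * 5 for v in (0, 1, 2, 10, 11, 12)]
centres = kmeans_centres(points, k=2)
print(centres)
```

With real features on very different scales (coordinates versus follower counts), the result depends on how the features are scaled before clustering; the text does not state which scaling, if any, was applied.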

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.
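The selection rule can be sketched as below. Since the text maximizes the BIC, we assume the larger-is-better sign convention $\mathrm{BIC} = 2\log L - k\log n$, where $k$ is the number of free parameters and $n$ the number of observations; this convention, the function name `bic`, and all numbers in the example are hypothetical and are not the values behind Figure 2.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the larger-is-better convention (an assumption, see above)."""
    return 2.0 * log_likelihood - n_params * math.log(n_obs)


# Made-up (log-likelihood, number of free parameters) per candidate block number.
candidates = {3: (-120.0, 6), 4: (-100.0, 10), 5: (-98.0, 15)}
n_obs = 50  # made-up number of observations

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In this toy example, moving from 4 to 5 blocks improves the log-likelihood only slightly, so the penalty term dominates and 4 blocks are selected.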

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the numbers of followers, friends, and history posts are plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of the 100 clusters obtained by applying K-means clustering to the full set of 10,000+ users. Because the clustering is performed on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users may seem anomalous, it reflects the information loss that occurs when a higher-dimensional cluster is projected onto a lower-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and the real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the number of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of information spreading processes on Twitter.

Table 2: Mean features of 4 communities.

Community                    Followers   Friends   History posts   Lat     Lon
Big name community           14747.39    1238.35   1494.94         30.78   -89.99
Famous active community      5356.41     259.67    1373.72         34.18   -117.59
Famous inactive community    5001.97     35.19     1022.22         40.75   -82.55
Nobody community             216.58      37.70     1135.93         46.77   -122.46

Figure 2: BIC for different block numbers (vertical axis: BIC; horizontal axis: block_dim = 3, 4, 5).

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community, within which the users are more likely to have a giant group of followers and friends and, meanwhile, are highly active on Twitter; (2) nobody community, within which users have a fairly small number of followers and friends compared to the other three communities, and their posting history is not very active either; (3) famous inactive community, in which users have quite a lot of followers but only a few friends and a relatively small number of history posts, so this group of users might be "stars" in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community, in which users do have many followers but, unlike the inactive community, the average numbers of friends and history posts are large, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, we find the following: (1) the spatial distribution of the numbers of followers and friends is highly uneven within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest numbers of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the number of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media: every user has the same right to express their own opinion, no matter whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern among the four communities (the segregation is better visualized in Figure 4); further studies are needed to better understand the source of this spatial segregation.

The link strength between different communities is presented in Table 3 (the "From" label in a column header indicates that the values in that column represent the impact strength from the community in the column header to the other communities; the "To" label in a row name indicates that the values in that row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A clear hierarchical structure emerges from the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. Compared to the famous active community, however, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their own insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].

Figure 3: Spatial distribution of features of users within different communities (famous inactive, famous active, big name, nobody). (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities.

Table 3: Link matrix of 4 communities.

                                From big name   From famous active   From famous inactive   From nobody
                                community       community            community              community
To big name community           1               1                    1                      1
To famous active community      1               1                    0.701                  0.637
To famous inactive community    0.175           0.365                0                      0
To nobody community             0               0                    0                      0.01

Figure 4: Estimate for the interaction matrix (link weights between the famous inactive, famous active, big name, and nobody communities).
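The hierarchy described above can be read off the link matrix directly. The sketch below stores Table 3 as a nested list, assuming its entries read as 1, 0.701, 0.637, 0.175, 0.365, 0, and 0.01, and computes row and column sums. These sums are only a convenient summary introduced here for illustration; they are not a statistic used in the study.

```python
communities = ["big name", "famous active", "famous inactive", "nobody"]

# link[i][j]: impact strength from community j ("From") on community i ("To"),
# with entries as read from Table 3.
link = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.701, 0.637],
    [0.175, 0.365, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.01],
]

n = len(communities)
# Row sum: total impact a community receives (its sensitivity to others' opinions).
received = {communities[i]: sum(link[i]) for i in range(n)}
# Column sum: total impact a community exerts on all communities.
exerted = {communities[j]: sum(link[i][j] for i in range(n)) for j in range(n)}

for name in communities:
    print(f"{name:16s} received={received[name]:.3f} exerted={exerted[name]:.3f}")
```

Ranking the communities by the received total reproduces the order big name, famous active, famous inactive, nobody, which matches the hierarchy described in the text.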

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and re-estimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable; for spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application in which the spreading process of tweet articles regarding the "Unite the Right rally" is studied and a block network is fitted. The fitting result shows that the Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active users, famous inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.


The current study has some limitations. First, we only show that the fast algorithm is efficient in improving the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, balancing estimation accuracy against computation is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. In many other applications, however, the distribution flows are stored as a series of snapshots with a fixed observation interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint that leads to identifiability, it may not always be reasonable. Therefore, a further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical respects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems. Application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.


[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of Twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.


Page 11: Mining the Hidden Link Structure from Distribution Flows for a … · 2019. 7. 30. · and Barunik [ ], simulation is conducted on the level of random variable, while, in our case,

Complexity 11

all aspects this fact agrees with the discussion in the end ofSection 34

5 Experiment with Rumor Spreadingon Twitter

To demonstrate the usefulness of the NPSML method inreal-world applications we carry out an experiment with thedistribution flow data of a real rumor spreading process onTwitter We collect a data set of tweet articles with regardto the famous event ldquoUnite the Right rallyrdquo The ldquoUnite theRight rallyrdquo also known as the Charlottesville rally or Char-lottesville riots was a white supremacist rally that occurredin Charlottesville Virginia from August 11 to 12 2017 Therally occurred amidst the backdrop of controversy generatedby the removal of Confederate monuments throughout thecountry in response to the Charleston church shootingin 2015 The event turned violent after protesters clashedwith counter-protesters leaving over 30 injured The rallyalso attracted wide attentions on Twitter Twitter users ledvigilante campaigns on the platforms to personally identifyand denounce individual marchers in the rally following thestart of the campaignmany of themarchers were shamed andvilified by the social media community with several of therally attendees being dismissed from their jobs as a result ofthe campaign

Although the rally occurred in Charlottesville originallymessages andor comments related to it are immediatelyspread out through Twitter to users in many other placesincluding all major cities in US which inspired subsequentvigils and demonstrations in a number of cities across thecountry in the following days from Aug 11 and 12 2017 Tothis event we collect a time series of user level information(during the time from Aug 11 to Sep 4 2017) that recordedall Twitter user accounts in 20+ cities that spread at leastonce any messagecomment related to the rally during thecollection period We also collect the reaction time of everyuser to relevant messages and the user-specific informationsuch as the number of followers friends that an user has andhow many tweets the user has published in the past (historyposts) In addition the registration location of the Twitteraccount and its corresponding latitude and longitude are alsocollected

Similar to most rumor spreading data it is not possible totrack how every single message is spread from user to user byour collected data thus there is no way to directly identifythe interaction network among users But it is possible togenerate the distribution flows of users who have joined thespreading process Formally we can define at each time point119905 that a user has joined the process if and only if by 119905 heshehas at least reacted once to the messagescomments relatedto the rally then the data set can be easily converted to day-by-day distribution flows where at every time (day) 119905 sincethe origin (Aug 11 2017) we have an 119873-dimensional 0 1-valued vector with119873 being the number of all users in recordThe 119894th coordinate takes value 1 if and only if the 119894th user hasreacted to the rally-message at least once by 119905

For such a distribution flow data set, we are interested in making inference of features of the interaction network between users, because they are useful for making predictions for other spreading processes on Twitter regarding similar social events. To that end, we apply the NPSML method to estimate the hidden interaction network from the flow data. Since there are 100,000+ users in our record and it is likely that many users belong to the same latent group, so that their response pattern is similar to that of their common group members, it is more appropriate to assume that the interaction network behind our flow data is a block network and then apply the NPSML to the block network model discussed in Section 3.5.

To uncover the dependence of interaction links between users on their geographical features and/or friendship/followership relation, we embed the nodes (users) of the interaction network into a 5-dimensional feature space, with the coordinates representing the latitude and longitude of the account location and the number of friends, followers, and history posts, respectively. To reduce the computational burden, we adopt the bootstrap method: we randomly pick 10,000 users from the full set of users 10 times and estimate the block network on each of the subsamples. For every subsample, an estimator for the membership weight function $E_1$ and the interaction matrix $E_2$ can be derived. The aggregated estimator for the interaction matrix $E_2$ is averaged over all subsample estimators; for the block membership weight $E_1$, the aggregated estimator is derived by maximum a posteriori from the set of subsample estimators.
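The subsample-and-average step for the interaction matrix can be sketched as follows. This is only an illustration under stated assumptions: the helper names `subsample_indices` and `average_matrices` are hypothetical, the two 2x2 matrices are made-up stand-ins for subsample estimates of $E_2$, and block labels are assumed to be aligned across subsamples before averaging. The maximum a posteriori aggregation of $E_1$ is not shown.

```python
import random


def subsample_indices(n_users, size, repeats, seed=0):
    """Draw `repeats` random subsamples of `size` distinct user indices."""
    rng = random.Random(seed)
    return [rng.sample(range(n_users), size) for _ in range(repeats)]


def average_matrices(matrices):
    """Element-wise mean of equally sized square matrices (lists of lists)."""
    k = len(matrices[0])
    n = len(matrices)
    return [[sum(m[a][b] for m in matrices) / n for b in range(k)]
            for a in range(k)]


# Toy run: 10 subsamples of 5 users out of 50, and two made-up 2x2
# matrices standing in for subsample estimates of the interaction matrix.
samples = subsample_indices(n_users=50, size=5, repeats=10)
e2_hat = average_matrices([[[1.0, 0.5], [0.0, 0.2]],
                           [[1.0, 0.7], [0.2, 0.0]]])
print(len(samples), e2_hat)
```

In the setting of the text, `n_users` would be the full number of users, `size` would be 10,000, and `repeats` would be 10.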

For a robustness check, we select $dt \in \{0.01, 0.05, 0.1, 0.2\}$ to solve (9). As a block network is used, there is no need to draw the $M_2$ samples of node pairs; only $M_1$ sampled nodes are needed for evaluating $r$. To reduce the computational burden, we take a much smaller $M_1$ than the number of all users in record (10,000+) to approximate the membership weight function $E_1$ and the distribution function $r$. To check the robustness of our estimation with respect to different choices of $M_1$, we preliminarily run the estimation program on a set of different $M_1 \in \{50, 100, 200, 500\}$. The feature vectors of the $M_1$ nodes in each trial are selected by conducting a K-means clustering on the full sample with the number of clusters equal to $M_1$; the set of cluster centres is then selected as the feature vectors. Feature vectors selected this way for the $M_1$ nodes are distributed asymptotically in the same way within the feature space as those of the full sample of nodes. The preliminary result shows that the estimators are not sensitive to the choice of $dt$ and become stable when $M_1$ is greater than 50. Therefore, we fix $dt = 0.2$ and $M_1 = 100$; the 100 cluster centres are also used as the evaluation nodes for the estimated function $E_1$.
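Selecting the $M_1$ evaluation nodes as K-means cluster centres can be sketched with a plain Lloyd iteration. This is a minimal sketch, not the authors' implementation: the function name `kmeans_centres`, the random initialisation, and the toy 2-dimensional data are assumptions, and since the text does not state whether the features were rescaled before clustering, no rescaling is applied here.

```python
import random


def kmeans_centres(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns k cluster centres.

    points: list of equal-length feature tuples (for example latitude,
        longitude, friends, followers, history posts).
    """
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            groups[dists.index(min(dists))].append(p)
        # Update step: move each centre to the mean of its group.
        new_centres = []
        for c, g in zip(centres, groups):
            if g:
                new_centres.append([sum(col) / len(g) for col in zip(*g)])
            else:
                new_centres.append(c)  # keep the centre of an empty cluster
        if new_centres == centres:
            break
        centres = new_centres
    return centres


# Toy data: two well-separated groups in a 2-dimensional feature space.
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centres = sorted(kmeans_centres(pts, k=2))
print(centres)
```

In the setting of the text, `points` would hold the 5-dimensional feature vectors of all users and `k` would be $M_1 = 100$; the returned centres play the role of the synthetic users.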

The choice of the best block number is still based on maximization of the BIC value. We plot the BIC for the three cases in which the block number equals 3, 4, and 5 in Figure 2; the BIC reaches its maximum when the block number is 4, so we consider a block network with 4 blocks as the final model for further analysis.
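The selection rule can be illustrated with a short sketch. It assumes the "larger is better" form of the Schwarz criterion, $\mathrm{BIC} = \log L - \tfrac{k}{2}\log n$, which is consistent with choosing the block number that maximizes the BIC; the log-likelihoods, parameter counts, and sample size below are made up for illustration and are not the values behind Figure 2.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Schwarz criterion in the 'larger is better' form:
    log L - (n_params / 2) * log(n_obs)."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)


def select_block_number(fits, n_obs):
    """fits: {block_number: (log_likelihood, n_params)}.
    Returns the block number with the largest BIC and all scores."""
    scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
    return max(scores, key=scores.get), scores


# Made-up fits for block numbers 3, 4, and 5 (illustration only).
fits = {3: (-9900.0, 30), 4: (-9700.0, 44), 5: (-9690.0, 60)}
best, scores = select_block_number(fits, n_obs=1000)
print(best, scores)
```

With these toy numbers the gain in log-likelihood from 4 to 5 blocks does not offset the extra penalty, so 4 blocks are selected.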

Different visualizations of the block network are provided. Figure 3 sketches the geographic range of every block/community of the Twitter network; the amount of followers, friends, and history posts is plotted along with the locations of every user within every community in subfigures (a), (b), and (c), respectively. Note that the 100 users in Figure 3 are synthetic in the sense that their attributes are described by the centre vectors of 100 clusters yielded from applying K-means clustering to the full set of 10,000+ users. Because the clustering is taken on a 5-dimensional feature space, the location of a synthetic user may not lie exactly within a city in the US, nor around a group of neighboring cities. Although the deviation between synthetic users and real users seems anomalous, it reflects the information loss when a higher-dimensional cluster is projected to a low-dimensional space; this lost information can play a critical role in determining the community membership of both the synthetic and real users. To see this, consider the synthetic user represented by the largest green dot in Figure 3(a): its geographic location is obviously not close to any city or group of cities within our record. To be grouped into the same cluster by the K-means method, all real users corresponding to this synthetic user must be quite far away from each other geographically but highly similar in the other feature dimensions, such as the number of followers in this case. Consequently, the community membership of the giant green-dot user and the real users represented by it is not fully determined by geographic factors; it is more likely to depend on extra social factors, such as the amount of followers, which are not directly related to users' locations. This observation also justifies the necessity of including extra information in the analysis of the information spreading process on Twitter.

Table 2: Mean features of 4 communities.

Community                   Followers   Friends   History posts   Lat     Lon
Big name community          1474739     123835    149494          30.78   -89.99
Famous active community     535641      25967     137372          34.18   -117.59
Famous inactive community   500197      3519      102222          40.75   -82.55
Nobody community            21658       3770      113593          46.77   -122.46

Figure 2: BIC for different block numbers (vertical axis: BIC, ranging from -10350 to -10000; horizontal axis: block_dim = 3, 4, 5).

From the mean value of every feature reported in Table 2, the four user communities can be roughly summarized by their activeness as follows: (1) big name community, within which the users are more likely to have a giant group of followers and friends and are highly active on Twitter; (2) nobody community, within which users have a fairly small number of followers and friends compared to the other three communities, and their posting history is not very active either; (3) famous inactive community, in which users have quite a lot of followers but only a few friends and a relatively small amount of history posts, so this group of users might be "stars" in some fields (large follower group), but they are less likely to interact with others on Twitter and therefore are not active; (4) famous active community, in which users do have many followers but, unlike the inactive community, the average number of friends and history posts is huge, which indicates that they are very active on Twitter.

If we further examine the spatial distribution of features within every community in Figure 3, it is found that (1) the amount of followers and friends is distributed highly unevenly within every community, with only one or two synthetic users having extremely large values; this uneven distribution pattern suggests a classical centre-periphery structure within a community, and the users with the greatest amount of followers and/or friends are leaders for the spreading of opinions within their own community and across different communities; (2) the amount of history posts is much more evenly distributed within all four communities, which reflects an important characteristic of social media: every user has the same right to express their own opinion, no matter whether or not they are famous or influential in real life; (3) although users within every community are not gathered spatially, there exists a weak spatial segregation pattern of the four communities (the segregation can be better visualized in Figure 4); to better understand the source of the spatial segregation, future studies are needed.

Figure 3: Spatial distribution of features of users within different communities (famous inactive, famous active, big name, nobody). (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities.

Table 3: Link matrix of 4 communities.

                               From big name   From famous active   From famous inactive   From nobody
To big name community          1               1                    1                      1
To famous active community     1               1                    0.701                  0.637
To famous inactive community   0.175           0.365                0                      0
To nobody community            0               0                    0                      0.01

Figure 4: Estimate for the interaction matrix (link weights between the famous inactive, famous active, big name, and nobody communities, on a scale from 0.01 to 1.00).

The link strength between different communities is presented in Table 3 (the "From" label in the column header indicates that values in each column represent the impact strength from the community in the column header to the other communities; the "To" label in the row name indicates that values in each row represent the impact strength from the other communities to the community in the row label) and visualized in Figure 4. A clear hierarchical structure can be seen in the link matrix: the big name community dominates all the other communities in terms of its sensitivity to social opinions, followed by the famous active community. Compared to the famous active community, however, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Members of the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from their insiders or from users in the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which constitutes another form of opinion discrimination [54].
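The hierarchy described for Table 3 can be checked with a few lines of arithmetic. The sketch below uses the values reported in Table 3 (rows are "To", columns are "From"); summarizing each community by its row and column sums is an illustrative choice made here, not a measure defined by the authors.

```python
# Table 3 as reported in the text: rows are "To", columns are "From".
names = ["big name", "famous active", "famous inactive", "nobody"]
link = [
    [1.0, 1.0, 1.0, 1.0],      # to big name
    [1.0, 1.0, 0.701, 0.637],  # to famous active
    [0.175, 0.365, 0.0, 0.0],  # to famous inactive
    [0.0, 0.0, 0.0, 0.01],     # to nobody
]

# Total incoming strength per community (row sums).
incoming = {n: sum(row) for n, row in zip(names, link)}
# Total outgoing strength per community (column sums).
outgoing = {n: sum(col) for n, col in zip(names, zip(*link))}

# Rank communities by how much they receive from all communities.
ranking = sorted(names, key=incoming.get, reverse=True)
print(ranking)
print(incoming)
print(outgoing)
```

Ranking by incoming strength reproduces the order discussed in the text: big name first, then famous active, famous inactive, and nobody. The nobody community receives only the small within-community weight of 0.01.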

The above analysis reveals quite a few interesting features of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it would be helpful to do more text mining work on the tweets involved in the spreading process, add the extracted information as covariates to the spreading process, and re-estimate the hidden block network. To do so, a semiparametric extension of the network estimators in this paper is needed; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable, while for those spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure; its practical usefulness is illustrated by a real data application, where the spreading process of tweets regarding the "Unite the Right rally" is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active and inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.


There are some limitations of the current study. First, we only show that the fast algorithm is efficient in improving the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy and computation is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored in the form of a series of snapshots with a fixed length of observational interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint leading to identifiability, it may not always be reasonable. Therefore, a further study on feasible and proper identification conditions would be very meaningful in both theoretical and practical aspects.

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology is done by Xiaoqi Zhang and Xiaobing Zhao; software is contributed by Xiaoqi Zhang; validation is done by Yanqiao Zheng and Xinyue Ye; formal analysis is carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation is done by Yanqiao Zheng; resources are contributed by Xiaobing Zhao and Xinyue Ye; data curation is done by Xinyue Ye; original draft preparation is carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing is done by Xinyue Ye and Yanqiao Zheng; visualization is done by Qiwen Dai; supervision is provided by Xiaobing Zhao; project administration is done by Xiaobing Zhao and Xinyue Ye; funding acquisition is carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, "TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data," IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding, et al., "Exploring human mobility patterns using geo-tagged social media data at the group level," Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw, et al., "SemanticTraj: a new approach to interacting with massive taxi trajectories," IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, "Prediction of suspect location based on spatiotemporal semantics," ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, "Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data," ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, "Analysis of spatiotemporal trajectories for stops along taxi paths," Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou, et al., "Detecting events from the social media through exemplar-enhanced supervised learning," International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, "Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data," Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, "Rumor spreading in social networks," Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, "Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency," Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, "Cost-efficient strategies for restraining rumor spreading in mobile social networks," IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, "An open source spatiotemporal model for simulating obesity prevalence," in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, "Open source social network simulator focusing on spatial meme diffusion," in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton, et al., "Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle," JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, "SIHR rumor spreading model in social networks," Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, "Effects of time-dependent diffusion behaviors on the rumor spreading in social networks," Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, "Dynamic analysis of a stochastic rumor propagation model," Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, "Critical overview of agent-based models for economics," https://arxiv.org/abs/1101.1847.

[21] W. Luo, "Visual analytics of geo-social interaction patterns for epidemic control," International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, "A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area," Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo, et al., "Information spreading in complex networks with participation of independent spreaders," Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, "Threshold models of collective behavior," American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, "Talk of the network: a complex systems look at the underlying process of word-of-mouth," Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, "Toward a model of meme diffusion (M3D)," Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, "Estimation of dynamic models with nonparametric simulated maximum likelihood," Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, "Mixture models and exploratory analysis in networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, "Link prediction in complex networks: a survey," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, "Review of statistical network analysis: models, algorithms, and software," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, "Mixed membership stochastic block models for relational data with application to protein-protein interactions," in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, "Indirect estimation of the parameters of agent based models of financial markets," FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, "Estimation of ergodic agent-based models by simulated minimum distance," Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, "Estimation of financial agent-based models with simulated maximum likelihood," Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, "Solving the apparent diversity-accuracy dilemma of recommender systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, "A semiparametric extension of the stochastic block model for longitudinal networks," Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, "Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels," The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, "Reconstructing propagation networks with natural diversity and identifying hidden sources," Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, "Mean field theory for nonequilibrium network reconstruction," Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, "Identifiability in dynamic network identification," IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, "Network reconstruction based on evolutionary-game data via compressive sensing," Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, "Sparse network identifiability via compressed sensing," Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, "Synchrony, waves, and spatial hierarchies in the spread of influenza," Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, "Measure-valued processes and interacting particle systems: application to nonlinear filtering problems," The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, "A theory of mean field approximation," in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, "The derivative of a determinant," The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, "Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process," Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, "A review of features for the discrimination of Twitter users: application to the prediction of offline influence," Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.


Page 12: Mining the Hidden Link Structure from Distribution Flows for a … · 2019. 7. 30. · and Barunik [ ], simulation is conducted on the level of random variable, while, in our case,

12 Complexity

Table 2 Mean features of 4 communities

Followers Friends History posts Lat LonBig name community 1474739 123835 149494 3078 -8999Famous active community 535641 25967 137372 3418 -11759Famous inactive community 500197 3519 102222 4075 -8255Nobody community 21658 3770 113593 4677 -12246

minus10000

minus10050

minus10100

minus10150

minus10200

minus10250

minus10300

minus10350

BIC

block_dim=3 block_dim=4 block_dim=5

Figure 2 BIC for different block numbers

locations of every user within every community in subfigures(a) (b) and (c) respectively Note that the 100 users in plot 3are synthetic in the sense that their attributes are describedby the centre vectors of 100 clusters yielded from applyingK-means clustering to the full set of 10000+ users Becausethe clustering is taken on a 5-dimensional feature space thelocation of every synthetic user may not lie exactly withina city in the US nor around a group of neighboring citiesAlthough the deviation between synthetic users and real usersseems to be anomalous it does reflect the information losswhen the higher-dimensional cluster is projected to a low-dimensional space this part of lost information can playa critical role in determining the community membershipof both the synthetic and real users To see this considerthe synthetic user represented by the largest green dot inFigure 3(a) its geographic location is obviously not close toevery city or cities group within our record To be groupedinto the same cluster by K-means method all real userscorresponding to this synthetic user have to have the propertythat they are quite far away from each other geographicallybut highly analogous in the other dimension of featuressuch as the number of followers in this case Consequentlythe community membership of the giant green-dot user andthe real users represented by it is not fully determined bygeographic factors while it is more likely to depend on theextra social factors such as the amount of followers whichare not directly related to usersrsquo locations This observationalso justifies the necessity of including extra information intothe analysis of information spreading process on Twitter

From the mean value of every feature reported in Table 2the four user communities can be roughly summarized bytheir activeness as follows (1) big name community withinwhich the users are more likely to have a giant group offollowers and friends meanwhile they are highly active onTwitter (2) nobody community within this community users

have a fairly small number of followers and friends comparedto the other three communities their history posts are notquite active either (3) famous inactive community users inthis community have quite a lot of followers but only a fewfriends and a relatively small amount of history posts so thisgroup of users might be ldquostarsrdquo in some fields (large followergroup) but they are less likely to interact with the otherson Twitter and therefore are not active (4) famous activecommunity users in this community do havemany followersbut different from inactive community the average numberof friends and history posts is huge which indicates that theyare very active on Twitter

If we further exam the spatial distribution of featureswithin every community in Figure 3 it is found that (1)for the amount of followers and friends their spatial dis-tribution is highly uneven within every community thereare only one or two synthetic users with extremely largevalue this uneven distribution pattern suggests a classicalcentre-periphery structurewithin a community and the userswith greatest amount of followers andor friends are leadersfor the spreading of opinions within their own communityand across different communities (2) the amount of historyposts is much more evenly distributed within all the fourcommunities which reflects the important characteristics ofsocial media that every user on it has the same right toexpress their own opinion no matter whether or not they arefamous or influential in the real life (3) although users withinevery community are not gathered spatially there exists aweak spatial segregation pattern of the four communities(the segregation can be better visualized in Figure 4) tobetter understand the source of the spatial segregation futurestudies are needed

The link strength between different communities is presented in Table 3 and visualized in Figure 4. In Table 3, the “From” label in a column header indicates that the values in that column represent the impact strength from the community named in the header to the other communities; the “To” label in a row name indicates that the values in that row represent the impact strength from the other communities to the community named in the row label. A clear hierarchical structure can be read from the link matrix: the big name community dominates all the other communities in terms of sensitivity to social opinions, followed by the famous active community. But compared to the famous active community, the big name community is more likely to accept arguments sourced from the nobody and famous inactive communities. Users in the famous inactive community only read the tweets posted by members of the big name and famous active communities and receive nothing from insiders or from users of the nobody community; this observation reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which is another form of opinion discrimination [54].

Figure 3: Spatial distribution of features of users within different communities. (a) Spatial distribution of follower numbers within different communities. (b) Spatial distribution of friend numbers within different communities. (c) Spatial distribution of history posts within different communities.

Table 3: Link matrix of 4 communities.

| | From big name community | From famous active community | From famous inactive community | From nobody community |
|---|---|---|---|---|
| To big name community | 1 | 1 | 1 | 1 |
| To famous active community | 1 | 1 | 0.701 | 0.637 |
| To famous inactive community | 0.175 | 0.365 | 0 | 0 |
| To nobody community | 0 | 0 | 0 | 0.01 |

Figure 4: Estimate for interaction matrix.
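The hierarchy described for Table 3 can be checked with a few lines of Python. The sketch below is only an illustration and is not taken from the authors' code: it stores the Table 3 values in a NumPy array (rows are the receiving "To" communities, columns are the sending "From" communities) and sums each row and column. The variable names `W`, `received`, `exerted`, and `isolated` are hypothetical.

```python
import numpy as np

communities = ["big name", "famous active", "famous inactive", "nobody"]

# Table 3 estimates. Rows: receiving ("To") community.
# Columns: sending ("From") community, in the same order as `communities`.
W = np.array([
    [1.0,   1.0,   1.0,   1.0],
    [1.0,   1.0,   0.701, 0.637],
    [0.175, 0.365, 0.0,   0.0],
    [0.0,   0.0,   0.0,   0.01],
])

received = W.sum(axis=1)  # total impact each community receives (row sums)
exerted = W.sum(axis=0)   # total impact each community exerts (column sums)

for name, r, e in zip(communities, received, exerted):
    print(f"{name:16s} received={r:.3f} exerted={e:.3f}")

# A community is isolated from the others if it receives nothing
# from outside its own block (all off-diagonal entries in its row are zero).
off_diagonal = W - np.diag(np.diag(W))
isolated = [communities[i] for i in range(len(communities))
            if off_diagonal[i].sum() == 0]
print("isolated:", isolated)
```

Ordering the communities by the row sums reproduces the ranking given in the text (big name, famous active, famous inactive, nobody), and only the nobody community has an all-zero off-diagonal row.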

From the above analysis, quite a few interesting features can be drawn out of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and reestimate the hidden block network. Doing so requires a semiparametric extension of the network estimators in this paper; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable. For spreading processes that cannot be repeatedly observed, the estimator is still valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for block networks, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, where the spreading process of tweet articles regarding the “Unite the Right rally” event is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active users, famous inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure, and opinion discrimination exists, as expected, among different communities.


The current study has some limitations. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy against computation cost is tricky, and further studies are needed. Second, high-frequency observation may not always be possible. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract distribution flows of arbitrarily high frequency from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observational interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint leading to identifiability, it may not always be reasonable. Therefore, further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.
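The second limitation can be made concrete with a small Python sketch of how {0, 1}-valued distribution flows might be extracted from exact posting times at a chosen observation frequency. This is an illustration under stated assumptions, not the authors' implementation: the function `distribution_flow` and the sample posting times are hypothetical, and a user's status is assumed to switch from 0 to 1 at the first post and to stay at 1 afterwards.

```python
import numpy as np

def distribution_flow(first_post_times, obs_times):
    """Return a (T, N) array of 0/1 statuses.

    Entry [t, i] is 1 if user i has posted by obs_times[t], else 0.
    Assumption: the status is absorbing (once 1, always 1).
    """
    first = np.asarray(first_post_times, dtype=float)
    obs = np.asarray(obs_times, dtype=float)
    # Broadcasting: rows index observation times, columns index users.
    return (first[None, :] <= obs[:, None]).astype(int)

# Hypothetical first-posting times in hours; np.inf marks a user who never posts.
first_post_times = [0.5, 2.2, np.inf, 1.0]

# With exact timestamps, the observation grid can be chosen freely.
fine = distribution_flow(first_post_times, np.arange(0.0, 3.5, 0.5))
# With fixed-interval snapshots, only the coarse grid is available.
coarse = distribution_flow(first_post_times, np.arange(0.0, 4.0, 1.0))

print(fine.shape, coarse.shape)
print(coarse)
```

On the coarse grid, the first and fourth users both switch status between the same two snapshots, so their order of posting is lost; the finer grid separates them. This is the information that cannot be recovered when only fixed-interval snapshots are stored.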

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was done by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the tensorflow deep neural network to classify Mainland China visitor behaviours in Hong Kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [PhD thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and Its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of Twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.



[38] C Matias T Rebafka and F Villers ldquoA semiparametric exten-sion of the stochastic block model for longitudinal networksrdquoBiometrika vol 105 no 3 pp 665ndash680 2018

[39] P Bickel D Choi X Chang and H Zhang ldquoAsymptoticnormality of maximum likelihood and its variational approxi-mation for stochastic blockmodelsrdquoeAnnals of Statistics vol41 no 4 pp 1922ndash1943 2013

[40] Z ShenW-XWang Y Fan Z Di and Y-C Lai ldquoReconstruct-ing propagation networks with natural diversity and identifyinghidden sourcesrdquo Nature Communications vol 5 article 43232014

[41] Y Roudi and J Hertz ldquoMean field theory for nonequilibriumnetwork reconstructionrdquo Physical Review Letters vol 106 no4 2011

[42] H H M Weerts A G Dankers and P M J Van denHof ldquoIdentifiability in dynamic network identificationrdquo IFAC-PapersOnLine vol 48 no 28 pp 1409ndash1414 2015

[43] W-X Wang Y-C Lai C Grebogi and J Ye ldquoNetwork recon-struction based on evolutionary-game data via compressivesensingrdquo Physical Review X vol 1 no 2 Article ID 021021 pp1ndash7 2011

[44] D Hayden Y H Chang J Goncalves and C J Tomlin ldquoSparsenetwork identifiability via compressed sensingrdquo Automaticavol 68 pp 9ndash17 2016

[45] C Viboud O N Bjoslashrnstad D L Smith L Simonsen MA Miller and B T Grenfell ldquoSynchrony waves and spatialhierachies in the spread of influenzardquo Science vol 312 no 5772pp 447ndash451 2006

[46] N J Gordon D J Salmond and S Adrian ldquoNovel approachto nonlinearnon-gaussian Bayesian state estimationrdquo IEE Pro-ceedings F (Radar and Signal Processing) vol 140 no 2 pp 107ndash113 1993

[47] P D Moral ldquoMeasure-valued processes and interacting parti-cle systems application to nonlinear filtering problemsrdquo eAnnals of Applied Probability vol 80 no 2 pp 438ndash495 1998

[48] T Tanaka ldquoA theory of mean field approximationrdquo in Advancesin Neural Information Processing Systems pp 351ndash360 1999

[49] M S Arulampalam S Maskell N Gordon and T Clapp ldquoAtutorial on particle filters for online nonlinearnon-GaussianBayesian trackingrdquo IEEE Transactions on Signal Processing vol50 no 2 pp 174ndash188 2002

[50] PDelMoralMeanField Simulation forMonte Carlo IntegrationChapman and HallCRC 2013

Complexity 17

[51] M A Golberg ldquoThe derivative of a determinantrdquoeAmericanMathematical Monthly vol 79 no 11 pp 1124ndash1126 1972

[52] P K Andersen L S Hansen and N Keiding ldquoNon-andsemi-parametric estimation of transition probabilities fromcensored observation of a non-homogeneous markov processrdquoScandinavian Journal of Statistics vol 18 no 2 pp 153ndash167 1991

[53] G Schwarz ldquoEstimating the dimension of a modelrdquoe Annalsof Statistics vol 6 no 2 pp 461ndash464 1978

[54] J-V Cossu V Labatut and N Dugue ldquoA review of features forthe discrimination of twitter users application to the predictionof offline influencerdquo Social Network Analysis andMining vol 6no 1 p 25 2016

Figure 4: Estimate for interaction matrix. (Map legend: community weight classes 0.01-0.02, 0.02-0.17, 0.17-0.36, 0.36-0.70, 0.70-1.00; community labels: famous inactive, famous active, big name, nobody; scale bar 0-2000 km.)

reflects some kind of opinion discrimination. Finally, the nobody community seems to be isolated from all the other communities and only hears from its insiders, which is another form of opinion discrimination [54].

From the above analysis, quite a few interesting features can be drawn out of the information spreading process on Twitter. To better understand the formation of the four communities and the hierarchical structure of the link matrix, it should be helpful to do more text mining on the tweet articles involved in the spreading process, add the extracted information as covariates to the spreading process, and reestimate the hidden block network. Doing so requires a semiparametric extension of the network estimators in this paper; we leave this challenge for future research.

6. Conclusion and Future Direction

In this paper, we propose a novel approach to nonparametrically estimate the hidden interaction network behind an information spreading process. This approach is designed to handle an important feature of information spreading processes: the specific spreading trajectory does not exist, and only the distribution flow of the spreading status is observable. To characterize the formation of distribution flows, a mean-field process/equation is proposed. A nonparametric simulation-based maximum likelihood estimator is developed to resolve the subtlety induced by the mean-field equation and the fully nonparametric network edge function. Our estimation procedure can also be applied to the block network structure, a special case of the fully nonparametric network.

To the best of our knowledge, our work is the first attempt to implement a fully nonparametric estimation of the network structure for distribution flow data and information spreading processes. The resulting estimator is always valid if the spreading process is repeatedly observable. For spreading processes that cannot be repeatedly observed, the estimator remains valid in the sense that it is identifiable up to a compact convex set for a fully nonparametric network and completely identifiable for a block network under a generic constraint. Therefore, for a block network, consistency and asymptotic normality can always be established in the standard way, which is enough for practical use.

Numerical experiments are conducted to verify the effectiveness of our estimation procedure. Its practical usefulness is illustrated by a real data application, where the spreading process of tweet articles regarding the event “Unite the Right rally” is studied and a block network is fitted. The fitting result shows that Twitter users involved in the spreading process can be divided into four communities, which correspond to big name users, famous active users, famous inactive users, and nobody users. Connections among these four communities display a remarkable hierarchical structure; opinion discrimination exists, as expected, among different communities.

The current study has some limitations. First, we only show that the fast algorithm is efficient in raising the computation speed when the number of observation times is relatively small compared to the total number of nodes, but a low observation frequency might enlarge the estimation bias. In practice, how to balance estimation accuracy against computation cost is tricky, and further studies are needed. Second, high-frequency observation may not always be possible in many applications. In the Twitter data analyzed in this paper, the exact time of posting is available, which makes it possible to extract arbitrarily high-frequency distribution flows from the given data. But in many other applications, the distribution flows are stored as a series of snapshots with a fixed observation interval. In that case, the observation frequency is strictly controlled by the interval length and cannot be adjusted at all; how to develop a reasonable algorithm for this setting is still an open question. Third, as mentioned in Section 3.6, complete identifiability for the fully nonparametric network is not achievable, so constraints are needed to guarantee the desired identifiability. Although, as shown in Remark 2, sparsity is a good constraint leading to identifiability, it may not always be reasonable. Therefore, further study of feasible and proper identification conditions should be very meaningful in both theoretical and practical aspects.
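The second limitation can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the code used in this study: the function name distribution_flow, the posting times, and the observation grids are all hypothetical. When exact posting times are known, a {0, 1}-valued distribution flow can be rebuilt on any observation grid, whereas fixed-interval snapshots correspond to a single grid that cannot be refined afterwards.

```python
import numpy as np

def distribution_flow(post_times, obs_times):
    """Build a {0,1}-valued distribution flow from first-posting times.

    Row t, column i is 1 if node i has posted by obs_times[t], else 0.
    np.inf in post_times marks a node that never posts (an assumption of
    this sketch, not a convention taken from the paper).
    """
    post_times = np.asarray(post_times, dtype=float)
    obs_times = np.asarray(obs_times, dtype=float)
    # Broadcasting compares every observation time with every posting time.
    return (post_times[None, :] <= obs_times[:, None]).astype(int)

# Hypothetical first-posting times (in hours) for five users.
post_times = [0.5, 2.0, np.inf, 1.2, 3.7]

# Same raw data, two observation frequencies.
coarse = distribution_flow(post_times, np.arange(0.0, 5.0, 2.0))  # every 2 hours
fine = distribution_flow(post_times, np.arange(0.0, 5.0, 0.5))    # every 30 minutes

print(coarse)
print(fine.shape)
```

With timestamped data both coarse and fine are available, so the observation frequency becomes a tuning choice. With snapshot data only one such array exists, which is the setting left open above.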

Data Availability

The data sample and Python code used in this article are available upon request from the corresponding author at xiaoqizh@buffalo.edu.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this manuscript.

Authors' Contributions

Conceptualization was carried out by Xiaoqi Zhang, Yanqiao Zheng, and Xinyue Ye; methodology was done by Xiaoqi Zhang and Xiaobing Zhao; software was contributed by Xiaoqi Zhang; validation was done by Yanqiao Zheng and Xinyue Ye; formal analysis was carried out by Xiaoqi Zhang, Xiaobing Zhao, and Qiwen Dai; investigation was done by Yanqiao Zheng; resources were contributed by Xiaobing Zhao and Xinyue Ye; data curation was done by Xinyue Ye; original draft preparation was carried out by Xiaoqi Zhang and Yanqiao Zheng; review and editing were done by Xinyue Ye and Yanqiao Zheng; visualization was done by Qiwen Dai; supervision was provided by Xiaobing Zhao; project administration was done by Xiaobing Zhao and Xinyue Ye; funding acquisition was carried out by Xiaobing Zhao.

Acknowledgments

This work was partially supported by the China National Planning Office of Philosophy and Social Sciences (18BTJ023). This work was presented at the 15th Xiang'Zhang Economic Forum Seminar (Beijing); the (co-)authors received valuable comments from Dr. Yougui Wang and Zhigang Cao.

References

[1] X. Huang, Y. Zhao, C. Ma, J. Yang, X. Ye, and C. Zhang, “TrajGraph: a graph-based visual analytics approach to studying urban network centralities using taxi trajectory data,” IEEE Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 160–169, 2016.

[2] C. Yang, M. Xiao, X. Ding et al., “Exploring human mobility patterns using geo-tagged social media data at the group level,” Journal of Spatial Science, pp. 1–18, 2018.

[3] S. Al-Dohuki, Y. Wu, F. Kamw et al., “SemanticTraj: a new approach to interacting with massive taxi trajectories,” IEEE Transactions on Visualization and Computer Graphics, vol. 23, no. 1, pp. 11–20, 2017.

[4] L. Duan, X. Ye, T. Hu, and X. Zhu, “Prediction of suspect location based on spatiotemporal semantics,” ISPRS International Journal of Geo-Information, vol. 60, no. 7, p. 185, 2017.

[5] S. Han, F. Ren, C. Wu, Y. Chen, Q. Du, and X. Ye, “Using the TensorFlow deep neural network to classify mainland China visitor behaviours in Hong Kong from check-in data,” ISPRS International Journal of Geo-Information, vol. 7, no. 4, p. 158, 2018.

[6] L. Huang, Y. Wen, X. Ye, C. Zhou, F. Zhang, and J. Lee, “Analysis of spatiotemporal trajectories for stops along taxi paths,” Spatial Cognition & Computation, pp. 1–23, 2018.

[7] X. Shi, B. Xue, M.-H. Tsou et al., “Detecting events from the social media through exemplar-enhanced supervised learning,” International Journal of Digital Earth, 2018.

[8] Z. Wang and X. Ye, “Space, time, and situational awareness in natural hazards: a case study of Hurricane Sandy with social media data,” Cartography and Geographic Information Science, 2018.

[9] F. Chierichetti, S. Lattanzi, and A. Panconesi, “Rumor spreading in social networks,” Theoretical Computer Science, vol. 412, no. 24, pp. 2602–2610, 2011.

[10] N. Song and L. Huo, “Dynamical interplay between the dissemination of scientific knowledge and rumor spreading in emergency,” Physica A: Statistical Mechanics and its Applications, vol. 461, pp. 73–84, 2016.

[11] Z. He, Z. Cai, J. Yu, X. Wang, Y. Sun, and Y. Li, “Cost-efficient strategies for restraining rumor spreading in mobile social networks,” IEEE Transactions on Vehicular Technology, vol. 66, no. 3, pp. 2789–2800, 2017.

[12] Z. Chen, An agent-based model for information diffusion over online social networks [Ph.D. thesis], Kent State University, 2016.

[13] J. Lee and X. Ye, “An open source spatiotemporal model for simulating obesity prevalence,” in GeoComputational Analysis and Modeling of Regional Systems, Advances in Geographic Information Science, pp. 395–410, Springer International Publishing, Cham, Switzerland, 2018.

[14] X. Ye, L. Dang, J. Lee, M. Tsou, and Z. Chen, “Open source social network simulator focusing on spatial meme diffusion,” in Human Dynamics Research in Smart and Connected Communities, Human Dynamics in Smart Cities, pp. 203–222, Springer International Publishing, Cham, Switzerland, 2018.

[15] W. Luo, D. A. Katz, D. T. Hamilton et al., “Development of an agent-based model to investigate the impact of HIV self-testing programs on men who have sex with men in Atlanta and Seattle,” JMIR Public Health and Surveillance, vol. 4, no. 2, article e58, 2018.

[16] L. Allen, F. Brauer, P. J. Van den Driessche, and J. Wu, Mathematical Epidemiology, vol. 1945, Springer, 2008.

[17] L. J. Zhao, J. J. Wang, Y. C. Chen, Q. Wang, J. Cheng, and H. Cui, “SIHR rumor spreading model in social networks,” Physica A: Statistical Mechanics and its Applications, vol. 391, no. 7, pp. 2444–2453, 2012.

[18] X. Qiu, L. Zhao, J. Wang, X. Wang, and Q. Wang, “Effects of time-dependent diffusion behaviors on the rumor spreading in social networks,” Physics Letters A, vol. 380, no. 24, pp. 2054–2063, 2016.

[19] F. Jia and G. Lv, “Dynamic analysis of a stochastic rumor propagation model,” Physica A: Statistical Mechanics and its Applications, vol. 490, pp. 613–623, 2018.

[20] M. Cristelli, L. Pietronero, and A. Zaccaria, “Critical overview of agent-based models for economics,” https://arxiv.org/abs/1101.1847.

[21] W. Luo, “Visual analytics of geo-social interaction patterns for epidemic control,” International Journal of Health Geographics, vol. 15, no. 1, article 28, 2016.

[22] W. Luo, P. Gao, and S. Cassels, “A large-scale location-based social network to understanding the impact of human geo-social interaction patterns on vaccination strategies in an urbanized area,” Computers, Environment and Urban Systems, vol. 72, pp. 78–87, 2018.

[23] K. Ma, W. Li, Q. Guo et al., “Information spreading in complex networks with participation of independent spreaders,” Physica A: Statistical Mechanics and its Applications, vol. 492, pp. 21–27, 2018.

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME, International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-Gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.

Page 15: Mining the Hidden Link Structure from Distribution Flows for a … · 2019. 7. 30. · and Barunik [ ], simulation is conducted on the level of random variable, while, in our case,

Complexity 15

There are some limitations of the current studies firstwe only show that the fast algorithm is efficient in liftingthe computation speed when the number of observationtimes is relatively small compared to the total number ofnodes but a low observation frequency might enlarge theestimation bias In practice how to balance the estimationaccuracy and the computation is tricky and further studiesare needed Second high frequent observation may notalways be possible in many applications In the Twitter dataanalyzed in this paper the exact time of posting is availablewhich makes it possible to extract arbitrarily high frequentdistribution flows from the given data But in many otherapplications the distribution flows are stored in the formof a series of snapshots with fixed length of observationalinterval In that case the observation frequency is strictlycontrolled by the interval length and not stretchable at all forwhich how to develop a reasonable algorithm is still an openquestion Third as mentioned in Section 36 the completeidentifiability for the fully nonparametric network is notachievable So constraints are needed to guarantee the desiredidentifiability Although as shown in Remark 2 sparsity isa good constraint to lead identifiability it may not alwaysbe reasonable Therefore a further study on the feasible andproper identification condition should be very meaningful inboth theoretical and practical aspects

Data Availability

The data sample and Python code used in this article areavailable per request from the corresponding author throughxiaoqizhbuff aloedu

Conflicts of Interest

The authors declare no conflicts of interest regarding thepublication of this manuscript

Authorsrsquo Contributions

Conceptualization was carried out by Xiaoqi Zhang YanqiaoZheng and Xinyue Yemethodology is done by Xiaoqi Zhangand Xiaobing Zhao software is contributed by Xiaoqi Zhangvalidation is done by Yanqiao Zheng and Xinyue Ye formalanalysis is carried out by Xiaoqi Zhang Xiaobing Zhaoand Qiwen Dai investigation is done by Yanqiao Zhengresources are contributed by Xiaobing Zhao and Xinyue Yedata curation is done by Xinyue Ye original draft preparationis carried out by Xiaoqi Zhang and Yanqiao Zheng reviewand editing is done by Xinyue Ye and Yanqiao Zhengvisualization is done by Qiwen Dai supervision is providedbyXiaobingZhao project administration is done byXiaobingZhao and Xinyue Ye funding acquisition is carried out byXiaobing Zhao

Acknowledgments

This work was partially supported by the China NationalPlanning Office of Philosophy and Social Sciences(18BTJ023)This work was presented at the 15th XiangrsquoZhang

Economic Forum Seminar (Beijing) the (co-)authors re-ceived valuable comments from Dr Yougui Wang and Zhi-gang Cao

References

[1] X Huang Y Zhao C Ma J Yang X Ye and C Zhang ldquoTra-jGraph a graph-based visual analytics approach to studyingurban network centralities using taxi trajectory datardquo IEEETransactions on Visualization and Computer Graphics vol 22no 1 pp 160ndash169 2016

[2] C Yang M Xiao X Ding et al ldquoExploring human mobilitypatterns using geo-tagged social media data at the group levelrdquoJournal of Spatial Science pp 1ndash18 2018

[3] S Al-Dohuki Y Wu F Kamw et al ldquoSemanticTraj a newapproach to interacting with massive taxi trajectoriesrdquo IEEETransactions on Visualization and Computer Graphics vol 23no 1 pp 11ndash20 2017

[4] L Duan X Ye T Hu and X Zhu ldquoPrediction of suspect loca-tion based on spatiotemporal semanticsrdquo ISPRS InternationalJournal of Geo-Information vol 60 no 7 p 185 2017

[5] S Han F Ren C Wu Y Chen Q Du and X Ye ldquoUsingthe tensorflow deep neural network to classify mainland chinavisitor behaviours in hong kong from check-in datardquo ISPRSInternational Journal of Geo-Information vol 7 no 4 p 1582018

[6] L Huang Y Wen X Ye C Zhou F Zhang and J Lee ldquoAnalysisof spatiotemporal trajectories for stops along taxi pathsrdquo SpatialCognition amp Computation pp 1ndash23 2018

[7] X Shi B Xue M-H Tsou et al ldquoDetecting events from thesocial media through exemplar-enhanced supervised learningrdquoInternational Journal of Digital Earth 2018

[8] Z Wang and X Ye ldquoSpace time and situational awareness innatural hazards a case study of hurricane sandy with socialmedia datardquo Cartography and Geographic Information Science2018

[9] F Chierichetti S Lattanzi andA Panconesi ldquoRumor spreadingin social networksrdquo eoretical Computer Science vol 412 no24 pp 2602ndash2610 2011

[10] N Song and L Huo ldquoDynamical interplay between the dissem-ination of scientific knowledge and rumor spreading in emer-gencyrdquo Physica A Statistical Mechanics and its Applications vol461 pp 73ndash84 2016

[11] Z He Z Cai J Yu X Wang Y Sun and Y Li ldquoCost-efficientstrategies for restraining rumor spreading in mobile socialnetworksrdquo IEEE Transactions on Vehicular Technology vol 66no 3 pp 2789ndash2800 2017

[12] Z Chen An agent-based model for information diffusion overonline social networks [PhD thesis] Kent State University 2016

[13] J Lee and X Ye ldquoAn open source spatiotemporal model forsimulating obesity prevalencerdquo in GeoComputational Analysisand Modeling of Regional Systems Advances in GeographicInformation Science pp 395ndash410 Springer International Pub-lishing Cham Switzerland 2018

[14] X Ye L Dang J Lee M Tsou and Z Chen ldquoOpen sourcesocial network simulator focusing on spatial meme diffusionrdquoinHumanDynamics Research in Smart and Connected Commu-nities Human Dynamics in Smart Cities pp 203ndash222 SpringerInternational Publishing Cham Switzerland 2018

[15] W Luo D A Katz D T Hamilton et al ldquoDevelopment of anagent-basedmodel to investigate the impact of HIV self-testing

16 Complexity

programs onmenwho have sex withmen in atlanta and seattlerdquoJMIR Public Health and Surveillance vol 4 no 2 article e582018

[16] L Allen F Brauer P J Van den Driessche and J WuMathematical Epidemiology vol 1945 Springer 2008

[17] L J Zhao J J Wang Y C Chen Q Wang J Cheng and HCui ldquoSIHR rumor spreading model in social networksrdquo PhysicaA Statistical Mechanics and its Applications vol 391 no 7 pp2444ndash2453 2012

[18] X Qiu L Zhao J Wang X Wang and Q Wang ldquoEffects oftime-dependent diffusion behaviors on the rumor spreading insocial networksrdquo Physics Letters A vol 380 no 24 pp 2054ndash2063 2016

[19] F Jia and G Lv ldquoDynamic analysis of a stochastic rumorpropagation modelrdquo Physica A Statistical Mechanics and itsApplications vol 490 pp 613ndash623 2018

[20] M Cristelli L Pietronero and A Zaccaria ldquoCritical overviewof agent-based models for economicsrdquo httpsarxivorgabs11011847

[21] W Luo ldquoVisual analytics of geo-social interaction patterns forepidemic controlrdquo International Journal of Health Geographicsvol 15 no 1 article 28 2016

[22] W Luo P Gao and S Cassels ldquoA large-scale location-basedsocial network to understanding the impact of human geo-social interaction patterns on vaccination strategies in anurbanized areardquo Computers Environment and Urban Systemsvol 72 pp 78ndash87 2018

[23] K Ma W Li Q Guo et al ldquoInformation spreading in complexnetworks with participation of independent spreadersrdquo PhysicaA Statistical Mechanics and Its Applications vol 492 pp 21ndash272018

[24] M. Granovetter, “Threshold models of collective behavior,” American Journal of Sociology, vol. 83, no. 6, pp. 1420–1443, 1978.

[25] J. Goldenberg, B. Libai, and E. Muller, “Talk of the network: a complex systems look at the underlying process of word-of-mouth,” Marketing Letters, vol. 12, no. 3, pp. 211–223, 2001.

[26] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.

[27] B. H. Spitzberg, “Toward a model of meme diffusion (M3D),” Communication Theory, vol. 24, no. 3, pp. 311–339, 2014.

[28] W. Hardle, Applied Nonparametric Regression, Econometric Society Monographs, no. 19, Cambridge University Press, 1990.

[29] D. Kristensen and Y. Shin, “Estimation of dynamic models with nonparametric simulated maximum likelihood,” Journal of Econometrics, vol. 167, no. 1, pp. 76–94, 2012.

[30] M. E. J. Newman and E. A. Leicht, “Mixture models and exploratory analysis in networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 23, pp. 9564–9569, 2007.

[31] L. Lu and T. Zhou, “Link prediction in complex networks: a survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150–1170, 2011.

[32] M. Salter-Townshend, A. White, I. Gollini, and T. B. Murphy, “Review of statistical network analysis: models, algorithms, and software,” Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 5, no. 4, pp. 243–264, 2012.

[33] E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. Xing, and T. Jaakkola, “Mixed membership stochastic blockmodels for relational data with application to protein-protein interactions,” in Proceedings of the International Biometrics Society Annual Meeting, vol. 15, 2006.

[34] P. Winker and M. Gilli, “Indirect estimation of the parameters of agent based models of financial markets,” FAME Working Paper No. 38, FAME International Center for Financial Asset Management and Engineering, 2001.

[35] J. Grazzini and M. Richiardi, “Estimation of ergodic agent-based models by simulated minimum distance,” Journal of Economic Dynamics & Control, vol. 51, pp. 148–165, 2015.

[36] J. Kukacka and J. Barunik, “Estimation of financial agent-based models with simulated maximum likelihood,” Journal of Economic Dynamics & Control, vol. 85, pp. 21–45, 2017.

[37] T. Zhou, Z. Kuscsik, J. Liu, M. Medo, J. R. Wakeling, and Y. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences of the United States of America, vol. 107, no. 10, pp. 4511–4515, 2010.

[38] C. Matias, T. Rebafka, and F. Villers, “A semiparametric extension of the stochastic block model for longitudinal networks,” Biometrika, vol. 105, no. 3, pp. 665–680, 2018.

[39] P. Bickel, D. Choi, X. Chang, and H. Zhang, “Asymptotic normality of maximum likelihood and its variational approximation for stochastic blockmodels,” The Annals of Statistics, vol. 41, no. 4, pp. 1922–1943, 2013.

[40] Z. Shen, W.-X. Wang, Y. Fan, Z. Di, and Y.-C. Lai, “Reconstructing propagation networks with natural diversity and identifying hidden sources,” Nature Communications, vol. 5, article 4323, 2014.

[41] Y. Roudi and J. Hertz, “Mean field theory for nonequilibrium network reconstruction,” Physical Review Letters, vol. 106, no. 4, 2011.

[42] H. H. M. Weerts, A. G. Dankers, and P. M. J. Van den Hof, “Identifiability in dynamic network identification,” IFAC-PapersOnLine, vol. 48, no. 28, pp. 1409–1414, 2015.

[43] W.-X. Wang, Y.-C. Lai, C. Grebogi, and J. Ye, “Network reconstruction based on evolutionary-game data via compressive sensing,” Physical Review X, vol. 1, no. 2, Article ID 021021, pp. 1–7, 2011.

[44] D. Hayden, Y. H. Chang, J. Goncalves, and C. J. Tomlin, “Sparse network identifiability via compressed sensing,” Automatica, vol. 68, pp. 9–17, 2016.

[45] C. Viboud, O. N. Bjørnstad, D. L. Smith, L. Simonsen, M. A. Miller, and B. T. Grenfell, “Synchrony, waves, and spatial hierarchies in the spread of influenza,” Science, vol. 312, no. 5772, pp. 447–451, 2006.

[46] N. J. Gordon, D. J. Salmond, and S. Adrian, “Novel approach to nonlinear/non-gaussian Bayesian state estimation,” IEE Proceedings F (Radar and Signal Processing), vol. 140, no. 2, pp. 107–113, 1993.

[47] P. D. Moral, “Measure-valued processes and interacting particle systems: application to nonlinear filtering problems,” The Annals of Applied Probability, vol. 80, no. 2, pp. 438–495, 1998.

[48] T. Tanaka, “A theory of mean field approximation,” in Advances in Neural Information Processing Systems, pp. 351–360, 1999.

[49] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Transactions on Signal Processing, vol. 50, no. 2, pp. 174–188, 2002.

[50] P. Del Moral, Mean Field Simulation for Monte Carlo Integration, Chapman and Hall/CRC, 2013.

[51] M. A. Golberg, “The derivative of a determinant,” The American Mathematical Monthly, vol. 79, no. 11, pp. 1124–1126, 1972.

[52] P. K. Andersen, L. S. Hansen, and N. Keiding, “Non- and semi-parametric estimation of transition probabilities from censored observation of a non-homogeneous Markov process,” Scandinavian Journal of Statistics, vol. 18, no. 2, pp. 153–167, 1991.

[53] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.

[54] J.-V. Cossu, V. Labatut, and N. Dugue, “A review of features for the discrimination of twitter users: application to the prediction of offline influence,” Social Network Analysis and Mining, vol. 6, no. 1, p. 25, 2016.
