
Proceedings on Privacy Enhancing Technologies ; 2020 (4):131–152

Xihui Chen, Sjouke Mauw, and Yunior Ramírez-Cruz*

Publishing Community-Preserving Attributed Social Graphs with a Differential Privacy Guarantee

Abstract: We present a novel method for publishing differentially private synthetic attributed graphs. Our method makes it possible, for the first time, to publish synthetic graphs that simultaneously preserve structural properties, user attributes and the community structure of the original graph. Our proposal relies on CAGM, a new community-preserving generative model for attributed graphs. We equip CAGM with efficient methods for attributed graph sampling and parameter estimation. For the latter, we introduce differentially private computation methods, which allow us to release community-preserving synthetic attributed social graphs with a strong formal privacy guarantee. Through comprehensive experiments, we show that our new model outperforms its most relevant counterparts in synthesising differentially private attributed social graphs that preserve the community structure of the original graph, as well as degree sequences and clustering coefficients.

Keywords: attributed social graphs, generative models, differential privacy, community detection

DOI 10.2478/popets-2020-0066
Received 2020-02-29; revised 2020-06-15; accepted 2020-06-16.

Xihui Chen: SnT, University of Luxembourg, Esch-sur-Alzette, Luxembourg, E-mail: [email protected]
Sjouke Mauw: DCS, University of Luxembourg, Esch-sur-Alzette, Luxembourg, E-mail: [email protected]
*Corresponding Author: Yunior Ramírez-Cruz: SnT, University of Luxembourg, Esch-sur-Alzette, Luxembourg, E-mail: [email protected]

1 Introduction

The use of online social networks (OSNs) has grown steadily in recent years and is expected to continue growing in the future. The ubiquity of OSNs has turned them into one of the most important sources of data for the analysis of social phenomena. Such analyses have led to significant findings used in a wide range of applications, from efficient epidemic disease control [8, 36] to information diffusion [21, 66]. Despite the social benefits that can be obtained from social network analysis, access to social data by third parties such as researchers and companies must be limited due to the sensitivity of the information stored in OSNs, e.g. personal relationships, political preferences and religious affiliations. In addition, the increase of public awareness about privacy and the entry into effect of strong privacy regulations such as the GDPR [1] strengthen the reluctance of OSN owners to release part of the data they hold. Therefore, it is of critical importance to provide mechanisms for privacy-preserving social data publication to encourage OSN owners to release data for analysis and to provide strong privacy guarantees to their users.

Social graphs are a natural representation of social networks, with nodes corresponding to participants and edges to connections between participants. A large number of methods have been devised for publishing sanitised versions of the original social graph [5, 7, 9, 10, 33–35, 38, 39, 50, 52, 53, 58, 65, 67], computing graph statistics in a privacy-preserving manner [15, 24, 63], or releasing synthetic graphs that preserve properties of the original graph while protecting users' private information [16, 42, 51, 54, 59]. Among these methods, those based on differential privacy (DP) [13] have gained enormous popularity due to the strong privacy guarantees of DP and the fact that it is a semantic privacy notion, focusing on the data processing algorithms rather than specific datasets or types of adversary knowledge. According to the type of published data, we can divide differentially private mechanisms for social graphs into two classes. The methods in the first class directly release specific statistics of the social graph, e.g. the degree sequence [15, 24] or the number of specific subgraphs (triangles, stars, etc.) [63]. The methods in the second class focus on publishing synthetic social graphs as a replacement of real social networks in a two-step process [42, 51, 54, 59]. In the first step, differentially private methods are used to compute the parameters of a generative model that accurately captures the original graph properties. Then, in the second step, synthetic graphs are sampled from the private model. DP requires a privacy budget to be defined in advance, which determines the amount of perturbation that will be applied to the outputs of algorithms. Consequently, methods in the first class need to either limit in advance the number of queries to answer or deliver increasingly lower quality answers. By contrast, methods in the second class can devote the entire privacy budget to the model parameter estimation, without further degradation of the sampled graphs. For this reason, in this paper we focus on the second type of methods.

For analysts, the utility of synthetic graphs is determined by the ability of the graph models to capture relevant properties of the original graph. To satisfy this need, several graph models have been proposed to accurately capture global structural properties such as degree distributions and clustering coefficients, as well as heterogeneous user attributes such as gender, education or marital status. However, the best differentially private models proposed so far fail to capture the community structure of the original graph. Informally, a community is a set of users who are much more interrelated among themselves than to other users of the network. Examples of communities are a group of Gmail users who frequently e-mail each other, or a group of employees of the same company. The emergence of communities has been shown to be an inherent property of social networks [49, 61]. Analysts would tremendously benefit from the availability of synthetic attributed graphs that preserve the community structure of the original graph. For example, they may be able to improve online shopping recommendations based on the common purchases of users belonging to the same community. Current models and methods are insufficient for enabling such an analysis, as they either lack information about the community structure or they lack vertex features.

In this paper, we address the problem discussed in the previous paragraph by introducing, for the first time, a generative graph model that simultaneously preserves global network properties, user attributes and community structure. Our model is called CAGM (Community-Preserving Attributed Graph Model). It is equipped with efficient parameter estimation and graph sampling methods. Furthermore, we provide differentially private variants of CAGM's parameter estimation methods, which allow us to release synthetic attributed social graphs with a strong privacy guarantee and increased utility with respect to preceding approaches.

Summary of contributions:

– We introduce CAGM, the first generative attributed graph model that simultaneously captures a number of properties of the community structure, along with user attributes and structural properties.

– We present efficient methods for learning an instance of our model from an input graph and sampling community-preserving synthetic attributed graphs from this instance. We show, via a number of experiments on real-world social networks, that the community structures of synthetic graphs sampled from our model are more similar to those of the original graphs than those of the graphs sampled from previous models. Additionally, we show that this behaviour is obtained without sacrificing the ability to preserve global structural features.

– We devise differentially private methods for computing the parameters of the new model. We empirically demonstrate that differentially private synthetic attributed graphs generated by our model suffer a reasonably low degradation with respect to their non-private counterparts, in terms of their ability to capture the community structure and structural features of the original graphs.

2 Related Work

Private graph synthesis. The key to synthesising social graphs is the model, which determines both the information embedded in the published graphs and the properties preserved. Mir et al. [42] used the Kronecker graph generative model [30] to generate differentially private graphs. As the Kronecker model cannot accurately capture structural properties, Sala et al. [51] proposed an alternative approach which makes use of the dK-graph model, which is based on counting the occurrences of specific subgraphs with K vertices (e.g. length-K paths). Wang et al. [54] further improved the work of Sala et al. by considering global sensitivity instead of local sensitivity (refer to Sect. 3 for the definition of sensitivity). Xiao et al. [59] used the hierarchical random graph (HRG) model [12] and found that it can further reduce the amount of added noise and thus increase the accuracy. Recently, Zhang et al. [64] proposed an edge plausibility measure to help decide which edges seem more natural when added to a graph as a part of an anonymisation method. They applied this measure as an extension to Sala et al.'s method, in particular to its implementation included in the SecGraph library [19]. Zhang et al. pointed out that this implementation mostly adds fake edges to the original graph, rather than synthesising a new graph from scratch.

The approaches described so far work on unlabelled graphs. Pfeiffer et al. [18] introduced the attributed graph model (AGM), which attaches binary attributes to nodes and captures the correlations between shared attributes and the existence of connections. Jorgensen et al. [20] adopted this model and proposed differentially private methods to accurately estimate the model parameters. They also designed a new graph generation method based on the transitive Chung-Lu (TCL) model [17], which enables the model to sample attributed graphs preserving the clustering coefficient. As discussed previously, CAGM, the model introduced in this paper, is comparable to Jorgensen et al.'s model in preserving global structural properties of the original graph, but it outperforms it by also capturing the community structure.

Private statistics publishing. Degree sequences and degree correlations are two types of statistics frequently studied in the literature. The general trend in publishing these statistics under DP consists in adding noise to the original sequences and then post-processing the perturbed sequences to enforce or restore certain properties, such as graphicality [24], vertex order in terms of degrees [15], etc. Subgraph count queries, e.g. the number of triangles or k-stars, have also received considerable attention. Among the approaches to accurately compute such queries, we have ladder functions [63] and smooth sensitivity [23, 55].

The aforementioned approaches have focused on unlabelled (non-attributed) graphs. The task of publishing graph statistics from attributed social networks was addressed in [32]. This work focuses on studying the effect of dependencies between tuples on differentially private methods for computing the degree sequence of a graph. They consider node attributes to be public information, and use the overlap between node attributes as a model of dependencies between nodes. Then, they apply the stronger notion of dependent differential privacy (DDP), also introduced in the paper, to compute degree sequences in the presence of an adversary who knows the true values of node attributes. Our work differs from the one presented in [32] in the fact that we do not treat node features as public information, but as sensitive information, and apply DP for estimating attribute distributions. Regarding the applicability of DDP in our approach, a case where it may be applicable is that of an adversary who can extract tuple dependencies from additional public knowledge, e.g. location traces which are obtainable from check-ins in social network posts. While being an interesting direction for future work, such an extension goes beyond the scope of this work, since it would require new DDP mechanisms for estimating all the parameters of our model under every possible scenario in terms of adversary knowledge.

Community-preserving graph generation models. A number of existing random graph models claim to capture community structure, e.g., block two-level Erdős-Rényi (BTER) [26], ILFR [49], stochastic block model (SBM) [56] and its variants (e.g., DCSBM [22] and DCPPM [45]), attributed networks with communities generator (ANCG) [29] and DANCer [28]. BTER generates community-preserving social graphs given expected node degrees and, for every degree value σ, the average of the clustering coefficients of the nodes of degree σ. The model assumes that every community is a set of σ nodes with degree σ. On the contrary, CAGM makes no assumptions on the community partition received. Moreover, ILFR and the variants of SBM preserve edge densities at the community level but, unlike our new model, they do not preserve the clustering coefficients of the original graph. Finally, ANCG and DANCer aim to synthesise graphs satisfying some known behaviours of complex networks such as small world, preferential attachment and homophily, extending the behaviour of the classic Barabási-Albert model [2]. Although these models take node attributes and community structure into account, they only provide generation methods given user-defined parameters. Since their goals do not include learning model parameters from existing graphs, they do not provide estimation methods for model parameters and, consequently, privacy preservation mechanisms are not necessary for their application scenario.

Community-enhanced de-anonymisation and community-preserving anonymisation. A number of works on user de-anonymisation have shown that knowledge about communities can be exploited by an adversary to improve the re-identification of pseudonymised users [14, 47, 57]. The anonymisation scenario in which these attacks are applied differs fundamentally from the one described in this paper. Our approach views node attributes and edges as private, but considers vertex ids to be public. As a consequence, synthetic graphs generated by our mechanism share the same vertex set of the original graph. Our purpose is not to hide users' identities. Therefore, in this scenario, de-anonymisation is not a threat. Previous works reported in the literature additionally support our rationale that the goals of graph anonymisation and community structure preservation are not necessarily in contradiction. For example, edge editing operations that better preserve community structures of the original graph during anonymisation are studied in [6], whereas a community-preserving anonymisation method is proposed in [50]. A similar behaviour is shown by synthetic graphs generated by our approach, when node ids are pseudonymised. Interested readers can refer to App. E, where we empirically show that pseudonymised versions of our synthetic graphs also resist the strongest community-enhanced de-anonymisation attack proposed in the literature [47].

3 Preliminaries

3.1 Notations

An attributed graph is represented as a triple $G = (V, E, X)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes, $E \subseteq V \times V$ is the set of edges, and $X$ is a binary matrix called the attribute matrix. The $i$-th row of $X$ is the attribute vector of $v_i$, which is individually denoted by $\tau(v_i)$. Every column of $X$ represents a binary feature, which is set to 1 (true), or 0 (false), for each user. For example, if the $j$-th column represents the attribute "owning a car", $X_{ij} = 1$ means that the user represented by $v_i$ owns a car. Non-binary real-life attributes are assumed to be binarised. The order of the columns of $X$ is fixed, but arbitrary, and has no impact on the results described hereafter. Throughout the paper, we deal with undirected graphs. That is, if $(v_i, v_j) \in E$, then $(v_j, v_i) \in E$. Additionally, we use $A$ to denote the adjacency matrix of the graph.

We use $\mathcal{C} = \{C_0, C_1, \ldots, C_p\}$, with $C_i \subseteq V$ for every $i \in \{0, 1, \ldots, p\}$, to represent a community partition of the attributed graph. As the term suggests, in this paper we assume that $C_i \cap C_j = \emptyset$, with $0 \le i < j \le p$, and $\bigcup_{C_i \in \mathcal{C}} C_i = V$. The community $C_0$ has a special interpretation. Since some community detection algorithms assign no community to some vertices, we will use $C_0$ as a "discard" community of unassigned vertices. We do so to avoid having a potentially large number of singleton communities, for which no meaningful co-affiliation statistics can be computed. We use $\psi_{\mathcal{C}}(v_i)$ to denote the community to which the node $v_i$ belongs in the community partition $\mathcal{C}$. If $\mathcal{C}$ is clear from the context, we will simply use $\psi(v_i)$. Table 1 summarises the most important notation used throughout the paper.
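To make the notation concrete, the following is a minimal sketch of how an attributed graph with a community partition could be held in memory. The class name `AttributedGraph` and its fields are illustrative assumptions, not part of the paper; vertices are assumed to be the integers 0 to n−1 and community index 0 is assumed to play the role of the "discard" community.

```python
from dataclasses import dataclass


@dataclass
class AttributedGraph:
    """Undirected attributed graph G = (V, E, X) with a community partition (hypothetical layout)."""
    n: int        # vertices are 0 .. n-1
    edges: set    # set of frozenset({i, j}) pairs, so (i, j) and (j, i) coincide
    X: list       # n rows, each a list of k binary attribute values
    psi: list     # psi[i] = community index of vertex i; 0 stands for the discard community

    def tau(self, i):
        """Attribute vector of vertex i, i.e. the i-th row of X."""
        return self.X[i]

    def adjacency(self):
        """Symmetric 0/1 adjacency matrix A as a list of lists."""
        A = [[0] * self.n for _ in range(self.n)]
        for e in self.edges:
            i, j = tuple(e)
            A[i][j] = A[j][i] = 1
        return A

    def communities(self):
        """Return the partition as {community index: set of vertices}."""
        parts = {}
        for v, c in enumerate(self.psi):
            parts.setdefault(c, set()).add(v)
        return parts
```

Storing edges as unordered pairs enforces the undirected assumption by construction, and deriving the partition from `psi` guarantees that communities are disjoint and cover the vertex set.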

3.2 Differential Privacy

Differential privacy [13] is a well-studied statistical notion of privacy. The intuition behind it is to randomise the output of an algorithm in such a way that the presence of any individual element in the input dataset has a negligible impact on the probability of observing any particular output. In other words, a mechanism is ε-differentially private if, for any pair of neighbouring datasets, i.e. datasets that differ by only one element, the probabilities of obtaining any output are measurably similar. The amount of similarity is determined by the parameter ε, which is commonly called the privacy budget. In what follows, we will use the notation $\mathcal{D}$ for the set of possible datasets, $\mathcal{O}$ for the set of possible outputs, and $D \sim D'$ for a pair of neighbouring datasets.

Table 1. Important notations.

| Notation | Description |
|---|---|
| $G = (V, E, X)$ | An attributed graph |
| $V$ | Set of nodes of $G$ |
| $E$ | Set of edges of $G$ |
| $X$ | Attribute matrix of $G$ |
| $\tau(v)$ | Attribute vector of $v \in V$ |
| $\mathcal{C}$ | Community partition of $G$ |
| $\psi_{\mathcal{C}}(v)$ | Community to which $v$ belongs in $\mathcal{C}$ |
| $\beta(\tau(v), \tau(w))$ | Descriptor of the edge $(v, w)$ in terms of $\tau(v)$ and $\tau(w)$ |
| $\mathcal{M} = \langle V, \mathcal{C}, \Theta^c_M, \Theta^c_X, \Theta^c_F \rangle$ | An instance of CAGM |
| $\Theta^c_M$ | Edge generation model |
| $\Theta^c_X$ | Attribute vector generation model |
| $\Theta^c_F$ | Attribute-edge correlations generation model |
| $d_{intra}(v)$ | Intra-community degree of $v$ |
| $d_{inter}(v)$ | Inter-community degree of $v$ |
| $n^{intra}_{\triangle}$ | Number of intra-community triangles |
| $n^{inter}_{\triangle}$ | Number of inter-community triangles |
| $n^{\ell}_C$ | Number of nodes in $C \in \mathcal{C}$ whose $\ell$-th attribute has value 1 |
| $r^u_C$ | Number of intra-community edges $(v, w)$ in $C \in \mathcal{C}$ such that $\beta(\tau(v), \tau(w)) = u$ |
| $r^u_{inter}$ | Number of inter-community edges $(v, w)$ such that $\beta(\tau(v), \tau(w)) = u$ |

Definition 1 (ε-differential privacy [13]). A randomised mechanism $M : \mathcal{D} \to \mathcal{O}$ satisfies ε-differential privacy if for every pair of neighbouring datasets $D, D' \in \mathcal{D}$, $D \sim D'$, and for every $S \subseteq \mathcal{O}$, we have $\Pr(M(D) \in S) \le e^{\varepsilon} \Pr(M(D') \in S)$.

A number of differentially private mechanisms have been proposed. For queries of the form $q : \mathcal{D} \to \mathbb{R}^n$, the most widely used mechanism to enforce differential privacy is the Laplace mechanism, which consists in obtaining the (non-private) output of $q$ and adding to every component a carefully chosen amount of random noise, which is drawn from the Laplace distribution

$$\mathrm{Lap}(\lambda) : f(y \mid \lambda) = \frac{1}{2\lambda} \exp\left(-\frac{|y|}{\lambda}\right),$$

where $y$ is a real-valued variable indicating the noise to be added, $\lambda = \frac{\Delta q}{\varepsilon}$ and $\Delta q$ is a property of the original function $q$ called global sensitivity. This property is defined as the largest difference between the outputs of $q$ for any pair of neighbouring datasets, that is $\Delta q = \max_{D \sim D'} \|q(D) - q(D')\|_1$, where $\|\cdot\|_1$ is the $L_1$ norm. For categorical queries of the form $q : \mathcal{D} \to \mathcal{O}$, where $\mathcal{O}$ is a finite set of categories, the exponential mechanism [41] is the most commonly used. In this case, for each value $o \in \mathcal{O}$, a score is assigned by a function (usually called scoring function) quantifying the value's utility, denoted by $u(o, D)$. The global sensitivity of $u$ is $\Delta u = \max_{o \in \mathcal{O}, D \sim D'} |u(o, D) - u(o, D')|$, and the randomised output is drawn with probability proportional to $\exp\left(\frac{\varepsilon \cdot u(o, D)}{2 \Delta u}\right)$. Differentially private methods are composable [40], and deterministic post-processing of the output of an ε-differentially private algorithm also satisfies ε-differential privacy [25]. These properties allow us to divide a complex computation, such as that of the set of model parameters, into a sequence of sub-tasks for which differentially private methods exist or can be more easily developed.
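The two mechanisms above can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not the implementation used in the paper; the function names and the `rng` parameter are assumptions introduced here. Laplace noise is drawn as the difference of two exponential variables with mean λ, which has exactly the density of Lap(λ).

```python
import math
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from Lap(scale), with density exp(-|y| / scale) / (2 * scale)."""
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Add Lap(sensitivity / epsilon) noise to every component of a numeric answer."""
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale, rng) for a in true_answer]


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Pick a candidate with probability proportional to exp(eps * u / (2 * sensitivity))."""
    scores = [utility(o) for o in candidates]
    top = max(scores)  # shifting by the maximum avoids overflow and keeps the ratios unchanged
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

For instance, `laplace_mechanism([120.0], sensitivity=1.0, epsilon=0.5)` would perturb a single count with noise of scale 2. A smaller ε yields a larger scale and therefore a noisier answer.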

4 The CAGM Model

In this section we give the formal definition of CAGM. We introduce the methods for sampling synthetic graphs from the model, and describe the methods for learning the model parameters from an attributed graph.

4.1 Overview

Alg. 1 summarises the process by which CAGM is used for publishing synthetic attributed graphs. A thorough description of the parameters of CAGM is given in Sect. 4.2, and parameter estimation is discussed in Sect. 4.4. Once the model parameters have been estimated, we can sample any number of synthetic attributed graphs from the model, as described in steps 3 to 7 of Alg. 1. Thus, the synthetic graphs generated by Alg. 1 have the same vertex set as the original graph, whereas the attribute matrix and the edge set are sampled from the model (step 6). For every new synthetic attributed graph, we first sample the attribute matrix, and then this matrix is used, in combination with an edge generation model (Sect. 4.3.1), to generate the edge set of the synthetic graph. There are two reasons for dividing this process into two steps. The first one is to make the sampling process efficient. The second reason is to profit from the two-step process to enforce the intuition that users with features of certain patterns are more likely to be connected in the social network. The attributed graph sampling procedure is discussed in detail in Sect. 4.3. To conclude, note that if steps 1 and 2 are executed in a differentially private manner, the synthetic graphs obtained by steps 3 to 7 also satisfy DP, because the generation is done as a post-processing step of the differentially private computation.

Algorithm 1: Given G = (V, E, X), obtain t attributed synthetic graphs.
1 Obtain community partition;
2 Estimate CAGM parameters;
3 for i ∈ {1, 2, ..., t} do
4     Sample X_i from CAGM;
5     Sample E_i from CAGM;
6     G_i ← (V, E_i, X_i)

4.2 Model Parameters

As we discussed in Sect. 1, given an attributed graph $G$ and a community partition $\mathcal{C}$ of $G$, the purpose of CAGM is to capture a number of properties of $\mathcal{C}$ that are overlooked by previously defined models, without sacrificing the ability to capture global structural properties such as degree distributions and clustering coefficients. To that end, CAGM models the following properties of the community partition:

1. the number and sizes of communities;
2. the number of intra-community edges in every community;
3. the number of inter-community edges;
4. the distributions of attribute vectors in every community;
5. the distributions of the so-called attribute-edge correlations [20], for the set of inter-community edges and for the set of intra-community edges in every community.

Graphs generated by CAGM will have the same number of vertices as the original graph, as well as the same number of communities. Moreover, every community will have the same cardinality as in the original graph, and the same number of intra-community edges. The number of inter-community edges of the generated graph will also be the same as that of the original graph. Notice that the model preserves the total number, but not necessarily the pairwise numbers of inter-community edges for every pair of communities.

Attribute-edge correlations were defined in [20] as heuristic values for characterising the relation between the feature vectors labelling a pair of vertices and the likelihood that these vertices are connected. They encode the intuition that, for example, co-workers who attended the same university and live near each other are more likely to be friends than persons with fewer features in common, whereas friends are more likely to support the same sports teams or go to the same bars than unrelated persons. In our model, we compute one distribution of attribute-edge correlations for each community, as well as an inter-community distribution.

A key element in the representation of attribute-edge correlations is the notion of aggregator functions. An aggregator function $\beta : \{0, 1\}^k \times \{0, 1\}^k \to B$ maps a pair of attribute vectors $x, x'$ of dimensionality $k$ into a value in a discrete range $B$, which is used as a descriptor, also called aggregated feature, of the pair $(x, x')$. For example, $B$ can contain a set of similarity levels for pairs of feature vectors, such as {low, medium, high}, and $\beta$ can map a pair of vectors whose cosine similarity is in the interval [0, 0.33] to low, a pair of vectors whose cosine similarity is in the interval [0.67, 1] to high, etc. Attribute-edge correlations, along with the community-wise distributions of attribute vectors, are useful for analysts, as they make it possible to characterise the members of a community in terms of frequently shared features, hypothesise explanations for the emergence of a community, etc.
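As an illustration, the similarity-level aggregator from the example above could be written as follows. This is a sketch under two assumptions that the text does not state: similarities strictly between 0.33 and 0.67 map to medium, and a pair involving an all-zero vector is given similarity 0.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two binary attribute vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar to everything
    return dot / (norm_x * norm_y)


def beta(x, y):
    """Aggregator mapping a pair of attribute vectors to a level in B = {low, medium, high}."""
    s = cosine_similarity(x, y)
    if s <= 0.33:
        return "low"
    if s >= 0.67:
        return "high"
    return "medium"  # assumption: the remaining interval (0.33, 0.67)
```

Because the range B is small and discrete, the distribution of descriptors over edges can be summarised with only |B| counts per community, which keeps the number of values to estimate low.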

Formally, an instance of the CAGM model is defined as a quintuple $\mathcal{M} = \langle V, \mathcal{C}, \Theta^c_M, \Theta^c_X, \Theta^c_F \rangle$, where:

– $V$ is a set of vertices.

– $\mathcal{C}$ is a community partition of $V$.

– $\Theta^c_M$ is an instance of an edge set generative model that preserves properties 1 to 3 of the community partition $\mathcal{C}$, as well as degree distributions and clustering coefficients.

– $\Theta^c_X$ is an instance of an attribute vector generation model, which aims to preserve property 4 of the community partition. The model defines, for every attribute vector $x$, every $C \in \mathcal{C}$ and every $v \in C$, the probability $\Pr(\tau(v) = x \mid C, \Theta^c_X)$ that a vertex in $C$ is labelled with $x$.

– $\Theta^c_F$ is an instance of a generative model for attribute-edge correlations, which aims to preserve property 5 of the community partition. This model defines:
  – The discrete range $B$ and aggregator function $\beta$.
  – The probability
    $$\Pr(\beta(\tau(v_i), \tau(v_j)) = u \mid \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C, A_{i,j} = 1, \Theta^c_F)$$
    for every community $C \in \mathcal{C}$ and every $u \in B$.
  – The probability
    $$\Pr(\beta(\tau(v_i), \tau(v_j)) = u \mid \psi_{\mathcal{C}}(v_i) \ne \psi_{\mathcal{C}}(v_j), A_{i,j} = 1, \Theta^c_F)$$
    for every $u \in B$.
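One natural, non-private way to obtain the two families of probabilities in the last component is to normalise the edge counts per descriptor listed in Table 1 (the counts of intra-community edges per community and of inter-community edges). The sketch below does exactly that. It is an assumption made for illustration and not necessarily the estimator of Sect. 4.4, and it applies no differential privacy. The function name and argument layout are hypothetical.

```python
from collections import Counter, defaultdict


def edge_descriptor_distributions(edges, tau, psi, beta):
    """Empirical distributions of beta over intra-community edges (per community)
    and over inter-community edges.

    edges: iterable of (v, w) pairs; tau[v]: attribute vector; psi[v]: community of v.
    """
    intra_counts = defaultdict(Counter)  # community -> descriptor counts
    inter_counts = Counter()             # descriptor counts over inter-community edges
    for v, w in edges:
        u = beta(tau[v], tau[w])
        if psi[v] == psi[w]:
            intra_counts[psi[v]][u] += 1
        else:
            inter_counts[u] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {u: c / total for u, c in counter.items()} if total else {}

    intra = {c: normalise(cnt) for c, cnt in intra_counts.items()}
    return intra, normalise(inter_counts)
```

Each returned dictionary sums to 1 over the descriptors observed on the corresponding set of edges; descriptors never observed are simply absent.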

4.3 Sampling Attributed Graphs from an Instance of CAGM

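Given an instance $\mathcal{M} = \langle V, \mathcal{C}, \Theta^c_M, \Theta^c_X, \Theta^c_F \rangle$ of CAGM, with $V = \{v_1, v_2, \ldots, v_n\}$, an attributed graph $G = (V, E, X)$ is sampled from $\mathcal{M}$ with probability $\Pr(G \mid \mathcal{M}) = \Pr(E, X \mid \mathcal{M})$ which, for tractability, is approximated as

$$\Pr(E, X \mid \mathcal{C}, \Theta^c_M, \Theta^c_F, \Theta^c_X) = \Pr(E \mid X, \mathcal{C}, \Theta^c_M, \Theta^c_F) \cdot \Pr(X \mid \mathcal{C}, \Theta^c_X).$$

That is, we first sample from $\Theta^c_X$ the attribute vectors labelling each vertex and then use them in sampling the edge set. Again, for tractability, we make

$$\Pr(X \mid \mathcal{C}, \Theta^c_X) = \prod_{v_i \in V} \Pr(\tau(v_i) = x_i \mid \psi_{\mathcal{C}}(v_i), \Theta^c_X),$$

where $x_i$ is the $i$-th row of $X$. Probabilities of the form $\Pr(\tau(v_i) = x_i \mid \psi_{\mathcal{C}}(v_i), \Theta^c_X)$ are estimated from the input graph, as will be described in Sect. 4.4. Then, under the assumption that edges are sampled independently, we approximate $\Pr(E \mid X, \mathcal{C}, \Theta^c_M, \Theta^c_F)$ in terms of attribute-edge correlations as follows:

$$\Pr(E \mid X, \mathcal{C}, \Theta^c_M, \Theta^c_F) = \prod_{v_i, v_j \in V} \Pr(A_{i,j} = 1 \mid \beta(\tau(v_i), \tau(v_j)), \mathcal{C}, \Theta^c_M, \Theta^c_F).$$

An efficient two-step procedure was presented in [18] for sampling edges from the distribution $\Pr(A_{i,j} = 1 \mid \beta(\tau(v_i), \tau(v_j)), \Theta^c_M, \Theta^c_F)$. Here, we adapt this procedure for sampling edges from the distribution $\Pr(A_{i,j} = 1 \mid \Theta^c_M, \Theta^c_F, \beta(\tau(v_i), \tau(v_j)), \mathcal{C})$. In the first step, a candidate edge $(v_i, v_j)$ is sampled from the edge generation model $\Theta^c_M$ (which does not account for node features) with probability

$$Q'_M(i, j) = \frac{\Pr(A_{i,j} = 1 \mid \mathcal{C}, \Theta^c_M)}{\sum_{v_p, v_q \in V} \Pr(A_{p,q} = 1 \mid \mathcal{C}, \Theta^c_M)},$$

where the probabilities of the form $\Pr(A_{i,j} = 1 \mid \mathcal{C}, \Theta^c_M)$ are estimated from the input graph. In the second step, the candidate edge is accepted as an edge of $G$ with probability $\Gamma(\beta(\tau(v_i), \tau(v_j)), \mathcal{C})$, which is computed in two different manners depending on the community co-affiliation of $v_i$ and $v_j$. If $\psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C$, we first compute

$$R_{intra}(\beta(\tau(v_i), \tau(v_j)), C) = \frac{\Pr(\beta(\tau(v_i), \tau(v_j)) \mid \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C, A_{i,j} = 1, \Theta^c_F)}{\Pr_M(\beta(\tau(v_i), \tau(v_j)) \mid \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C, A_{i,j} = 1, \Theta^c_M, \Theta^c_F)},$$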

which represents the ratio between the probability that an edge joining two nodes in community $C$ of the input graph is described by $\beta(\tau(v_i), \tau(v_j))$ and the probability that an edge generated by $\Theta^c_M$, and joining two nodes in $C$, is described by $\beta(\tau(v_i), \tau(v_j))$. Then, we make

$$\Gamma(\beta(\tau(v_i), \tau(v_j)), \mathcal{C}) = \Gamma_{intra}(\beta(\tau(v_i), \tau(v_j)), C) = \frac{R_{intra}(\beta(\tau(v_i), \tau(v_j)), C)}{Sup_R}.$$

On the contrary, if $\psi_{\mathcal{C}}(v_i) \ne \psi_{\mathcal{C}}(v_j)$, we first compute

$$R_{inter}(\beta(\tau(v_i), \tau(v_j))) = \frac{\Pr(\beta(\tau(v_i), \tau(v_j)) \mid \psi_{\mathcal{C}}(v_i) \ne \psi_{\mathcal{C}}(v_j), A_{i,j} = 1, \Theta^c_F)}{\Pr_M(\beta(\tau(v_i), \tau(v_j)) \mid \psi_{\mathcal{C}}(v_i) \ne \psi_{\mathcal{C}}(v_j), A_{i,j} = 1, \Theta^c_M, \Theta^c_F)},$$

which represents the ratio between the probability that an edge joining two nodes from different communities in the input graph is described by $\beta(\tau(v_i), \tau(v_j))$ and the probability that an edge generated by $\Theta^c_M$, and joining two nodes from different communities, is described by $\beta(\tau(v_i), \tau(v_j))$. Then, we make

$$\Gamma(\beta(\tau(v_i), \tau(v_j)), \mathcal{C}) = \Gamma_{inter}(\beta(\tau(v_i), \tau(v_j))) = \frac{R_{inter}(\beta(\tau(v_i), \tau(v_j)))}{Sup_R}.$$

In computing the values of both $\Gamma_{intra}$ and $\Gamma_{inter}$,

$$Sup_R = \sup \bigcup_{s \in B, C \in \mathcal{C}} \left( R_{intra}(s, C) \cup R_{inter}(s) \right).$$

Alg. 2 describes the procedure to sample an attributed graph from CAGM.

Algorithm 2: SampleFromCAGM(V, 𝒞, Θ^c_M, Θ^c_X, Θ^c_F)
X′ ← SampleAttributeVectors(Θ^c_X);
Q′_M ← ComputeQM(Θ^c_M, 𝒞);
E′ ← SampleEdgeSet(Q′_M);
for s ∈ B do
    Compute Γ_inter(s);
    for C ∈ 𝒞 do
        Compute Γ_intra(s, C)
E′ ← ∅;
while |E′| < |E| do
    (v, w) ← SampleEdge(Q′_M);
    s ← β(τ(v), τ(w));
    u ← Uniform(0, 1);
    if (ψ_𝒞(v) = ψ_𝒞(w) ∧ u ≤ Γ_intra(s, ψ_𝒞(v))) or (ψ_𝒞(v) ≠ ψ_𝒞(w) ∧ u ≤ Γ_inter(s)) then
        E′ ← E′ ∪ {(v, w)};
return X′, E′;
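The accept-reject loop at the core of Alg. 2 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the candidate sampler is passed in as a callable standing in for sampling from the candidate distribution, the ratios are assumed to be given as dictionaries, and `max_tries` is a safety guard added here that Alg. 2 does not have.

```python
import random


def acceptance_probabilities(r_intra, r_inter):
    """Turn the ratios R_intra(s, C) and R_inter(s) into acceptance probabilities
    by dividing by their common supremum (the maximum, since B and the partition are finite)."""
    sup_r = max(list(r_intra.values()) + list(r_inter.values()))
    gamma_intra = {key: r / sup_r for key, r in r_intra.items()}
    gamma_inter = {s: r / sup_r for s, r in r_inter.items()}
    return gamma_intra, gamma_inter


def sample_edge_set(n_edges, sample_candidate, tau, psi, beta,
                    gamma_intra, gamma_inter, rng=random, max_tries=100_000):
    """Accept-reject sampling of n_edges distinct undirected edges.

    sample_candidate(rng) returns a candidate pair (v, w) drawn from the edge model;
    gamma_intra is keyed by (descriptor, community), gamma_inter by descriptor.
    """
    edges = set()
    tries = 0
    while len(edges) < n_edges and tries < max_tries:
        tries += 1
        v, w = sample_candidate(rng)
        if v == w:
            continue  # assumption: self-loops are never accepted
        s = beta(tau[v], tau[w])
        if psi[v] == psi[w]:
            accept = gamma_intra[(s, psi[v])]
        else:
            accept = gamma_inter[s]
        if rng.random() <= accept:
            edges.add(frozenset((v, w)))
    return edges
```

Dividing every ratio by the same supremum keeps all acceptance probabilities in [0, 1] while preserving their relative sizes, so descriptors that are under-represented by the edge model are accepted more often than over-represented ones.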

4.3.1 Edge generation model

As we discussed in Sect. 4.2, the component $\Theta^c_M$ of CAGM is an edge generation model which preserves several properties of the community partition of the original graph (properties 1 to 3 listed in Sect. 4.2), in addition to the degree distribution and clustering coefficients. We call this model Community-preserving Graph Model (CPGM). It is an extension of the TriCycle model, introduced in [20], that takes community structure into account. CPGM takes as input the set of vertices, as well as the expected number of neighbours of every vertex $v$ within its community (that is, its intra-community degree, denoted by $d_{intra}(v)$) and the expected number of neighbours outside its community (that is, the inter-community degree, denoted by $d_{inter}(v)$). These values are used to enforce the expected densities within every community and between communities. Additionally, the model also requires the number of triangles having all vertices in one community (which we call intra-community triangles and denote by $n^{intra}_{\triangle}$), as well as the number of triangles spanning more than one community (inter-community triangles, denoted by $n^{inter}_{\triangle}$). It was shown in [20] that synthetic graphs that preserve the number of triangles of the original graph are more likely to approximate its clustering coefficient. Moreover, both $n^{intra}_{\triangle}$ and $n^{inter}_{\triangle}$ can be efficiently and accurately computed under DP. In CPGM, the edge generation process consists of two steps. The first step produces a graph that preserves the intra- and inter-community degrees, but not the number of intra- and inter-community triangles. Then, the second step iteratively edits this initial edge set until $n^{intra}_{\triangle}$ and $n^{inter}_{\triangle}$ are approximately enforced, within a tolerance threshold. The second step is based on the observation presented in [17] that the clustering behaviour in social networks stems from the higher likelihood of users with common friends to connect to each other, thus creating triangles.
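For reference, the two triangle counts required by CPGM can be computed exactly (without any privacy protection) as in the sketch below. The function name and the adjacency-dictionary input are assumptions made here, and vertices are assumed to be mutually comparable, e.g. integers.

```python
from itertools import combinations


def count_triangles(adj, psi):
    """Return (n_intra, n_inter): triangles inside one community and triangles
    spanning more than one community.

    adj: dict vertex -> set of neighbours (symmetric); psi: dict vertex -> community.
    """
    n_intra = n_inter = 0
    for v in adj:
        # Only look at neighbours larger than v, so each triangle is counted once,
        # at its smallest vertex.
        higher = sorted(x for x in adj[v] if x > v)
        for u, w in combinations(higher, 2):
            if w in adj[u]:
                if psi[u] == psi[v] == psi[w]:
                    n_intra += 1
                else:
                    n_inter += 1
    return n_intra, n_inter
```

A differentially private release would additionally perturb these counts; how that is done is outside the scope of this sketch.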

At the first step, we follow the idea of the CL model [11]. For every pair of vertices $v$ and $w$ satisfying $\psi_{\mathcal{C}}(v) = \psi_{\mathcal{C}}(w) = C$, the intra-community edge $(v, w)$ is added with probability $\pi^{intra}_C(v, w) = \frac{d_{intra}(v)\, d_{intra}(w)}{2 m^{intra}_C}$, where $m^{intra}_C$ is the original number of intra-community edges in $C$. That is, intra-community edges are added with a probability proportional to the product of the intra-community degrees of the linked vertices. If $\psi_{\mathcal{C}}(v) \neq \psi_{\mathcal{C}}(w)$, then the inter-community edge $(v, w)$ is added with probability $\pi^{inter}(v, w) = \frac{d_{inter}(v)\, d_{inter}(w)}{2 m^{inter}}$, where $m^{inter}$ is the total number of inter-community edges in the original graph. Alg. 3 describes the first step of the generation process. The second step of the generation process consists in iteratively applying edge swaps until the numbers of intra- and inter-community triangles are approximately enforced, within a 98% tolerance window. Every edge swap


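Algorithm 3: GenInitialEdgeSet($d_{intra}$, $d_{inter}$, $\mathcal{C}$)
  $E \leftarrow \emptyset$;
  for $C \in \mathcal{C}$ do
    $m^{intra}_C \leftarrow \frac{1}{2} \sum_{v \in C} d_{intra}(v)$;
    $m \leftarrow 0$;
    while $m \le m^{intra}_C$ do
      $(v, w) \leftarrow$ Sample($\pi^{intra}_C$);
      if $(v, w) \notin E$ then
        $E \leftarrow E \cup \{(v, w)\}$;
        $m \leftarrow m + 1$;
  $m^{inter} \leftarrow \frac{1}{2} \sum_{v \in V} d_{inter}(v)$;
  while $m \le \sum_{C \in \mathcal{C}} m^{intra}_C + m^{inter}$ do
    $(v, w) \leftarrow$ Sample($\pi^{inter}$);
    if $(v, w) \notin E$ then
      $E \leftarrow E \cup \{(v, w)\}$;
      $m \leftarrow m + 1$;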

consists in first adding a new edge and then removing another. If the added edge is an intra-community edge, then the oldest intra-community edge (in terms of the order of creation by Alg. 3) is removed, even if it does not belong to the same community. Likewise, if the added edge is an inter-community edge, then the oldest inter-community edge is removed. In the method, all intra-community edge swaps are executed before inter-community edge swaps. The reason for this is that adding or removing an intra-community edge may change the number of inter-community triangles as well, whereas inter-community triangles can be created without modifying the number of intra-community triangles. Edge swaps resulting in a reduction in the number of intra- or inter-community triangles are discarded, in which case the edge selected for removal is set to be the youngest in the graph, so that it is not selected again for a sufficiently large number of subsequent iterations. To reduce the likelihood of discarding edge swaps, the edge selected for addition must be one that transforms at least one wedge (a path on three vertices whose endpoints are not adjacent) into an intra- or inter-community triangle, as required. The edge editing method is described in Alg. 4. In the algorithm, we denote by $N_{intra}(v)$ the set of neighbours of $v$ in its community, that is $N_{intra}(v) = \{w \mid \psi_{\mathcal{C}}(v) = \psi_{\mathcal{C}}(w) \land (v, w) \in E\}$. Likewise, we denote by $N_{inter}(v)$ the set of neighbours of $v$ in different communities, that is $N_{inter}(v) = \{w \mid \psi_{\mathcal{C}}(v) \neq \psi_{\mathcal{C}}(w) \land (v, w) \in E\}$.
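The first step of the generation process (Alg. 3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that Sample(π) draws both endpoints independently with probability proportional to their expected degree (the usual fast Chung-Lu sampling), and the function name, the attempt cap and the integer vertex identifiers are all hypothetical choices made for the example.

```python
import random

def gen_initial_edges(d_intra, d_inter, community, seed=0):
    """d_intra, d_inter: dict vertex -> expected degree; community: dict vertex -> label."""
    rng = random.Random(seed)
    edges = set()

    def sample_edges(vertices, weights, target, same_community):
        added, attempts = 0, 0
        max_attempts = 100 * target  # assumption: cap retries so infeasible inputs terminate
        while added < target and attempts < max_attempts:
            attempts += 1
            # Endpoints drawn proportionally to degree, so a pair is drawn
            # with probability proportional to the product of its degrees.
            v, w = rng.choices(vertices, weights=weights, k=2)
            if v == w or (community[v] == community[w]) != same_community:
                continue
            edge = (min(v, w), max(v, w))
            if edge not in edges:
                edges.add(edge)
                added += 1

    # Intra-community edges, one community at a time.
    for label in sorted(set(community.values())):
        members = [v for v in community if community[v] == label]
        weights = [d_intra[v] for v in members]
        m_intra = sum(weights) // 2
        if m_intra > 0:
            sample_edges(members, weights, m_intra, same_community=True)

    # Inter-community edges over the whole vertex set.
    all_vertices = list(community)
    weights = [d_inter[v] for v in all_vertices]
    m_inter = sum(weights) // 2
    if m_inter > 0:
        sample_edges(all_vertices, weights, m_inter, same_community=False)
    return edges
```

The sketch stops once the target number of edges has been added in each phase, and it rejects self-loops and duplicate edges, as the membership test in Alg. 3 does.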

Due to the removal of initially generated edges, the synthetic graph may become disconnected. In this case, we apply an edge-swapping post-processing step to reconnect every small connected component to the main component (the connected component with the most nodes). If the post-processing reduces the number of

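Algorithm 4: GetFinalEdgeSet($d_{intra}$, $d_{inter}$, $n^{intra}_{\triangle}$, $n^{inter}_{\triangle}$, $\mathcal{C}$)
  $\mu^{intra}_{\triangle} \leftarrow$ CountIntraCommTriangles($E$);
  while $\mu^{intra}_{\triangle} < n^{intra}_{\triangle}$ do
    Uniformly sample $C$ from $\mathcal{C}$;
    Sample $v_1$ from $C$ with probability $\frac{d_{intra}(v_1)}{2 m^{intra}_C}$;
    Uniformly sample $v_2$ from $N_{intra}(v_1)$;
    Uniformly sample $v_3$ from $N_{intra}(v_2)$;
    if $(v_1, v_3) \notin E \land v_3 \neq v_1$ then
      $(v'_1, v'_2) \leftarrow$ GetOldestIntraCommEdge($E$, $\mathcal{C}$);
      $n^{prev}_{cn} \leftarrow$ GetCommonNeighbour($v'_1$, $v'_2$);
      $E \leftarrow E \setminus \{(v'_1, v'_2)\}$;
      $n^{new}_{cn} \leftarrow$ GetCommonNeighbour($v_1$, $v_3$);
      if $n^{prev}_{cn} < n^{new}_{cn}$ then
        $E \leftarrow E \cup \{(v_1, v_3)\}$;
        $\mu^{intra}_{\triangle} \leftarrow \mu^{intra}_{\triangle} - n^{prev}_{cn} + n^{new}_{cn}$;
      else
        $E \leftarrow E \cup \{(v'_1, v'_2)\}$;
  $\mu^{inter}_{\triangle} \leftarrow$ CountInterCommTriangles($E$);
  while $\mu^{inter}_{\triangle} < n^{inter}_{\triangle}$ do
    Sample $v_1$ from $V$ with probability $\frac{d_{inter}(v_1)}{2 m^{inter}}$;
    Uniformly sample $v_2$ from $N_{inter}(v_1)$;
    Uniformly sample $v_3$ from $N_{intra}(v_2)$;
    $(v'_1, v'_2) \leftarrow$ GetOldestInterCommEdge($E$, $\mathcal{C}$);
    $n^{prev}_{cn} \leftarrow$ GetCommonNeighbour($v'_1$, $v'_2$);
    $E \leftarrow E \setminus \{(v'_1, v'_2)\}$;
    $n^{new}_{cn} \leftarrow$ GetCommonNeighbour($v_1$, $v_3$);
    if $n^{prev}_{cn} < n^{new}_{cn}$ then
      $E \leftarrow E \cup \{(v_1, v_3)\}$;
      $\mu^{inter}_{\triangle} \leftarrow \mu^{inter}_{\triangle} - n^{prev}_{cn} + n^{new}_{cn}$;
    else
      $E \leftarrow E \cup \{(v'_1, v'_2)\}$;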

triangles, we run Alg. 4 again. The alternation between the post-processing and Alg. 4 is not guaranteed to yield a graph having exactly the required number of triangles, so we stop the iteration when the total number of triangles in the synthetic graph is within a 98% tolerance window with respect to the original one. We also note that Alg. 4 is not guaranteed to enforce $n^{intra}_{\triangle}$ and $n^{inter}_{\triangle}$ in all cases. This issue is inherited from TriCycle, but it does not occur in realistic social graphs such as the ones used in our experiments, nor in those used in [20].
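The accept-or-revert logic of a single edge swap in Alg. 4 can be illustrated with a short sketch. It is an assumption-laden simplification: the graph is a dictionary of neighbour sets, GetCommonNeighbour is taken to count all common neighbours of the two endpoints, and the distinction between intra- and inter-community triangles, as well as the choice of the oldest edge, is left to the caller. The names `common_neighbours` and `try_swap` are hypothetical.

```python
def common_neighbours(adj, u, v):
    """Number of vertices adjacent to both u and v."""
    return len(adj[u] & adj[v])

def try_swap(adj, new_edge, old_edge):
    """Replace old_edge by new_edge only if the triangle count increases.

    Returns the change in the number of triangles (0 if the swap is discarded).
    """
    (a, b), (c, d) = new_edge, old_edge
    n_prev = common_neighbours(adj, c, d)  # triangles destroyed by removing old_edge
    adj[c].discard(d)
    adj[d].discard(c)
    n_new = common_neighbours(adj, a, b)   # triangles created by adding new_edge
    if n_prev < n_new:
        adj[a].add(b)
        adj[b].add(a)
        return n_new - n_prev
    # Discard the swap: restore the removed edge.
    adj[c].add(d)
    adj[d].add(c)
    return 0
```

Counting common neighbours avoids recounting all triangles after every swap, which is what keeps the running counters in Alg. 4 cheap to maintain.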

4.4 Parameter Estimation for CAGM

Estimating $\Theta^c_M$. The estimation of $\Theta^c_M$ reduces to computing the community-wise counters that it relies on: intra- and inter-community degrees of every vertex, the number of intra-community triangles for each community and the number of inter-community triangles. As we mentioned in Sect. 4.3.1, degrees and triangle counts are used to preserve global structural properties of the generated graphs such as degree distribution and clustering coefficients. They can be efficiently computed on the original graph both exactly and under DP.

Estimating $\Theta^c_X$. In order to keep the estimation procedure tractable, we introduce the assumption that attributes are independent. While not realistic in all scenarios, this assumption simplifies the estimation and handles the sparsity of the attribute vectors when the number of attributes is large. As seen in [18, 20], not having such an assumption severely limits the number of features that can be practically handled. We denote by $x_\ell$ the value of the $\ell$-th component of the attribute vector $x$. Likewise, we denote by $\tau_\ell(v)$ the value of the $\ell$-th component of the vector labelling vertex $v$. We estimate the probability that a node $v$ is labelled with an attribute vector $x$ by the following formula:

$$\Pr(\tau(v) = x \mid \psi_{\mathcal{C}}(v), \Theta^c_X) = \prod_{\ell=1}^{k} \Pr\nolimits_\ell(\tau_\ell(v) = x_\ell \mid \psi_{\mathcal{C}}(v), \Theta^c_X),$$

where $k$ is the number of columns of $X$ (and hence the dimension of every attribute vector) and

$$\Pr\nolimits_\ell(\tau_\ell(v) = x_\ell \mid \psi_{\mathcal{C}}(v), \Theta^c_X) = \frac{|\{v' \in \psi_{\mathcal{C}}(v) \mid \tau_\ell(v') = x_\ell\}|}{|\psi_{\mathcal{C}}(v)|}.$$
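The two formulas above amount to per-community frequency counts followed by a product over attributes. The following sketch illustrates this for binary attributes (an assumption made here for simplicity; the function names and the dictionary-based data layout are hypothetical).

```python
from collections import defaultdict

def estimate_theta_x(attrs, community):
    """attrs: dict vertex -> tuple of 0/1 values; community: dict vertex -> label.

    Returns dict label -> list whose l-th entry estimates Pr(attribute l = 1).
    """
    counts, sizes = {}, defaultdict(int)
    for v, x in attrs.items():
        c = community[v]
        sizes[c] += 1
        acc = counts.setdefault(c, [0] * len(x))
        for l, bit in enumerate(x):
            acc[l] += bit
    return {c: [n / sizes[c] for n in counts[c]] for c in counts}

def vector_probability(x, probs):
    """Probability of the whole vector x under the independence assumption."""
    p = 1.0
    for bit, q in zip(x, probs):
        p *= q if bit == 1 else 1.0 - q
    return p
```

Because each attribute is handled separately, only $k$ values per community need to be stored, rather than one value per possible attribute vector.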

Estimating $\Theta^c_F$. As we discussed in Sect. 4.2, for defining $\Theta^c_F$ it is necessary to define an aggregator function for pairs of attribute vectors. Our aggregator function is based on the widely used cosine similarity, that is, the cosine of the angle between the two feature vectors. Since the range $B$ of aggregator functions needs to be discrete, we split the range $[0, 1]$ of the cosine similarity into a set of intervals, determined by a parameter $\delta$ satisfying $0 < \delta \le 1$. Let $s_{cos}(x, x')$ denote the similarity between vectors $x$ and $x'$. Our aggregator function is defined as $\beta(x, x') = \left\lfloor \frac{s_{cos}(x, x')}{\delta} \right\rfloor$. Note that, according to this definition, $B = \{\lfloor \frac{s}{\delta} \rfloor \mid s \in [0, 1]\}$. Finally, the probability that an aggregated feature $u \in B$ characterises a pair of connected vertices $v_i, v_j$ satisfying $\psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C$ in the input graph is computed as

$$\Pr(\beta(\tau(v_i), \tau(v_j)) = u \mid \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C, A_{i,j} = 1, \Theta^c_F) = \frac{|\{(v_p, v_q) \in E \mid \beta(\tau(v_p), \tau(v_q)) = u \land \psi_{\mathcal{C}}(v_p) = \psi_{\mathcal{C}}(v_q) = C\}|}{|\{(v_p, v_q) \in E \mid \psi_{\mathcal{C}}(v_p) = \psi_{\mathcal{C}}(v_q) = C\}|}.$$

Analogously, for a pair of connected vertices $v_i, v_j$ satisfying $\psi_{\mathcal{C}}(v_i) \neq \psi_{\mathcal{C}}(v_j)$, we make

$$\Pr(\beta(\tau(v_i), \tau(v_j)) = u \mid \psi_{\mathcal{C}}(v_i) \neq \psi_{\mathcal{C}}(v_j), A_{i,j} = 1, \Theta^c_F) = \frac{|\{(v_p, v_q) \in E \mid \beta(\tau(v_p), \tau(v_q)) = u \land \psi_{\mathcal{C}}(v_p) \neq \psi_{\mathcal{C}}(v_q)\}|}{|\{(v_p, v_q) \in E \mid \psi_{\mathcal{C}}(v_p) \neq \psi_{\mathcal{C}}(v_q)\}|}.$$

Compared to the approach introduced in [18, 20], our model uses a coarser granularity for aggregated features. Thanks to that, it avoids computing $2^{2k}$ different probability values, which is extremely inefficient.
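The aggregator $\beta$ and the resulting edge-feature distribution can be sketched as follows. The sketch makes a few assumptions that are not stated in the text: a zero vector is given similarity 0, a small tolerance guards the floor against floating-point rounding at bucket boundaries, and the caller passes in the relevant edge list (the intra-community edges of one community, or all inter-community edges). The function names are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return min(1.0, dot / (norm_x * norm_y))

def beta(x, y, delta):
    """Bucket index floor(s_cos(x, y) / delta), for 0 < delta <= 1."""
    # The 1e-9 tolerance (an assumption) keeps values such as 0.9999999999999998
    # from falling into the bucket below the one they belong to.
    return math.floor(cosine_similarity(x, y) / delta + 1e-9)

def edge_feature_distribution(edges, attrs, delta):
    """Fraction of the given edges whose endpoints fall in each bucket."""
    counts = Counter(beta(attrs[u], attrs[v], delta) for u, v in edges)
    total = sum(counts.values())
    return {u: c / total for u, c in counts.items()}
```

With $\delta = 0.25$, for example, there are at most five buckets regardless of the number of attributes, which is the coarser granularity referred to above.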

5 Differentially Private CAGM

As we discussed in Sect. 3, the difference between instantiations of DP for graphs lies in the definition of the graph pairs considered to be neighbouring datasets. Here, we adopt the following definition from [20].

Definition 2 (Neighbouring attributed graphs [20]). A pair of attributed graphs $G = (V, \mathcal{E}, X)$ and $G' = (V, \mathcal{E}', X')$ are neighbouring, denoted $G \sim_{at} G'$, if and only if they differ in the presence of exactly one edge or the attribute vector of exactly one node. That is,

$$G \sim_{at} G' \iff |\mathcal{E} \,\nabla\, \mathcal{E}'| = 1 \;\lor\; \big(\exists_{v \in V}\; \tau_G(v) \neq \tau_{G'}(v) \;\land\; \forall_{v' \in V \setminus \{v\}}\; \tau_G(v') = \tau_{G'}(v')\big).$$

Def. 2 entails that the existence of relations, that is the occurrence of edges, and the attributes describing every particular user, are treated as sensitive. On the contrary, vertex identifiers are treated as non-private. These criteria are in line with the current privacy policies of most social networking sites, where the fact that a profile exists is public information, but users can keep their personal information and friends list private or hidden from the general public. With Def. 2 in mind, we describe in what follows the differentially private computation of every parameter of CAGM. Due to space limitations, we develop all the proofs in the appendices.

5.1 Obtaining the Community Partition

Our differentially private community partition method extends the algorithm ModDivisive [46], in such a way that it takes node attributes into account. ModDivisive searches for a community partition that maximises modularity, a structural parameter encoding the intuition that a user tends to be more connected to users in the same community than to users in other communities [44]. Modularity is defined as

$$\sum_{C \in \mathcal{C}} \left( \frac{\ell_C}{m} - \left( \frac{d_C}{2m} \right)^2 \right),$$

where $m$ is the number of edges of the graph, $\ell_C$ is the number of edges between the nodes in $C$ and $d_C$ is the sum of degrees of the nodes in $C$. ModDivisive uses the exponential mechanism, considering the set of possible partitions as the categorical co-domain, and using modularity as the scoring function. In order to integrate node features into ModDivisive, we introduce a new objective function that combines the original modularity with an attribute-based quality criterion. The new function is defined as $Q(\mathcal{C}) = w_s \cdot Q_s(\mathcal{C}) + w_a \cdot Q_a(\mathcal{C})$, where $w_s \in [0, 1]$, $w_a = 1 - w_s$, $Q_s(\mathcal{C})$ is the modularity of the original graph and $Q_a(\mathcal{C})$ is the modularity of an auxiliary graph obtained from the original as follows. First, we take the vertex set of the original graph. Then, we compute all pairwise similarities between their associated feature vectors. Similarities are computed using the cosine measure. Finally, we add to the auxiliary graph the edges corresponding to the $\left\lceil \frac{n(n-1)}{20} \right\rceil$ most similar attributed node pairs, that is the top-ranked 10%. It is proven in [46] that the global sensitivity of $Q_s(\mathcal{C})$ is upper bounded by $\frac{3}{m}$, where $m$ is the minimum number of edges of all potential graphs to publish. In the worst case, $\Delta Q_s(\mathcal{C}) = 3$, considering that the original graph is an arbitrary non-empty graph. However, this is not the case for real-life social graphs, so introducing more realistic assumptions about the value of $m$ allows us to use smaller values of $\Delta Q_s(\mathcal{C})$ and thus reduce the amount of noise added in differentially privately computing $Q_s(\mathcal{C})$. Throughout this paper, we assume $m = 10{,}000$, which leads to $\Delta Q_s(\mathcal{C}) = 0.0003$. In what follows, we apply an analogous reasoning for bounding $\Delta Q_a(\mathcal{C})$.

Proposition 1. Every graph $G$ of order $n$ satisfies $LS_{Q_a(\mathcal{C})}(G) \le \frac{60}{n}$.

Combining the result in [46] with that of Prop. 1, we conclude that $LS_{Q(\mathcal{C})}(G) \le 0.0003 \cdot w_s + \frac{60}{|V(G)|} \cdot w_a$ for every $G$ satisfying the aforementioned assumptions, and use this value as an upper bound for $\Delta Q(\mathcal{C})$.
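To make the objective concrete, the following sketch computes the modularity of a partition from an edge list and combines the structural and attribute-based scores as $Q = w_s Q_s + w_a Q_a$. It is only an illustration: the auxiliary graph is assumed to have been built beforehand and is passed in as a second edge list, and the function names are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Sum over communities of l_C / m - (d_C / (2 m))^2."""
    m = len(edges)
    inside = defaultdict(int)   # l_C: edges with both endpoints in C
    degree = defaultdict(int)   # d_C: sum of degrees of the nodes in C
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            inside[community[u]] += 1
    return sum(inside[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

def combined_objective(edges, aux_edges, community, w_s):
    """Q = w_s * Q_s + (1 - w_s) * Q_a, with Q_a computed on the auxiliary graph."""
    q_s = modularity(edges, community)
    q_a = modularity(aux_edges, community)
    return w_s * q_s + (1.0 - w_s) * q_a
```

For two disjoint triangles placed in separate communities, each community contributes $\frac{3}{6} - \left(\frac{6}{12}\right)^2 = 0.25$, so the modularity is 0.5, whereas putting all six vertices in one community gives 0.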

5.2 Attribute Vector Distribution

As discussed in Sect. 4.4, given a community partition $\mathcal{C}$, in order to obtain the differentially private estimation of $\Theta^c_X$ (denoted by $\tilde{\Theta}^c_X$), we need to compute the probability distribution of each attribute for every community, i.e. $\Pr_\ell(\tau_\ell(v) = x_\ell \mid \psi_{\mathcal{C}}(v), \Theta^c_X)$, for each $\ell \in \{1, \ldots, k\}$ and $C \in \mathcal{C}$ (recall that $k$ is the number of attributes). Computing this probability reduces to computing the number of nodes in $C$ whose $\ell$-th attribute has value 1, which we denote $n^\ell_C$. Let $n_C$ be the $k$-tuple $(n^1_C, n^2_C, \ldots, n^k_C)$. In order to obtain the differentially private $k$-tuple $\tilde{n}_C = (\tilde{n}^1_C, \tilde{n}^2_C, \ldots, \tilde{n}^k_C)$, we add to each element in $n_C$ noise sampled from $Lap\left(\frac{k}{\varepsilon_X}\right)$, where $\varepsilon_X$ is the privacy budget reserved for this computation and $k$ is the global sensitivity of $n_C$, as shown next.

Proposition 2. The global sensitivity of nC is k.
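The perturbation step described in this subsection can be sketched with the Laplace mechanism as follows. The sketch uses only the standard library and draws Laplace noise as the difference of two exponential variates with the same rate, which is a standard way of sampling it; the function names are hypothetical and no post-processing of the noisy counts is shown.

```python
import random

def laplace_noise(scale, rng):
    """Sample from Lap(scale) as the difference of two Exp(1/scale) variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_attribute_counts(counts, epsilon_x, rng):
    """counts: the k-tuple (n^1_C, ..., n^k_C) of one community.

    Adds Lap(k / epsilon_x) noise to every entry, k being the global sensitivity.
    """
    k = len(counts)
    scale = k / epsilon_x
    return [n + laplace_noise(scale, rng) for n in counts]
```

Note that the noise scale grows linearly with the number of attributes $k$, so the per-attribute accuracy degrades as more attributes are released under the same budget $\varepsilon_X$.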

5.3 Attribute-Edge Correlations

Recall that the aggregator function $\beta$ defined in Sect. 4.4 maps every pair of attribute vectors to a non-negative integer in $B = \{\lfloor \frac{s}{\delta} \rfloor \mid s \in [0, 1]\}$, for some $\delta \in (0, 1]$. In order to estimate $\Theta^c_F$, we need to count, for each possible output of $\beta$, the number of edges whose end-nodes are mapped to this value. For every $u \in B$ and every $C \in \mathcal{C}$, let $r^u_C$ be the number of intra-community edges $(v, w)$ in $C$ such that $\beta(\tau(v), \tau(w)) = u$. Likewise, let $r^u_{inter}$ be the number of inter-community edges $(v, w)$ such that $\beta(\tau(v), \tau(w)) = u$. Thus, in order to compute $\Theta^c_F$, we need to differentially privately compute $r^u_C$ for every $u \in B$ and every $C \in \mathcal{C}$, as well as $r^u_{inter}$ for every $u \in B$. We denote by $\tilde{r}^u_C$ and $\tilde{r}^u_{inter}$ the corresponding differentially private values.

The global sensitivity of every $|B|$-tuple of the form $(r^{u_1}_C, r^{u_2}_C, \ldots, r^{u_{|B|}}_C)$, $C \in \mathcal{C}$, as well as that of every $|B|$-tuple of the form $(r^{u_1}_{inter}, r^{u_2}_{inter}, \ldots, r^{u_{|B|}}_{inter})$, is $2(|V| - 2)$ [20], which is unbounded. To overcome this problem, we follow an approach analogous to the one used in [20] for counting attribute-edge correlations in the entire graph. The method, introduced in [3], consists in truncating the edge set of the graph to ensure that the degree of all nodes is at most $p$, which in the case of attribute-edge correlations ensures that the global sensitivity is $2p$ [3, 20]. In consequence, for every $u \in B$ and every $C \in \mathcal{C}$, we obtain $\tilde{r}^u_C$ from $r^u_C$ by adding noise sampled from $Lap\left(\frac{2p}{\varepsilon_F}\right)$, where $\varepsilon_F$ is the privacy budget reserved for this computation. Likewise, we obtain $\tilde{r}^u_{inter}$ from $r^u_{inter}$ by adding noise sampled from $Lap\left(\frac{2p}{\varepsilon_F}\right)$.
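A degree-bounding truncation of the kind mentioned above can be sketched as follows. The exact truncation procedure of [3] is not reproduced here: the sketch assumes a simple greedy rule that scans the edges in a fixed canonical order and keeps an edge only while both endpoints still have fewer than $p$ kept edges. The function names are hypothetical.

```python
from collections import defaultdict

def truncate_edges(edges, p):
    """Keep a subset of edges in which every node has degree at most p.

    Assumption: greedy scan in sorted order, not necessarily the rule of [3].
    """
    degree = defaultdict(int)
    kept = []
    for u, v in sorted(edges):
        if degree[u] < p and degree[v] < p:
            kept.append((u, v))
            degree[u] += 1
            degree[v] += 1
    return kept

def laplace_scale(p, epsilon_f):
    """Noise scale 2p / epsilon_F used for the truncated correlation counts."""
    return 2 * p / epsilon_f
```

The parameter $p$ trades off two sources of error: a small $p$ discards many edges of high-degree nodes, whereas a large $p$ increases the scale of the Laplace noise.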

5.4 CPGM Parameters

Intra- and inter-community degrees. We first add noise to the raw degree values and then apply a post-processing on the perturbed degree sequences to restore certain properties of the original sequence, namely graphicality and the order of the nodes in terms of their degrees, as well as certain community-specific properties. Let $d^C_{intra} = (d^{1,C}_{intra}, d^{2,C}_{intra}, \ldots, d^{m,C}_{intra})$, where $m = |C|$ and $d^{i,C}_{intra} \le d^{i+1,C}_{intra}$ ($1 \le i < |C|$), be the list of non-decreasingly ordered original intra-community degrees in $C \in \mathcal{C}$. Analogously, let $d^C_{inter} = (d^{1,C}_{inter}, d^{2,C}_{inter}, \ldots, d^{m,C}_{inter})$ be the sequence of inter-community degrees of nodes in $C$. The global sensitivity of the degree sequence of the entire graph is 2, as adding or removing one edge changes the degrees of exactly two nodes by 1 [15]. The same is true for every $d^C_{intra}$ and $d^C_{inter}$, since the degrees of at most two intra-community nodes (or at most one node in $C$ and one node outside of $C$) change by 1. Thus, for every $C \in \mathcal{C}$, we obtain from $d^C_{intra}$ the differentially private sequence $\tilde{d}^C_{intra}$ by adding noise sampled from $Lap\left(\frac{2}{\varepsilon_d}\right)$ to every degree value. Similarly, we obtain from $d^C_{inter}$ the differentially private sequence $\tilde{d}^C_{inter}$. Afterwards, the noisy sequences are post-processed to restore three properties: (i) the non-decreasing order between the intra-community degrees inside every community, (ii) the graphicality of the intra-community degrees of every community, and (iii) the graphicality of the inter-community degrees of all nodes in the graph. Property (i) is enforced using the method proposed in [15], whereas properties (ii) and (iii) are enforced using the method proposed in [24].

Numbers of intra- and inter-community triangles. The global sensitivity of the number of triangles in a graph is proven in [48] to be $n - 2$, where $n$ is the number of vertices. The next result characterises the global sensitivity of the number of intra-community triangles.

Proposition 3. The global sensitivity of the number of intra-community triangles of a graph $G$ with a community partition $\mathcal{C}$ is $\Delta n^{intra}_{\triangle} = \max_{C \in \mathcal{C}} \{|C| - 2\}$.

Since the global sensitivity of triangle count queries is unbounded, the Laplace mechanism cannot be applied in this case. An accurate differentially private method for counting the number of triangles of a graph is presented in [63]. This method uses the exponential mechanism. It interprets the triangle count query as a categorical query, whose co-domain is a partition $\mathcal{O}$ of $\mathbb{Z}^+$. One of the elements of $\mathcal{O}$ is a singleton set composed exclusively of the correct output of the query, whereas every other element contains a set of inaccurate values which are treated as equally useful. The authors of [63] define the notion of ladder function, which is used as a scoring function on the elements of $\mathcal{O}$. A ladder function gives better scores to the sets of values that are closer to the correct query answer. In order to differentially privately compute the number of triangles of a graph $G$, it is first necessary to compute the correct number of triangles. Then, the ladder function is built, and a set $O \in \mathcal{O}$ is sampled following the exponential mechanism. Finally, a random element of $O$ is given as the differentially private output of the query [63]. It is shown in [63] that the best ladder function, in the sense that it adds the minimum necessary amount of noise, is the so-called local sensitivity at distance $t$ [48], denoted as $LS_q(G, t)$, which is based on the notion of local sensitivity. The local sensitivity of a query $q$ on a graph $G$ [48] is computed as $LS_q(G) = \max_{G \sim G'} \|q(G) - q(G')\|_1$, that is the maximum difference between the output of $q$ on $G$ and those on its neighbouring datasets. The local sensitivity at distance $t$ is defined as the maximum local sensitivity of the query $q$ among all the graphs at edge-edit distance at most $t$ from $G$. Formally, $LS_q(G, t) = \max_{\{G' \mid \phi(G, G') \le t\}} LS_q(G')$, where $\phi(G, G')$ is the edge-edit distance between $G$ and $G'$. It is also shown in [48, 63] that $LS_q(G, t) = \max_{1 \le i < j \le |V|} LS_{q_{ij}}(G, t)$, where $LS_{q_{ij}}(G, t) = \max_{\{G', G'' \mid \phi(G, G') \le t,\, G' \sim_{ij} G''\}} |q(G') - q(G'')|$ and $G' \sim_{ij} G''$ indicates that $G'$ and $G''$ differ in exactly the addition or removal of $(v_i, v_j)$.

Here, we apply the ladder function approach for computing the number of intra-community triangles. To that end, we characterise the function $LS_{n^{intra}_{\triangle}}((G, \mathcal{C}), t)$ for every graph $G$ with a community partition $\mathcal{C}$.

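Proposition 4. For every graph $G$, every community partition $\mathcal{C}$ of $G$, and every positive integer $t \ge 1$,

$$LS_{n^{intra}_{\triangle}}((G, \mathcal{C}), t) = \max_{\{i, j \mid \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)\}} \left\{ \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\; |\psi_{\mathcal{C}}(v_i)| - 2 \right\} \right\},$$

where $a_{ij} = |\{v_\ell \in \psi_{\mathcal{C}}(v_i) \mid A_{i,\ell} = 1 \land A_{j,\ell} = 1\}|$ and $b_{ij} = |\{v_\ell \in \psi_{\mathcal{C}}(v_i) \mid A_{i,\ell} \oplus A_{j,\ell} = 1\}|$.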

In Prop. 4, the operator $\oplus$ denotes exclusive or. Notice that $LS_{n^{intra}_{\triangle}}((G, \mathcal{C}), t)$ can be efficiently computed for small values of $t$, and it converges to the efficiently computable global sensitivity $\Delta n^{intra}_{\triangle}$ for $t \ge 2 \max_{C \in \mathcal{C}} |C|$, so it can be used for efficiently and privately computing the number of intra-community triangles. Finally, for computing the number of inter-community triangles, we use the method from [63] to compute the number of triangles of the entire graph, and subtract from it the number of intra-community triangles computed with the method described above.
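The expression in Prop. 4 can be evaluated by a direct scan over the pairs of vertices of each community, as the following sketch shows. It assumes the graph is given as a dictionary of neighbour sets, and it excludes $v_i$ and $v_j$ themselves when counting $a_{ij}$ and $b_{ij}$ (an interpretation adopted here, since neither vertex can be the third vertex of a triangle on the pair). The function name is hypothetical.

```python
from collections import defaultdict

def local_sensitivity_intra(adj, community, t):
    """Local sensitivity at distance t of the intra-community triangle count."""
    members = defaultdict(set)
    for v, c in community.items():
        members[c].add(v)
    best = 0
    for vs in members.values():
        ordered = sorted(vs)
        for i, vi in enumerate(ordered):
            for vj in ordered[i + 1:]:
                others = vs - {vi, vj}
                ni = adj[vi] & others
                nj = adj[vj] & others
                a = len(ni & nj)  # adjacent to both v_i and v_j
                b = len(ni ^ nj)  # adjacent to exactly one of them
                value = min(a + (t + min(t, b)) // 2, len(vs) - 2)
                best = max(best, value)
    return best
```

For a four-vertex community with edges 0-2 and 1-2, the function returns 1 for $t = 1$ and reaches the cap $|C| - 2 = 2$ for $t = 4$, in line with the convergence to the global sensitivity noted above. The quadratic scan per community is meant for illustration only.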

5.5 Summary and Discussion

In what follows, we will use the notation CAGMDP to refer to a differentially private instance of CAGM. The privacy budget $\varepsilon$ is split among the different computations as follows: $\varepsilon_c = \frac{\varepsilon}{2}$ for the community partition method, $\varepsilon_F = \frac{\varepsilon}{6}$ for the estimation of $\Theta^c_F$, and $\varepsilon_d = \varepsilon_{\triangle} = \varepsilon^{intra}_{\triangle} = \varepsilon_X = \frac{\varepsilon}{12}$ for the estimation of degree distributions, triangle counts and $\Theta^c_X$.

As we mentioned in Sect. 2, it was shown in [32] that dependencies among tuples can harm the level of privacy offered by differential privacy, since DP implicitly assumes that tuples are independent. To showcase this problem, they showed that in the scenario of differentially private degree sequence computation from an attributed graph with public attributes, node affinity (determined by the similarity of their feature vectors) can be used to improve the adversary's certainty on the existence of edges. For example, users born in the same city, who attended the same university and have approximately the same age are more likely to be connected than the average. Our approach does not suffer from this problem, because we do not treat node attributes as public. Instead, we assign synthetic attribute vectors to nodes, which are sampled from the differentially private model $\Theta^c_X$. In Sect. 4.3, for the sake of efficiency in the synthetic graph generation process, we introduced the assumption that the features of one node are independent from each other. For example, even though intuitively only one of the features "eye color is blue", "eye color is brown" and "eye color is green" can take the value 1 for a given user, we treat them as independent. While this assumption can certainly add inaccuracies in the generated graph, such as a user portrayed as having blue and brown eyes at the same time, it does not create an additional privacy risk, and it does not entail additional assumptions of independence among tuples, other than the ones already implicit in the differentially private methods for computing the model parameters.
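As a quick consistency check on the budget split given at the start of this subsection, and assuming that the individual computations are combined by sequential composition (an assumption stated here for illustration), the six shares add up exactly to the total budget:

$$\varepsilon_c + \varepsilon_F + \varepsilon_d + \varepsilon_{\triangle} + \varepsilon^{intra}_{\triangle} + \varepsilon_X = \frac{\varepsilon}{2} + \frac{\varepsilon}{6} + 4 \cdot \frac{\varepsilon}{12} = \frac{6\varepsilon + 2\varepsilon + 4\varepsilon}{12} = \varepsilon.$$

For instance, a total budget of $\varepsilon = 2$ would give $\varepsilon_c = 1$, $\varepsilon_F = \frac{1}{3}$ and $\frac{1}{6}$ for each of the four remaining computations.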

6 Experiments

The purpose of our experiments is to empirically validate two claims: (i) our CPGM model outperforms existing models in generating graphs whose community structures are similar to those of the input graphs without sacrificing the ability to preserve global structural properties, and (ii) differentially private instances of CAGM outperform preceding models in terms of community structure preservation, and remain comparable in terms of the preservation of global structural properties.

6.1 Datasets

We use four real-world social networks with node attributes. The first one has been collected from Petster, a website for pet owners [27]. It is an undirected graph, where each node's attributes contain information about the user's pet. We extracted 13 binary attributes from 8 categorical attributes such as favourite food, gender, colour, etc. The second dataset is a subgraph of Facebook available via SNAP [31]. In this dataset, node attributes are already binary and are tagged with serial pseudonyms. For our experiments, we selected the first 50 attributes with the smallest serial numbers. The last

Table 2. Datasets used for our experiments.

| Dataset | #nodes | #edges | #triangles | GCC | #attr. |
|---|---|---|---|---|---|
| Petster | 1,898 | 12,534 | 16,750 | 0.14 | 13 |
| Facebook | 3,953 | 84,070 | 1,526,985 | 0.54 | 50 |
| Epinions8K | 8,000 | 67,547 | 203,257 | 0.17 | 50 |
| Epinions | 29,515 | 106,147 | 235,790 | 0.13 | 50 |

two datasets are constructed from a directed graph extracted from Epinions, an online consumer reviews system where every vertex represents a reviewer [37]. In the original dataset, a directed edge from node A to node B exists if user A trusts the reviews of B. For our experiments, we derived an undirected graph from the original dataset by keeping the same vertex set and adding an undirected edge for every pair of mutually trusting users. Additionally, we selected the 50 most frequently rated products as node attributes. If the user rated the product, the value is set to 1. We sampled the subgraph Epinions8K from the undirected version of Epinions by randomly selecting a seed node and progressively taking random neighbouring nodes until totalling 8,000. Table 2 summarises the statistics of the datasets.

6.2 Evaluation Measures

For every pair $(G, G')$, where $G$ is a real-life graph and $G'$ is a synthetic graph sampled from a model learned from $G$, we evaluate the extent to which $G'$ preserves the following properties of $G$.

Numbers of edges and triangles. Our evaluation measures in this case are the relative errors of the numbers of edges and triangles in $G'$ with respect to $G$, defined as $\rho_E = \frac{\big||E_{G'}| - |E_G|\big|}{|E_G|}$ and $\rho_{\triangle} = \frac{|n_{\triangle}(G') - n_{\triangle}(G)|}{n_{\triangle}(G)}$.

Global clustering coefficient. The global clustering coefficient (GCC) of a graph measures the proportion of wedges, that is, paths of length 2, that are embedded in triangles. It is defined as $\frac{3 n_{\triangle}}{n_w}$, where $n_w$ is the number of wedges and $n_{\triangle}$ is the number of triangles. We compare $G$ and $G'$ in terms of the relative error of the GCC of $G'$ with respect to that of $G$, which we denote $\rho_c$.

Degree distribution. We compare $G$ and $G'$ in terms of the Hellinger distance between their degree distributions. The Hellinger distance has been deemed the most appropriate distance for comparing probability distributions in previous works on graph synthesis [20, 43]. Given two probability distributions $p_1$ and $p_2$ on a discrete domain $W$, the Hellinger distance between $p_1$ and $p_2$ is defined as

$$H(p_1, p_2) = \frac{1}{\sqrt{2}} \sqrt{\sum_{w \in W} \left( \sqrt{p_1(w)} - \sqrt{p_2(w)} \right)^2}.$$


The Hellinger distance yields values in $[0, 1]$. The more similar two distributions are, the smaller the Hellinger distance between them. For degree distributions, we compute $p_d$ and $p'_d$, defined on the domain $W = \{0, 1, \ldots, n - 1\}$, where $n$ is the number of vertices in $G$ and $G'$. For every $i \in W$, $p_d(i)$ (resp. $p'_d(i)$) is the probability that a vertex of $G$ (resp. $G'$) has degree $i$. The score used for comparing $G$ and $G'$ is $H_d = H(p_d, p'_d)$.

Local clustering coefficients. The local clustering coefficient (LCC) of a node $v$ measures the proportion of pairs of mutual neighbours of $v$ that are connected by an edge. In social graphs, $LCC(v)$ is an indicator of the likelihood of $v$'s mutual friends to also be friends. $LCC(v)$ is defined as $\frac{2 \sum_{v_i, v_j \in N(v)} A_{i,j}}{|N(v)| \cdot (|N(v)| - 1)}$, where $N(v)$ is the set of $v$'s neighbours. For comparing $G$ and $G'$ in terms of LCC, we compute the distributions $p_{\ell c}$ and $p'_{\ell c}$, which are defined in the domain $W = \{0.01, 0.02, \ldots, 1.0\}$ in such a way that for every $i \in W$, $p_{\ell c}(i)$ (resp. $p'_{\ell c}(i)$) is the probability that a vertex of $G$ (resp. $G'$) has LCC between $i - 0.01$ and $i$. We compare $G$ and $G'$ in terms of $H(p_{\ell c}, p'_{\ell c})$, and denote this measure by $H_{\ell c}$.

Distribution of attribute-edge correlations. We define, for each community $C$, the probability distribution $p^C_F$, where $p^C_F(i)$ is the probability that the similarity between the attribute vectors of two connected nodes in $C$ is $i$. The original graph $G$, with community structure $\mathcal{C}$, is compared to a synthetic graph $G'$ in terms of the average of the Hellinger distances of the attribute-edge distributions of all communities of $\mathcal{C}$ in $G$ from those in $G'$, defined as $\rho_a(G, G') = \frac{1}{|\mathcal{C}|} \sum_{C \in \mathcal{C}} H(p^C_F, \tilde{p}^C_F)$.
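Several of the measures above rely on the Hellinger distance between two discrete distributions. A minimal sketch follows, assuming distributions are represented as dictionaries mapping each value of the domain to its probability, with missing keys read as probability 0; the function names are hypothetical.

```python
import math
from collections import Counter

def hellinger(p1, p2):
    """Hellinger distance between two discrete distributions given as dicts."""
    support = set(p1) | set(p2)
    total = sum(
        (math.sqrt(p1.get(w, 0.0)) - math.sqrt(p2.get(w, 0.0))) ** 2
        for w in support
    )
    return math.sqrt(total) / math.sqrt(2)

def degree_distribution(degrees):
    """Empirical distribution of a list of vertex degrees."""
    n = len(degrees)
    return {d: c / n for d, c in Counter(degrees).items()}
```

Identical distributions are at distance 0, and distributions with disjoint supports are at distance 1, which matches the stated range $[0, 1]$.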

Detectability of community partition. We evaluate to what extent state-of-the-art community detection algorithms find similar communities in $G$ and $G'$. To that end, we use the averaged F1 score, denoted Avg-F1, of the community structures $\mathcal{C}$ and $\mathcal{C}'$ determined by the algorithm in $G$ and $G'$, respectively. Given two communities $C_1$ and $C_2$, the F1 score between these two communities, denoted $F_1(C_1, C_2)$, combines two auxiliary measures: precision and recall. Precision is defined as $prec(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1|}$, whereas recall is defined as $recall(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_2|}$. Precision and recall are combined as $F_1(C_1, C_2) = \frac{2 \cdot prec(C_1, C_2) \cdot recall(C_1, C_2)}{prec(C_1, C_2) + recall(C_1, C_2)}$. If both precision and recall are zero, $F_1$ is made zero by convention. Following the evaluation strategy introduced in [46, 60, 62], given two sets of communities $\mathcal{C}^1$ and $\mathcal{C}^2$, the average F1-score is defined as

$$\frac{1}{2|\mathcal{C}^1|} \sum_{C^1_i \in \mathcal{C}^1} \max_{C^2_j \in \mathcal{C}^2} F_1(C^1_i, C^2_j) + \frac{1}{2|\mathcal{C}^2|} \sum_{C^2_i \in \mathcal{C}^2} \max_{C^1_j \in \mathcal{C}^1} F_1(C^2_i, C^1_j).$$

Avg-F1 values are in $[0, 1]$, and larger Avg-F1 values indicate more similar community structures.
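The Avg-F1 measure defined above can be computed directly from two lists of communities, each community being a set of vertices. The sketch below is a straightforward transcription of the formula; the function names are hypothetical, and both lists are assumed to be non-empty.

```python
def f1(c1, c2):
    """F1 score between two communities given as sets of vertices."""
    overlap = len(c1 & c2)
    if overlap == 0:
        return 0.0  # convention: F1 is zero when precision and recall are zero
    precision = overlap / len(c1)
    recall = overlap / len(c2)
    return 2 * precision * recall / (precision + recall)

def avg_f1(communities_1, communities_2):
    """Symmetric average of best-match F1 scores between two community sets."""
    first = sum(max(f1(c, d) for d in communities_2) for c in communities_1)
    second = sum(max(f1(d, c) for c in communities_1) for d in communities_2)
    return first / (2 * len(communities_1)) + second / (2 * len(communities_2))
```

As an example, comparing the partition $\{\{1, 2\}, \{3, 4\}\}$ with the single community $\{1, 2, 3, 4\}$ gives a best-match F1 of $\frac{2}{3}$ in both directions, so the Avg-F1 is $\frac{2}{3}$.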

6.3 Results and Discussion

We first evaluate the ability of our new edge generation model CPGM to synthesise graphs that preserve the community structures of the original graphs along with global structural properties. Then, we assess the overall quality of the differentially private CAGMDP model.

6.3.1 Evaluation of CPGM

We compare CPGM with the two most similar counterparts reported in the literature which are amenable to DP: TriCycle [20] and DCSBM [22]. Fig. 1 shows the extent to which the community structures found by the state-of-the-art community detection method Louvain [4] in the synthetic graphs generated by each model are similar to those detected in the corresponding original graphs. Table 3 compares the behaviours of the edge generation models in terms of global structural properties. In all columns, the values shown are averaged over 100 executions, and smaller values indicate better results.

Fig. 1. Similarities of community structures found by Louvain in synthetic graphs to those found in the original graphs.

From the analysis of these results, we extract three major observations. First, as Fig. 1 shows, the community structures of the graphs synthesised using our CPGM model are consistently more similar to those of the original graphs, in comparison to those induced by DCSBM and TriCycle. This supports our claim that CPGM is able to preserve community structure to a larger extent. Note that, in several cases, our model performs almost twice as well as the second best, DCSBM. As expected, TriCycle shows the poorest results, corroborating the intuition that community structure needs to be explicitly included in the generative model if we want synthetic graphs to preserve it. The previous observations support our design choices of preserving (i) the community


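Table 3. Comparison of edge generative models in terms of global structural properties.

| Dataset | Model | $\rho_E$ | $\rho_{\triangle}$ | $\rho_c$ | $H_d$ | $H_{\ell c}$ |
|---|---|---|---|---|---|---|
| Petster | CPGM | 0.00 | 0.18 | 0.05 | 0.16 | 0.19 |
| Petster | DCSBM | 0.00 | 0.12 | 0.49 | 0.17 | 0.27 |
| Petster | TriCycle | 0.00 | 0.00 | 0.19 | 0.18 | 0.21 |
| Facebook | CPGM | 0.00 | 0.03 | 0.32 | 0.15 | 0.32 |
| Facebook | DCSBM | 0.00 | 0.25 | 0.71 | 0.25 | 0.64 |
| Facebook | TriCycle | 0.00 | 0.04 | 0.56 | 0.37 | 0.60 |
| Epinions8K | CPGM | 0.001 | 0.00 | 0.16 | 0.11 | 0.13 |
| Epinions8K | DCSBM | 0.001 | 0.63 | 0.81 | 0.12 | 0.40 |
| Epinions8K | TriCycle | 0.001 | 0.04 | 0.11 | 0.15 | 0.24 |
| Epinions | CPGM | 0.001 | 0.04 | 0.27 | 0.13 | 0.31 |
| Epinions | DCSBM | 0.002 | 0.60 | 0.83 | 0.14 | 0.26 |
| Epinions | TriCycle | 0.001 | 0.04 | 0.22 | 0.10 | 0.31 |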

structure, and (ii) separate intra- and inter-community structural properties. Second, the graphs generated by our CPGM model are consistently the most accurate in terms of the distributions of local clustering coefficients (see right-most column of Table 3), and have close accuracy in terms of global clustering coefficient to the best model (see column labelled $\rho_c$ in Table 3). An analogous observation can be made for the number of triangles (column labelled $\rho_{\triangle}$). We consider that these observations support our design choice of preserving separate intra- and inter-community edge densities and triangle counts. The comparably poorer performance of DCSBM in terms of global and local clustering coefficients also corroborates the need to explicitly model them, as do CPGM and TriCycle. A more detailed graphical description of the behaviour of the three models in terms of the distributions of local clustering coefficients is shown in Fig. 4, in App. F. The figure shows the complementary cumulative distribution functions of local clustering coefficients for the three models, and highlights the special ability of CPGM to capture the behaviour of the distribution for denser-than-normal graphs (the Facebook dataset in this case). Finally, we point out that all three models successfully preserve the properties of the degree distribution. Our CPGM model produces the most consistent results in all the datasets in terms of Hellinger distances.

The results shown in this subsection support our claim that synthetic graphs sampled from CPGM preserve the community structure of the original graph to a considerably larger extent than its closest counterparts, without sacrificing the ability to preserve global structural properties. These results also show that the manner in which CPGM computes intra- and inter-community parameters helps it outperform competing models in preserving local and global clustering coefficients.

6.3.2 Evaluation of differentially private CAGM

We compare CAGMDP with two other models. The first one is the differentially private AGM model using TriCycle as edge set generator [20]. We refer to this model as AGMDP-Tri. It was shown in [20] that, despite the noise added to guarantee privacy, AGMDP-Tri still preserves to some extent TriCycle's ability to capture global structural properties. As we saw in the previous section, TriCycle performs poorly in preserving the community structure of the original graph, so we consider for our evaluation an additional model, which is a modification of CAGMDP where CPGM is replaced by DCSBM as the edge generation model. We refer to this model as CAGMDP-D.

We compare the behaviour of the three models under four privacy budgets: 2.0, 3.0, 4.0 and 5.0. We chose these values to ensure that the part of the privacy budget devoted to ModDivisive (1.0, 1.5, 2.0 and 2.5, respectively) is within the same range as the budgets used in the experiments reported in [46]. The authors of ModDivisive state that epsilon values for their method should be in these ranges in order to enable the accurate privacy-preserving generation of a search tree that is guaranteed to yield communities of reasonable size. The use of smaller values usually leads to obtaining a single community containing all nodes, which is equivalent to obtaining no community partition at all. Each comparison between instances of the three models uses the same value for the privacy budget. Since CAGMDP-D and AGMDP-Tri each have fewer parameters than CAGMDP, we re-allocate in each case the remaining privacy budget to other computations. In estimating CAGMDP-D, we allocate to community partition the same budget as for CAGMDP, i.e. $\frac{\varepsilon}{2}$. CAGMDP-D requires computing the numbers of edges between every pair of communities. We assign to this computation the budget used in CAGMDP for counting the number of intra-community triangles, i.e. $\frac{\varepsilon}{12}$. Finally, CAGMDP-D is given for degree sequence computation the budget $\varepsilon'_d = \varepsilon_d + \varepsilon_{\triangle} = \frac{\varepsilon}{6}$, as it does not need to count the global number of triangles. Since AGMDP-Tri computes every parameter computed by CAGMDP, except for the community partition (which takes half of the budget of CAGMDP), we double the budget assigned to every other computation of AGMDP-Tri. These re-allocations give comparative advantages to both competing models CAGMDP-D and AGMDP-Tri in their comparison with our model, as they will be able to more accurately compute some of the


Fig. 2. Comparison of differentially private models in terms of community structure preservation.

parameters they have in common with our model. Weallow this advantage because requiring a smaller numberof computations is a positive feature of differentially pri-vate methods, so it should not be punished in the com-parison. In CAGMDP and CAGMDP-D, ModDivisive is runwith ws = 0.98. Finally, in estimating the attribute-edgecorrelations distribution (discussed in Sect. 5.3), we setthe maximum degree parameter p to 100.
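The arithmetic behind the re-allocation for CAGMDP-D can be checked with a short sketch. It covers only the three shares stated explicitly in this passage (community partition, inter-community edge counts and degree sequence); the shares of the remaining computations are not restated here and are therefore omitted. The function name and dictionary keys are illustrative assumptions, not identifiers from our implementation.

```python
from fractions import Fraction


def cagm_dp_d_shares(total_budget):
    """Budget shares of CAGMDP-D that are stated in the text, as exact fractions."""
    eps = Fraction(str(total_budget))
    return {
        # Same share as the one CAGMDP devotes to ModDivisive.
        "community_partition": eps / 2,
        # Share that CAGMDP spends on counting intra-community triangles.
        "inter_community_edges": eps / 12,
        # eps_d + eps_triangle, merged because no global triangle count is needed.
        "degree_sequence": eps / 6,
    }


for total in (2.0, 3.0, 4.0, 5.0):
    shares = cagm_dp_d_shares(total)
    print(total, {name: float(value) for name, value in shares.items()})
```

For the four total budgets used in the experiments, the community partition share evaluates to 1.0, 1.5, 2.0 and 2.5, which matches the ModDivisive budgets listed above.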

Fig. 2 displays the behaviours of the three models in terms of community structure preservation, whereas Tables 4, 5, 6 and 7 summarise their behaviours in terms of global structural properties and attribute-edge correlations on the selected datasets. In what follows, we analyse these results from three different perspectives.

Community structure preservation. In Fig. 2, the four upper charts display the extent to which the community structures found by Louvain in the synthetic graphs generated by each differentially private model are similar to those detected in the corresponding original graphs. Two important features of the Louvain algorithm are shared by ModDivisive, the method used for obtaining community partitions in CAGMDP and CAGMDP-D. Both generate a community partition, and both operate by maximising modularity. In order to assess whether the community structures induced by our models in the synthetic graphs are also detectable by algorithms based on different criteria, we additionally obtained analogous results using the algorithm CESNA [62]. These results are shown in the lower four charts of Fig. 2. Unlike Louvain, CESNA takes node attributes into consideration for computing communities. However, CESNA tends to obtain substantially overlapping communities, whereas both CAGMDP and CAGMDP-D assume a partition. CESNA requires as a parameter the number of communities, which we set to 10.
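Since both Louvain and ModDivisive operate by maximising modularity, a minimal sketch of how the modularity of a given partition can be computed may help fix ideas. The function below uses the standard Newman-Girvan definition for undirected, unweighted graphs without duplicate edges. The function name and the toy graph are illustrative assumptions and are not taken from the implementations evaluated here.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")
    intra_edges = defaultdict(int)  # number of edges inside each community
    degree_sum = defaultdict(int)   # sum of the degrees of each community's nodes
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra_edges[cu] += 1
    return sum(intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {v: "A" for v in range(6)}

q_two = modularity(edges, two_communities)
q_one = modularity(edges, one_community)
print(q_two, q_one)
```

On this toy graph the two-triangle partition has modularity 5/14, whereas placing all nodes in a single community yields modularity 0, in line with the remark that a single all-encompassing community is equivalent to having no partition at all.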

From the analysis of these results, the most relevant observation is that, considering the communities detected by both Louvain and CESNA, the graphs sampled from our CAGMDP model consistently rank as the ones whose community structure is most similar to that of the corresponding original graphs. In the particular cases of Petster, Facebook and Epinions8K with the Louvain algorithm, the similarity values displayed by our CAGMDP model are in most cases more than twice as good as their counterparts for AGMDP-Tri, and considerably better than those for CAGMDP-D. However, this advantage decreases as the number of vertices grows. All three models fail to effectively preserve the community structures on Epinions, which has more than 20,000 nodes. This is because the amount of noise added to enforce DP grows proportionally to the number of nodes.

Distributions of attribute-edge correlations. For each original graph, we compute the distributions of attribute-edge correlations in all communities detected by CESNA. Then, we compute the equivalent distributions on each synthetic graph and compare them to that of the original graph in terms of $\rho_a$ (see the right-most columns of Tables 4, 5, 6 and 7). From the analysis of these results, we can see that the synthetic graphs sampled from CAGMDP and CAGMDP-D, the two models that consider community structures, consistently outperform those sampled from AGMDP-Tri in terms of $\rho_a$. Another important observation is that the qualities, in terms of $\rho_a$, of synthetic graphs sampled from our CAGMDP model and those sampled from its variant CAGMDP-D are quite similar. This observation suggests that CAGMDP can in some cases be seen as a meta-model, where several edge generation models, e.g. CPGM and DCSBM, can be used.

Global structural properties. From the analysis of Tables 4, 5, 6 and 7, we can see that, as expected, our CAGMDP model suffers a larger degradation than CAGMDP-D and AGMDP-Tri in terms of the measures that depend on parameters for which the latter models were allocated larger privacy budgets, most notably the numbers of edges. The reason for these relatively larger errors is that our edge generative model CPGM requires perturbing two degree-related parameters (intra- and inter-community degrees), while the other models have just one. This also affects the performance of CAGMDP in preserving global clustering coefficients.

Table 4. Comparison of DP models on Petster.

ε    Model      ρ_E   ρ_△   ρ_c   H_d   H_ℓc  ρ_a
2.0  CAGMDP     0.56  0.09  0.43  0.42  0.39  0.14
2.0  CAGMDP-D   0.22  0.55  0.69  0.30  0.44  0.23
2.0  AGMDP-Tri  0.25  0.09  0.25  0.23  0.30  0.17




Table 5. Comparison of DP models on Facebook.






Table 6. Comparison of DP models on Epinions8K.

Table 7. Comparison of DP models on Epinions.

It is worth noting, however, that in the cases where a differentially private parameter underwent post-processing, most notably regarding the number of triangles, our model obtained error rates of the same scale as AGMDP-Tri. For example, the error rate dropped to 0.04 for the Facebook dataset. Also in the Facebook graph, despite the larger error rate in the number of edges, in some cases our model showed roughly the same or even better performance in preserving the degree sequence and clustering coefficients than the AGMDP-Tri model. This is also reflected in preserving the density of node neighbourhoods. CAGMDP has the same power as AGMDP-Tri in preserving local clustering coefficients. Finally, since the amount of added noise grows with the number of vertices, synthetic graphs generated by all three models for the Epinions dataset suffer from a more significant degradation in terms of structural properties.

7 Conclusions

We have presented, to the best of our knowledge, the first community-preserving differentially private method for publishing synthetic attributed graphs. To devise this method, we developed CAGM, a new community-preserving generative attributed graph model. We have equipped CAGM with efficient parameter estimation and sampling methods, and have devised differentially private variants of the former. A comprehensive set of experiments on real-world datasets supports the claim that our method is able to generate useful synthetic graphs satisfying a strong formal privacy guarantee. Our main direction for future work is to improve CAGM by increasing the repertoire of community-related statistics captured by the model, and by equipping it with a new differentially private community partition method that integrates node attributes via a low-sensitivity objective function and/or differentially private maximum-likelihood estimation methods.

8 Acknowledgements

This work was funded by Luxembourg’s Fonds National de la Recherche, via grant C17/IS/11685812 (PrivDA).

References

[1] Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation). OJ, L 119:1–88, 4.5.2016.

[2] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–97, 2002.

[3] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. Differentially private data analysis of social networks via restricted sensitivity. In Proc. 4th Innovations in Theoretical Computer Science (ITCS), pages 87–96. ACM Press, 2013.

[4] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[5] Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. An algorithm for k-degree anonymity on large networks. In Procs. of the 2013 IEEE/ACM Int’l Conf. on Advances in Social Networks Analysis and Mining, pages 671–675, 2013.

[6] Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. Anonymizing graphs: measuring quality for clustering. Knowledge and Information Systems, 44(3):507–528, 2015.

[7] Jordi Casas-Roma, Jordi Herrera-Joancomartí, and Vicenç Torra. k-degree anonymity and edge selection: improving data utility in large networks. Knowledge and Information Systems, 50(2):447–474, 2017.

[8] Lauren E. Charles-Smith, Tera L. Reynolds, Mark A. Cameron, Mike Conway, Eric H. Y. Lau, Jennifer M. Olsen, Julie A. Pavlin, Mika Shigematsu, Laura C. Streichert, Katie J. Suda, and Courtney D. Corley. Using social media for actionable disease surveillance and outbreak management: A systematic literature review. PLOS ONE, 10(10):1–20, 10 2015.

[9] James Cheng, Ada Wai-chee Fu, and Jia Liu. K-isomorphism: privacy preserving network publication against structural attacks. In Procs. of the 2010 ACM SIGMOD Int’l Conf. on Management of Data, pages 459–470, 2010.

[10] Sean Chester, Bruce M Kapron, Ganesh Ramesh, Gautam Srivastava, Alex Thomo, and S Venkatesh. Why waldo befriended the dummy? k-anonymization of social networks with pseudo-nodes. Social Network Analysis and Mining, 3(3):381–399, 2013.

[11] Fan Chung and Linyuan Lu. The average distances in random graphs with given expected degrees. Proceedings of the National Academy of Sciences, 99(25):15879–15882, 2002.

[12] Aaron Clauset, Cristopher Moore, and M. E. J. Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101, 2008.

[13] Cynthia Dwork. Differential privacy. In Proc. 33rd International Colloquium on Automata, Languages and Programming (ICALP), volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer, 2006.

[14] Luoyi Fu, Xinzhe Fu, Zhongzhao Hu, Zhiying Xu, and Xinbing Wang. De-anonymization of social networks with communities: When quantifications meet algorithms. arXiv preprint arXiv:1703.09028, 2017.

[15] Michael Hay, Chao Li, Gerome Miklau, and David D. Jensen. Accurate estimation of the degree distribution of private networks. In Proc. 19th IEEE International Conference on Data Mining (ICDM), pages 169–178. IEEE Computer Society, 2009.

[16] Michael Hay, Gerome Miklau, David D. Jensen, Donald F. Towsley, and Philipp Weis. Resisting structural re-identification in anonymized social networks. PVLDB, 1(1):102–114, 2008.

[17] Joseph J. Pfeiffer III, Timothy La Fond, Sebastián Moreno, and Jennifer Neville. Fast generation of large scale social networks while incorporating transitive closures. In Proc. 4th International Conference on Privacy, Security, Risk and Trust (PASSAT), pages 154–165. IEEE Computer Society, 2012.

[18] Joseph J. Pfeiffer III, Sebastián Moreno, Timothy La Fond, Jennifer Neville, and Brian Gallagher. Attributed graph models: modeling network structure with correlated attributes. In Proc. 23rd International World Wide Web Conference (WWW), pages 831–842. ACM Press, 2014.

[19] Shouling Ji, Weiqing Li, Prateek Mittal, Xin Hu, and Raheem Beyah. Secgraph: A uniform and open-source evaluation system for graph data anonymization and de-anonymization. In 24th USENIX Security Symposium (USENIX Security 15), pages 303–318, 2015.

[20] Zach Jorgensen, Ting Yu, and Graham Cormode. Publishing attributed social graphs with formal privacy guarantees. In Proc. 2016 International Conference on Management of Data (SIGMOD), pages 107–122. ACM Press, 2016.

[21] Kundan Kandhway and Joy Kuri. Using node centrality and optimal control to maximize information diffusion in social networks. IEEE Trans. Systems, Man, and Cybernetics: Systems, 47(7):1099–1110, 2017.

[22] Brian Karrer and Mark E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[23] Vishesh Karwa, Sofya Raskhodnikova, Adam D. Smith, and Grigory Yaroslavtsev. Private analysis of graph structure. ACM Transactions on Database Systems, 39(3):22:1–22:33, 2014.

[24] Vishesh Karwa and Aleksandra B. Slavkovic. Differentially private graphical degree sequences and synthetic graphs. In Proc. 2012 International Conference on Privacy in Statistical Databases (PSD), volume 7556 of Lecture Notes in Computer Science, pages 273–285. Springer, 2012.

[25] Daniel Kifer and Bing-Rong Lin. Towards an axiomatization of statistical privacy and utility. In Proc. 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 147–158. ACM Press, 2010.

[26] Tamara G. Kolda, Ali Pinar, Todd D. Plantenga, and C. Seshadhri. A scalable generative graph model with community structure. SIAM J. Scientific Computing, 36(5), 2014.

[27] Jérôme Kunegis. KONECT: the koblenz network collection. In Proc. 22nd International World Wide Web Conference (WWW), pages 1343–1350. ACM Press, 2013.

[28] Christine Largeron, Pierre-Nicolas Mougel, Oualid Benyahia, and Osmar R Zaïane. Dancer: dynamic attributed networks with community structure generation. Knowledge and Information Systems, 53(1):109–151, 2017.

[29] Christine Largeron, Pierre-Nicolas Mougel, Reihaneh Rabbany, and Osmar R Zaïane. Generating attributed networks with communities. PloS one, 10(4), 2015.

[30] Jure Leskovec and Christos Faloutsos. Scalable modeling of real graphs using kronecker multiplication. In Proc. 24th International Conference on Machine Learning (ICML), pages 497–504. ACM Press, 2007.

[31] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014.

[32] Changchang Liu, Supriyo Chakraborty, and Prateek Mittal. Dependence makes you vulnerable: Differential privacy under dependent tuples. In Procs. of NDSS 2016, volume 16, pages 21–24, 2016.

[33] Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. In Proc. 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 93–106. ACM Press, 2008.

[34] Xuesong Lu, Yi Song, and Stéphane Bressan. Fast identity anonymization on graphs. In Procs. of the Int’l Conf. on Database and Expert Systems Applications, pages 281–295, 2012.

[35] Tinghuai Ma, Yuliang Zhang, Jie Cao, Jian Shen, Meili Tang, Yuan Tian, Abdullah Al-Dhelaan, and Mznah Al-Rodhaan. Kdvem: a k-degree anonymity with vertex and edge modification algorithm. Computing, 97(12):1165–1184, 2015.

[36] Nelly Marquetoux, Mark A. Stevenson, Peter Wilson, Anne Ridler, and Cord Heuer. Using social network analysis to inform disease control interventions. Preventive Veterinary Medicine, 126:94–104, 2016.

[37] Paolo Massa and Paolo Avesani. Trust-aware recommender systems. In Proc. 2007 ACM Conference on Recommender Systems (RecSys), pages 17–24. ACM Press, 2007.

[38] Sjouke Mauw, Yunior Ramírez-Cruz, and Rolando Trujillo-Rasua. Anonymising social graphs in the presence of active attackers. Transactions on Data Privacy, 11(2):169–198, 2018.

[39] Sjouke Mauw, Yunior Ramírez-Cruz, and Rolando Trujillo-Rasua. Conditional adjacency anonymity in social graphs under active attacks. Knowledge and Information Systems, 61(1):485–511, 2018.

[40] Frank McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. Communications of the ACM, 53(9):89–97, 2010.

[41] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proc. 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 94–103. IEEE Computer Society, 2007.

[42] Darakhshan J. Mir and Rebecca N. Wright. A differentially private graph estimator. In Proc. 2009 ICDM International Workshop on Privacy Aspects of Data Mining (ICDM), pages 122–129. IEEE Computer Society, 2009.

[43] Prateek Mittal, Charalampos Papamanthou, and Dawn Xiaodong Song. Preserving link privacy in social network based systems. In Procs. of NDSS 2013. The Internet Society, 2013.

[44] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[45] Mark E. J. Newman. Community detection in networks: Modularity optimization and maximum likelihood are equivalent. CoRR, abs/1606.02319, 2016.

[46] Hiep H. Nguyen, Abdessamad Imine, and Michaël Rusinowitch. Detecting communities under differential privacy. In Proc. 2016 ACM Workshop on Privacy in the Electronic Society (WPES), pages 83–93. ACM Press, 2016.

[47] Shirin Nilizadeh, Apu Kapadia, and Yong-Yeol Ahn. Community-enhanced de-anonymization of online social networks. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, pages 537–548, 2014.

[48] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proc. 39th Annual ACM Symposium on Theory of Computing (STOC), pages 75–84. ACM Press, 2007.

[49] Liudmila Ostroumova Prokhorenkova and Alexey Tikhonov. Community detection through likelihood optimization: In search of a sound model. In Proc. 30th World Wide Web Conference (WWW), pages 1498–1508. ACM Press, 2019.

[50] François Rousseau, Jordi Casas-Roma, and Michalis Vazirgiannis. Community-preserving anonymization of graphs. Knowledge and Information Systems, 54(2):315–343, 2017.

[51] Alessandra Sala, Xiaohan Zhao, Christo Wilson, Haitao Zheng, and Ben Y. Zhao. Sharing graphs using differentially private graph models. In Proc. 11th ACM SIGCOMM Internet Measurement Conference (IMC), pages 81–98. ACM Press, 2011.

[52] Julián Salas and Vicenç Torra. Graphic sequences, distances and k-degree anonymity. Discrete Applied Mathematics, 188:25–31, 2015.

[53] Yazhe Wang, Long Xie, Baihua Zheng, and Ken CK Lee. High utility k-anonymization for social network publishing. Knowledge and Information Systems, 41(3):697–725, 2014.

[54] Yue Wang and Xintao Wu. Preserving differential privacy in degree-correlation based graph generation. Transactions on Data Privacy, 6(2):127–145, 2013.

[55] Yue Wang, Xintao Wu, Jun Zhu, and Yang Xiang. On learning cluster coefficient of private networks. Social Network Analysis and Mining, 3(4):925–938, 2013.

[56] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[57] Gilbert Wondracek, Thorsten Holz, Engin Kirda, and Christopher Kruegel. A practical attack to de-anonymize social network users. In 2010 IEEE Symposium on Security and Privacy, pages 223–238. IEEE, 2010.

[58] Wentao Wu, Yanghua Xiao, Wei Wang, Zhenying He, and Zhihui Wang. K-symmetry model for identity anonymization in social networks. In Procs. of the 13th Int’l Conf. on Extending Database Technology, pages 111–122, 2010.

[59] Qian Xiao, Rui Chen, and Kian-Lee Tan. Differentially private network data release via structural inference. In Proc. 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 911–920. ACM Press, 2014.

[60] Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In Proc. 6th ACM International Conference on Web Search and Data Mining (WSDM), pages 587–596. ACM Press, 2013.

[61] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1):181–213, 2015.

[62] Jaewon Yang, Julian J. McAuley, and Jure Leskovec. Community detection in networks with node attributes. In Proc. 13th IEEE International Conference on Data Mining (ICDM), pages 1151–1156. IEEE Computer Society, 2013.

[63] Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. Private release of graph statistics using ladder functions. In Proc. 36th ACM International Conference on Management of Data (SIGMOD), pages 731–745. ACM Press, 2015.

[64] Yang Zhang, Mathias Humbert, Bartlomiej Surma, Praveen Manoharan, Jilles Vreeken, and Michael Backes. Towards plausible graph anonymization. In Procs. of NDSS 2020, 2020.

[65] Bin Zhou and Jian Pei. The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks. Knowledge and Information Systems, 28(1):47–77, 2011.

[66] Dmitry Zinoviev. Information Diffusion in Social Networks, pages 146–163. 11 2011.

[67] Lei Zou, Lei Chen, and M. Tamer Özsu. K-automorphism: A general framework for privacy preserving network publication. PVLDB, 2(1):946–957, 2009.

A Proof of Proposition 1

Proposition 1. Every graph $G$ of order $n$ satisfies $LS_{Q_a(\mathcal{C})}(G) \le \frac{60}{n}$.

Proof. Let $G \sim_{at} G'$ be two neighbouring attributed graphs and let $G_a$ and $G'_a$ be the auxiliary graphs obtained from $G$ and $G'$, respectively. If the difference between $G$ and $G'$ consists only in one edge, then $G_a = G'_a$, so in what follows we will consider that $G$ and $G'$ differ in one attribute vector. Let $v$ be the (sole) vertex such that $\tau_G(v) \ne \tau_{G'}(v)$. In the worst case, we have that, for every $w \in V \setminus \{v\}$, $(v, w) \in G_a$ and $(v, w) \notin G'_a$ (or vice versa). It was shown in [46] that the modularities of two graphs differing in one edge differ by at most $\frac{3}{m}$, where $m$ is the minimum number of edges. Then, in the worst case we have $LS_{Q_a}(G) \le \frac{3(n-1)}{m_a}$, where $n$ is the order of $G$ and $G'$, and $m_a$ is the minimum number of edges in auxiliary graphs. As we discussed in Sect. 5.1, $m_a \ge \frac{n(n-1)}{20}$, so $LS_{Q_a}(G) \le \frac{60}{n}$. The proof is thus completed.
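The last step of the argument can be spelled out as the following chain of inequalities. The first line reflects our reading of the proof, namely that the single-edge bound from [46] is applied once for each of the at most $n-1$ auxiliary edges incident to $v$; the remaining lines only substitute the lower bound on $m_a$.

```latex
\begin{align*}
LS_{Q_a}(G) &\le (n-1) \cdot \frac{3}{m_a} \\
            &\le \frac{3(n-1)}{n(n-1)/20} \\
            &= \frac{60}{n}.
\end{align*}
```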

B Proof of Proposition 2

Proposition 2. The global sensitivity of $n_C$ is $k$.

Proof. Let $G \sim_{at} G'$ be two neighbouring attributed graphs, let $C \subseteq V$ be a community and let $n_C(G)$ and $n_C(G')$ be the instances of $n_C = (n^1_C, n^2_C, \ldots, n^k_C)$ in $G$ and $G'$, respectively. If the difference between $G$ and $G'$ consists only in one edge, then $n_C(G) = n_C(G')$, so in what follows we will consider that $G$ and $G'$ differ in one attribute vector. Let $v$ be the (sole) vertex such that $\tau_G(v) \ne \tau_{G'}(v)$. If $v \notin C$, then $n_C(G) = n_C(G')$. On the contrary, if $v \in C$, for every component $\ell \in \{1, \ldots, k\}$ such that $\tau_{\ell,G}(v) \ne \tau_{\ell,G'}(v)$, we have that $|n^\ell_C(G) - n^\ell_C(G')| = 1$. In consequence, we have

\[ \Delta n_C = \max_{G \sim_{at} G'} \|n_C(G) - n_C(G')\|_1 = k. \]
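The worst case in this proof can be illustrated with a small sketch. It assumes binary attributes and reads $n^\ell_C$ as the number of nodes of $C$ whose $\ell$-th attribute equals 1; both the reading and all names in the code are assumptions made for illustration only.

```python
def attribute_counts(community, attributes, k):
    """Return (n_C^1, ..., n_C^k): per component, how many community nodes have it set to 1."""
    return [sum(attributes[v][component] for v in community)
            for component in range(k)]


def l1_distance(x, y):
    """L1 distance between two equally long count vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


k = 3
community = {0, 1, 2}
attributes = {0: (1, 0, 1), 1: (0, 0, 1), 2: (1, 1, 0), 3: (0, 1, 1)}

# Neighbouring graph 1: every component of node 1 (inside the community) is flipped.
flipped_inside = dict(attributes)
flipped_inside[1] = tuple(1 - bit for bit in attributes[1])

# Neighbouring graph 2: the attribute vector of node 3 (outside the community) changes.
changed_outside = dict(attributes)
changed_outside[3] = (1, 0, 0)

original = attribute_counts(community, attributes, k)
worst_case = l1_distance(original, attribute_counts(community, flipped_inside, k))
no_effect = l1_distance(original, attribute_counts(community, changed_outside, k))
print(worst_case, no_effect)
```

Flipping all $k$ components of a node inside the community changes each count by exactly one, so the L1 distance equals $k = 3$, whereas changing a node outside the community leaves the counts untouched.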

C Proof of Proposition 3

Proposition 3. The global sensitivity of the number of intra-community triangles of a graph $G$ with a community partition $\mathcal{C}$ is $\Delta n^{\mathrm{intra}}_{\triangle} = \max_{C \in \mathcal{C}} \{|C| - 2\}$.

Proof. Let $G \sim_{at} G'$ be two neighbouring attributed graphs and let $\mathcal{C}$ be a community partition of $V$. Let $n^{\mathrm{intra}}_{\triangle}(G)$ and $n^{\mathrm{intra}}_{\triangle}(G')$ be the numbers of intra-community triangles of $G$ and $G'$, respectively. If the difference between $G$ and $G'$ consists only in one attribute vector, then $n^{\mathrm{intra}}_{\triangle}(G) = n^{\mathrm{intra}}_{\triangle}(G')$, so in what follows we will consider that $G$ and $G'$ differ in one edge. We will assume, without loss of generality, that $E' \setminus E = \{(v, v')\}$. Two cases are possible for $\psi_{\mathcal{C}}(v)$ and $\psi_{\mathcal{C}}(v')$:

(i) $\psi_{\mathcal{C}}(v) \ne \psi_{\mathcal{C}}(v')$. In this case, since $(v, v')$ is an inter-community edge, $n^{\mathrm{intra}}_{\triangle}(G) = n^{\mathrm{intra}}_{\triangle}(G')$.

(ii) $\psi_{\mathcal{C}}(v) = \psi_{\mathcal{C}}(v') = C$. In this case, we have that $n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G) = |C \cap N_G(v) \cap N_G(v')|$, that is, the number of common neighbours of $v$ and $v'$ in the same community.

It is simple to see that every pair of vertices $v$ and $v'$ such that $\psi_{\mathcal{C}}(v) = \psi_{\mathcal{C}}(v') = C$ satisfies $|C \cap N_G(v) \cap N_G(v')| \le |C| - 2$. Hence, $\Delta n^{\mathrm{intra}}_{\triangle} = \max_{C \in \mathcal{C}} \{|C| - 2\}$.
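The two cases of this proof can be reproduced on a toy graph. The sketch below counts intra-community triangles in an undirected graph given as a list of distinct edges; the function name, the community labels and the example graph are illustrative assumptions.

```python
from itertools import combinations


def intra_community_triangles(edges, community_of):
    """Count triangles whose three vertices belong to the same community."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    count = 0
    for u, v in edges:
        if community_of[u] != community_of[v]:
            continue  # an inter-community edge closes no intra-community triangle
        common = adjacency[u] & adjacency[v]
        count += sum(1 for w in common if community_of[w] == community_of[u])
    return count // 3  # every triangle is seen once from each of its three edges


# Community A has 5 vertices and is complete except for the edge (0, 1).
community_of = {0: "A", 1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B"}
members_a = [v for v, c in community_of.items() if c == "A"]
base_edges = [e for e in combinations(members_a, 2) if e != (0, 1)]
base_edges += [(5, 6), (4, 5)]

before = intra_community_triangles(base_edges, community_of)
after_intra = intra_community_triangles(base_edges + [(0, 1)], community_of)
after_inter = intra_community_triangles(base_edges + [(3, 5)], community_of)
print(before, after_intra, after_inter)
```

Adding the intra-community edge (0, 1) creates $|C| - 2 = 3$ new intra-community triangles, which attains the bound, while adding the inter-community edge (3, 5) leaves the count unchanged.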

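D Proof of Proposition 4

Proposition 4. For every graph $G$, every community partition $\mathcal{C}$ of $G$, and every positive integer $t \ge 1$,

\[ LS_{n^{\mathrm{intra}}_{\triangle}}((G, \mathcal{C}), t) = \max_{\{i,j \,\mid\, \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)\}} \left\{ \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |\psi_{\mathcal{C}}(v_i)| - 2 \right\} \right\}, \]

where $a_{ij} = |\{v_\ell \in \psi_{\mathcal{C}}(v_i) \mid A_{i,\ell} = 1 \wedge A_{j,\ell} = 1\}|$ and $b_{ij} = |\{v_\ell \in \psi_{\mathcal{C}}(v_i) \mid A_{i,\ell} \oplus A_{j,\ell} = 1\}|$.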

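Proof. Consider a graph $G$ with a community partition $\mathcal{C}$, and a positive integer $t \ge 1$. As discussed in [48, 63],

\[ LS_{n^{\mathrm{intra}}_{\triangle}}((G, \mathcal{C}), t) = \max_{1 \le i < j \le |V|} LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t). \]

For every $i$ and $j$ such that $\psi_{\mathcal{C}}(v_i) \ne \psi_{\mathcal{C}}(v_j)$, we have that no intra-community triangle is created (resp. destroyed) by the addition (resp. removal) of $(v_i, v_j)$, so $LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t) = 0$. Thus,

\[ LS_{n^{\mathrm{intra}}_{\triangle}}((G, \mathcal{C}), t) = \max_{\{i,j \,\mid\, \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)\}} LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t). \]

We now focus on determining $LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t)$ for every $i$ and $j$ such that $\psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j) = C$. Consider a pair of such values $i$ and $j$, and let $S_1 = \{v_\ell \in C \mid A_{i,\ell} = 1 \wedge A_{j,\ell} = 1\}$ and $S_2 = \{v_\ell \in C \mid A_{i,\ell} \oplus A_{j,\ell} = 1\}$. Note that $S_1$ and $S_2$ are the sets whose cardinalities define $a_{ij}$ and $b_{ij}$, respectively.

Let $\mathcal{G}_t$ be the class of all graphs that can be obtained by modifying $G$ as follows:

1. Add $\min\{b_{ij}, t\}$ arbitrary edges of the form $(x, y)$, where $x \in \{v_i, v_j\}$ and $y \in S_2$.

2. If $t > b_{ij}$, take an arbitrary subset $S_3$ of vertices of $C \setminus (S_1 \cup S_2 \cup \{v_i, v_j\})$, with cardinality $\min\left\{\left\lfloor \frac{t - b_{ij}}{2} \right\rfloor, |C \setminus (S_1 \cup S_2 \cup \{v_i, v_j\})|\right\}$. For every $x \in S_3$, add the edges $(v_i, x)$ and $(v_j, x)$.

From the definition of $\mathcal{G}_t$, it follows that every $G' \in \mathcal{G}_t$ satisfies $\phi(G, G') \le t$ and the graph $G'' \sim_{ij} G'$ satisfies

\[ |n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G'')| = \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |C| - 2 \right\}. \]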

Now, consider an arbitrary graph $G'$, obtained by modifying $G$, such that $\phi(G, G') \le t$ and $G' \notin \mathcal{G}_t$. Also consider the graph $G'' \sim_{ij} G'$. According to the definition of $\mathcal{G}_t$, the following situations are possible:

(i) $G'$ is the result of adding to $G$ a proper subset of the set of edges added by steps 1 and 2 of the procedure described above for obtaining an element of $\mathcal{G}_t$. In this case, only a proper subset of the triangles created (resp. destroyed) by the addition (resp. removal) of $(x, y)$ is added (resp. removed). Thus,

\[ |n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G'')| < \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |C| - 2 \right\}. \]

(ii) $G'$ is the result of applying $t - t'$ additional modifications ($t' < t$) on an element $H$ of $\mathcal{G}_t$. Note that, by the definition of edge-edit distance, the additional modifications do not consist in reverting any edge addition made in steps 1 and 2 of the procedure described above. In this case, none of the additional modifications can result in the addition of a pair of edges of the form $(v_i, x)$ and $(v_j, x)$, with $x \in C$, so

\[ |n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G'')| = |n^{\mathrm{intra}}_{\triangle}(H) - n^{\mathrm{intra}}_{\triangle}(H')| = \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |C| - 2 \right\}, \]

where $H' \sim_{ij} H$.

(iii) In every other case, the transformation that allows one to obtain $G'$ from $G$ can be divided into a set of edge additions as the ones described in (i) and a set of additional modifications as the ones described in (ii). Applying an analogous reasoning, we have that

\[ |n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G'')| < \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |C| - 2 \right\}. \]

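Summing up the set of cases analysed above, we have that, for every $i$ and $j$ such that $\psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)$,

\[ LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t) = \max_{\{G', G'' \,\mid\, \phi(G, G') \le t,\ G' \sim_{ij} G''\}} \left\{ |n^{\mathrm{intra}}_{\triangle}(G') - n^{\mathrm{intra}}_{\triangle}(G'')| \right\} = \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |\psi_{\mathcal{C}}(v_i)| - 2 \right\} \]

and, in consequence,

\[ LS_{n^{\mathrm{intra}}_{\triangle}}((G, \mathcal{C}), t) = \max_{\{i,j \,\mid\, \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)\}} LS_{n^{\mathrm{intra}}_{\triangle ij}}((G, \mathcal{C}), t) = \max_{\{i,j \,\mid\, \psi_{\mathcal{C}}(v_i) = \psi_{\mathcal{C}}(v_j)\}} \left\{ \min\left\{ a_{ij} + \left\lfloor \frac{t + \min\{t, b_{ij}\}}{2} \right\rfloor,\ |\psi_{\mathcal{C}}(v_i)| - 2 \right\} \right\}. \]

The proof is thus completed.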
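The closed-form expression of Proposition 4 can be evaluated directly, as the following sketch suggests. It assumes that the graph is given as a dictionary of neighbour sets and the partition as a list of vertex sets, and that $v_i$ and $v_j$ themselves are excluded when computing $a_{ij}$ and $b_{ij}$, since they cannot act as the third vertex of a triangle on the edge $(v_i, v_j)$. All names are hypothetical and do not refer to our implementation.

```python
from itertools import combinations


def local_sensitivity_at_distance(adjacency, communities, t):
    """Evaluate the expression of Proposition 4 for a positive integer t."""
    best = 0
    for community in communities:
        for vi, vj in combinations(sorted(community), 2):
            others = community - {vi, vj}  # assumption: vi and vj are excluded
            # a_ij: community members adjacent to both vi and vj.
            a_ij = len(others & adjacency[vi] & adjacency[vj])
            # b_ij: community members adjacent to exactly one of vi and vj.
            b_ij = len(others & (adjacency[vi] ^ adjacency[vj]))
            value = min(a_ij + (t + min(t, b_ij)) // 2, len(community) - 2)
            best = max(best, value)
    return best


# Toy graph with a single community {0, 1, 2, 3} and edges (0,2), (1,2), (0,3).
adjacency = {0: {2, 3}, 1: {2}, 2: {0, 1}, 3: {0}}
communities = [{0, 1, 2, 3}]

ls_t1 = local_sensitivity_at_distance(adjacency, communities, 1)
ls_t5 = local_sensitivity_at_distance(adjacency, communities, 5)
print(ls_t1, ls_t5)
```

For the pair $(v_0, v_1)$ we have $a_{01} = 1$ and $b_{01} = 1$, so at $t = 1$ the expression already reaches the cap $|C| - 2 = 2$, and larger values of $t$ cannot increase it.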

E Resistance against community-enhanced re-identification attacks

We implemented the strongest community-enhanced re-identification attack reported in the literature [47], following the specification given in the paper. Fig. 3 shows the success rate of this attack on pseudonymised versions of synthetic graphs generated by our model. In the figure, dashed lines represent the success rates of the attack on the original graphs, whereas solid lines represent the success rates on the pseudonymised synthetic graphs. The figure clearly shows that, in the best cases, the attack identifies around 5% of users in the pseudonymised synthetic graphs generated by our approach. These values are considerably lower than the success rates on the original graphs, which range from 65% to 75%. The success rates obtained by the attack on the original graphs in these experiments are consistent with the ones reported in [47]. The attack’s ineffectiveness on synthetic graphs stems from its underlying assumption that, after anonymisation, the number of changes in the neighbourhood of each individual vertex is relatively small. This assumption does not hold in the scenario of graph synthesis. In this scenario, the methods focus on generating graphs that are similar to the original one in terms of global properties. However, the synthetic edge sets generally differ from the original ones to a larger extent than assumed by the attack. This behaviour effectively thwarts the attack, and is rather independent of epsilon values, as illustrated in Fig. 3. In fact, we verified that the attack is thwarted even when the synthetic graphs are generated from non-private instances of the model.

Fig. 3. Resistance against the community-enhanced re-identification attack from [47].

F Distribution of local clustering coefficients

Fig. 4 shows the comparison of distributions of local clustering coefficients in terms of the complementary cumulative distribution functions (CCDF).


Fig. 4. Comparison of edge generative models in terms of complementary cumulative distribution functions of local clustering coefficient distributions: (a) Petster, (b) Facebook, (c) Epinions8K, (d) Epinions.
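For readers who wish to reproduce this kind of comparison, the following sketch shows one way to compute local clustering coefficients and their CCDF. It assumes an undirected graph given as a dictionary of neighbour sets with comparable node labels, assigns coefficient 0 to nodes of degree below 2, and uses the convention $P(X > x)$ for the CCDF; these conventions and all names are assumptions and may differ from those used to produce Fig. 4.

```python
def local_clustering(adjacency):
    """Local clustering coefficient of every node of an undirected graph."""
    coefficients = {}
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        if degree < 2:
            coefficients[node] = 0.0  # convention assumed for low-degree nodes
            continue
        # Number of edges among the neighbours of the node.
        links = sum(1 for u in neighbours for w in neighbours
                    if u < w and w in adjacency[u])
        coefficients[node] = 2.0 * links / (degree * (degree - 1))
    return coefficients


def ccdf(values, thresholds):
    """Fraction of values strictly greater than each threshold."""
    total = len(values)
    return [sum(1 for x in values if x > th) / total for th in thresholds]


# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
coefficients = local_clustering(adjacency)
curve = ccdf(list(coefficients.values()), [0.0, 0.5, 1.0])
print(coefficients, curve)
```

Two CCDF curves computed in this way, one for the original graph and one for a synthetic graph, can then be plotted on the same axes to compare the two distributions.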
