

Social Discrete Choice Models

KDD'17, August 2017, Halifax, Nova Scotia, Canada

Danqing Zhang* (Civil & Environmental Engineering, UC Berkeley), [email protected]
Kimon Fountoulakis† (Statistics, UC Berkeley), [email protected]
Junyu Cao (Industrial Engineering & Operations Research, UC Berkeley), [email protected]
Mogeng Yin (Civil & Environmental Engineering, UC Berkeley), [email protected]
Michael Mahoney (Statistics, UC Berkeley), [email protected]
Alexei Pozdnoukhov (Civil & Environmental Engineering, UC Berkeley), [email protected]

* Equally contributing author.
† Equally contributing author.

KEYWORDS
Discrete choice models, computational social science, social influence, latent class model, Monte Carlo expectation maximization, alternating direction method of multipliers

1 ABSTRACT
Human decision making underlies the data generating process in multiple application areas, and models explaining and predicting choices made by individuals are in high demand. Discrete choice models are widely studied in economics and computational social sciences. As digital social networking facilitates information flow and the spread of influence between individuals, new advances in modeling are needed to incorporate social information into these models in addition to the characteristic features affecting individual choices.

In this paper, we propose the first latent class discrete choice model that incorporates social relationships among individuals represented by a given graph. We add social regularization to represent similarity between friends, and we introduce latent classes to account for possible preference discrepancies between different social groups. Training of the social discrete choice models is performed using a specialized Monte Carlo expectation maximization algorithm. Scalability to large graphs is achieved by parallelizing computation in both the expectation and the maximization steps. Merging features and graphs together in a single discrete choice model does not guarantee success in terms of prediction. The data generating process needs to satisfy certain properties. We present and explain extensive experiments in which we demonstrate on which types of datasets our model is expected to outperform state-of-the-art models.

2 INTRODUCTION
In many application domains human decision making is modeled by discrete choice models. These models specify the probability that a person chooses a particular alternative from a given choice set, with the probability expressed as a function of observed and unobserved latent variables that relate to the attributes of the alternatives and the characteristics of the person. Multinomial logit models are in the mainstream of discrete choice models, with maximum likelihood used for parameter estimation from manually collected empirical data. It is important for practitioners to interpret the observed choice behaviors, and models that are linear in parameters are most common. At the same time, choice preferences within different social groups (though seemingly similar in terms of the observed characteristics) can vary significantly due to unobserved factors or a different context of the choice process. One way of accounting for this is to introduce latent class models. Latent class logistic regression models are common tools in multiple domains of social science research [4, 15, 18].

It is also recognized that social influence can be a strong factor behind variability in choice behaviors. The impact of social influence on individual decision-making has attracted a lot of attention. Researchers have employed laboratory experiments, surveys, and studied historical datasets to evaluate the impact of social influence on individual decision making. It is, however, difficult to avoid an identification problem in the analysis of influence processes in social networks [12]. One has to account for endogeneity in explanatory variables in order for claims of causality made by these experiments to be useful [8]. Due to these limitations of observational studies of influence, randomized controlled trials are becoming more common. In general, distinguishing social influence in decision making from homophily, which is defined as the tendency for individuals with similar characteristics and choice behaviors to form clusters in social networks, is currently a growing area of research and debate [3, 16].

Data science research in the area is more concerned with developing models that scale to large graphs. Social connectivity information is used to improve the predictive performance of the models, with auto-regressive approaches based on label propagation being widespread. Graph regularization ideas that penalize parameter differences among connected nodes were studied in the context of classification, clustering and recommender systems [1, 11], with recent advances in distributed optimization applied to parametric models on networks [9].

In this paper, we introduce social graph regularization ideas into latent class discrete choice modeling. We aim to combine the expressiveness of parametric model specifications and the descriptive exploratory power of latent class models with recent advances in distributed optimization for parameter estimation. Our model specification is introduced in Section 3. Parameter estimation of the social discrete choice models is performed using a specialized Monte Carlo expectation maximization algorithm presented in Section 4. We adopt modern optimization techniques to achieve scalability to large graphs by parallelizing computation in both the expectation and the maximization steps.


Finally, in Section 5, we present and explain extensive experiments in which we demonstrate on which types of social processes and datasets our model is expected to provide advantages over the state-of-the-art methods. We also describe our approach to dealing with missing labels at nodes that cannot be removed without altering the social structure of the graph. We then draw some conclusions and outline future work, including extensions to deep models, in Section 6.

3 SOCIAL MODELS
In this section we discuss how one can add social factors to choice models and the advantages of doing so. We start by explaining notation. We then provide examples and comment on cases where data are missing and how one can treat this problem.

We define [N] := {1, 2, . . . , N}, i ∈ [N], t ∈ [K], where N and K are integers. We will use the following notation and definitions.

- N: number of individuals,
- K: number of latent classes,
- n: number of samples per node,
- d: number of features for each individual,
- x_i ∈ R^{d×n}: feature-samples matrix of individual i,
- z_i ∈ [K]: latent class variable of individual i,
- y_i ∈ {−1, 1}: binary choice of individual i,
- W_t ∈ R^d: model coefficients of class t,
- b_{it} ∈ R: model offset coefficient of individual i with class t,
- V: set of nodes in a social graph, with each node corresponding to an individual,
- E: set of edges, representing relationships between individuals; (i, j) ∈ E means that there exists an edge between individuals i and j. We assume that each edge in the graph is represented in the set E only once.

We assume that there is only one sample per node, i.e., n = 1; however, the proposed models can be extended to the case of n > 1. We further assume that the graph (V, E) is unweighted, noting that the models can be extended to weighted graphs. In the following subsections we occasionally drop the indices i and t, depending on the context, to simplify notation. We denote by θ := {W, b} the set of model coefficients W_t, b_{it}, ∀i, t.

Let h_{it}(x_i) := W_t^T x_i + b_{it}. We consider the following choice model for individual i of class t:

    y_{it} = 1 if h_{it}(x_i) ≥ c, and y_{it} = −1 otherwise,    (1)

where c ∈ R is a decision threshold constant.

3.1 Social logistic regression
The choice model specified by Eq. (1) includes several known models such as logistic regression, where K = 1 and y_i follows a Bernoulli distribution given x_i. To incorporate the social aspect in logistic regression, one assumes that the parameters b follow an exponential family parametrized by the given graph,

    P(b) ∝ ∏_{(i,j)∈E} e^{−λ(b_i − b_j)²},

where λ ∈ R is a hyper-parameter. This model is usually trained by using a maximum log-likelihood estimator, which reduces to the following regularized logistic regression problem:

    θ* := argmin_θ ∑_{i=1}^N log(1 + e^{−y_i h_i(x_i)}) + λ ∑_{(i,j)∈E} (b_i − b_j)²,

where h_i(x_i) := W^T x_i + b_i. Notice that the social information, i.e., the edges E, appears as Laplacian regularization on the coefficients b.
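For concreteness, below is a minimal numpy sketch (not the authors' code) of evaluating this objective for K = 1; the function name and array shapes are our assumptions.

```python
# A minimal sketch of the social logistic regression objective for K = 1;
# names and shapes are assumptions, not the paper's implementation.
import numpy as np

def social_logreg_loss(W, b, X, y, edges, lam):
    """W: (d,) shared coefficients; b: (N,) per-node offsets;
    X: (N, d) features; y: (N,) choices in {-1, +1};
    edges: list of (i, j) pairs, each edge listed once; lam: lambda."""
    h = X @ W + b                                 # h_i(x_i) = W^T x_i + b_i
    nll = np.sum(np.log1p(np.exp(-y * h)))        # logistic loss per node
    reg = lam * sum((b[i] - b[j]) ** 2 for i, j in edges)  # Laplacian term
    return nll + reg
```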

3.2 Latent class social logistic regression
Our first extension is a logistic regression with latent classes, K > 1. In this model, y_{it} follows a Bernoulli distribution given x_i and z_i = t. To incorporate social information we assume that the latent class variables z_i are distributed based on the following exponential family parametrized by the given social graph,

    P(z; b) ∝ ∏_{(i,j)∈E} exp(−λ ∑_{t=1}^K (b_{it} − b_{jt})² 1(z_i = z_j = t)),

where b represents the collection of coefficients b_{it} ∀i, t, which are the parameters of the distribution. In this model we assume that each individual has its own local coefficient b_{it} for each class. Notice that this model does not penalize different coefficients b_{it} among connected individuals in different classes. This is because we assume that connected individuals in different classes should have independent linear classifiers.
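As a small illustration, the following hedged sketch evaluates the unnormalized log-density log P(z; b) for a given class assignment; the helper name and array shapes are our assumptions.

```python
# Hedged sketch of the unnormalized log-prior log P(z; b).
import numpy as np

def log_prior_z(z, b, edges, lam):
    """z: (N,) class labels in {0, ..., K-1}; b: (N, K) offsets."""
    total = 0.0
    for i, j in edges:
        if z[i] == z[j]:              # the penalty fires only when both
            t = z[i]                  # endpoints share the same class t
            total -= lam * (b[i, t] - b[j, t]) ** 2
    return total                      # log of the unnormalized density
```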

In common latent class models, the latent class variables are independent and identically distributed following the multinomial distribution, i.e., z_i ∼ Mult(π, 1) ∀i, where π is the vector of class probabilities. In our specification, however, the hidden variables are correlated and not identically distributed. Note that this specification allows studying social structures that underlie an observed choice process, and hence extends the range of inferences possible with the state-of-the-art models.

A graphical interpretation of this model is given in Figure 1. Theresulting model can be trained using maximum likelihood and theExpectation-Maximization (EM) algorithm; details are discussed inSection 4.

4 PARAMETER ESTIMATION
We will use the maximum likelihood estimator to estimate the parameters W and b of the social choice models. We will discuss how to train the parameters of the models for the case where there is more than one class, i.e., K > 1. To train models with latent classes it is common to use the Expectation-Maximization (EM) algorithm. In this section we provide details for the Expectation step (E-step) and the Maximization step (M-step).

4.1 Monte Carlo EM
The correlation among latent variables imposed by the social graph does not allow exact calculation of the posterior distributions in the E-step using standard EM approaches. Instead, an approximate calculation of the E-step using Monte Carlo EM (MCEM) [5, 10, 13, 20] is employed. MCEM is a modification of the original EM algorithm in which the E-step is conducted approximately using a Markov chain Monte Carlo (MCMC) algorithm.


Figure 1: Graphical model representation for social logistic regression models with latent variables. We use a modified plate notation to represent conditional dependence among random variables and dependence on parameters. In particular, random variables are represented using circles and their number is shown in brackets inside the circle, i.e., y_i corresponds to K variables. Parameters are represented in rectangles and their size is shown in brackets with two components, i.e., W corresponds to K × d coefficients. Data are shown in rectangles and their size in brackets, i.e., x_i corresponds to d features. There are N nodes in the graph and each node corresponds to a random variable z_i which takes values in [K] := {1, . . . , K}. The hyper-parameter λ is represented using a grey rectangle.

The details of each step of the MCEM algorithm for the proposed models are provided in the following subsections.

Algorithm 1 MCEM algorithm for social choice models
1: Inputs: (x_i, y_i), i = 1 . . . N
2: Initialize: θ^0 := {W^0, b^0} ← arbitrary value, k ← 0
3: repeat
4:   E-step (Subsection 4.1.1): for each node i ∈ [N], calculate the approximate node posterior
       q(z_i = t) := P(z_i = t | y_i, x_i; b^k),
     and, for each edge (i, j) ∈ E, the edge posterior
       q(z_i = z_j = t) := P(z_i = t, z_j = t | y_i, y_j, x_i, x_j; b^k),
     using block MCMC sampling (Algorithm 2).
5:   M-step (Subsection 4.1.2): solve the optimization problem
       θ^{k+1} := argmin_θ Q(θ; x, y),
     where Q(θ; x, y) is defined in (4).
6:   k ← k + 1
7: until termination criteria are satisfied.
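A schematic outline of this loop in Python form is sketched below. The helpers initialize_theta, e_step, m_step, and converged are placeholders standing in for Algorithm 2 and the M-step solver, and the growing MCMC sample size is an assumed schedule, not the paper's exact one.

```python
# Schematic MCEM driver mirroring Algorithm 1; the four helper functions
# are placeholders, not part of the paper's code.
def mcem(x, y, graph, lam, max_iters=50, tol=1e-4):
    theta = initialize_theta(graph)              # arbitrary W^0, b^0
    for k in range(max_iters):
        # E-step: approximate node and edge posteriors via block MCMC
        q_node, q_edge = e_step(x, y, graph, theta,
                                num_samples=100 * (k + 1))
        # M-step: minimize the negative expected log-likelihood Q
        theta_new = m_step(x, y, graph, q_node, q_edge, lam)
        if converged(theta, theta_new, tol):     # e.g., small parameter change
            return theta_new
        theta = theta_new
    return theta
```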

4.1.1 Expectation step. In this step the objective is to compute the marginal posterior distributions for nodes and edges, which will be used in the M-step to calculate the negative expected log-likelihood function. Please refer to the Appendix for the derivation of the negative expected log-likelihood, which reveals the need for the calculation of the marginal posterior distributions.

In particular, for the E-step one needs to calculate the following node marginal posterior probability,

    P(z_i = t | y_i, x_i; θ) = P(y_i | x_i, z_i = t; θ) P(z_i = t; b) / ∑_{s=1}^K P(y_i | x_i, z_i = s; θ) P(z_i = s; b),    (2)

and the following edge posterior probability,

    P(z_i = t, z_j = t | y_i, y_j, x_i, x_j; θ) = P(y_i, y_j | x_i, x_j, z_i = t, z_j = t; θ) P(z_i = t, z_j = t; b) / ∑_{m,q=1}^K P(y_i, y_j | x_i, x_j, z_i = m, z_j = q; θ) P(z_i = m, z_j = q; b),    (3)

where θ represents the collection of parameters W and b. For small graphs we can approximate the above distributions using standard MCMC algorithms. For large graphs please refer to Subsection 4.2.1 on the scalability of the E-step.
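To make Eq. (2) concrete, the following hedged sketch assembles the node posterior from the per-class likelihoods and the prior marginals P(z_i = t; b) produced by the MCMC step; the function and variable names are our assumptions.

```python
# Hedged sketch of Eq. (2): node posterior for a single node.
import numpy as np

def node_posterior(x_i, y_i, W, b_i, prior_i):
    """W: (K, d); b_i, prior_i: (K,) for one node; returns q(z_i = t)."""
    h = W @ x_i + b_i                        # h_{it}(x_i) for all classes t
    lik = 1.0 / (1.0 + np.exp(-y_i * h))     # P(y_i | x_i, z_i = t; theta)
    unnorm = lik * prior_i                   # numerator of Eq. (2)
    return unnorm / unnorm.sum()             # normalize over classes s
```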

4.1.2 Maximization step. Let us denote by q(z_i = t) := P(z_i = t | y_i, x_i; θ) and q(z_i = z_j = t) := P(z_i = t, z_j = t | y_i, y_j, x_i, x_j; θ) the marginal posterior distributions. The M-step of the EM algorithm requires minimizing the negative expected log-likelihood function

    Q(θ; x, y) := ∑_{i∈V} ∑_{t=1}^K q(z_i = t) log(1 + e^{−y_i h_{it}(x_i)}) + λ ∑_{(i,j)∈E} ∑_{t=1}^K (b_{it} − b_{jt})² q(z_i = z_j = t),    (4)

where θ represents the collection of parameters W and b and h_{it}(x_i) := W_t^T x_i + b_{it}. The derivation of this function is given in Subsection 7.1 in the Appendix. For small graphs standard convex optimization solvers can be used. For large graphs please refer to Subsection 4.2.2, where we discuss how to maximize the expected log-likelihood efficiently with a distributed algorithm.
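For illustration, a minimal CVXPY sketch of this small-graph M-step for a single class t is given below; the paper's implementation uses CVXPY with the ECOS solver, but the function name and argument shapes here are our assumptions.

```python
# Hedged sketch of the small-graph M-step for one class t with CVXPY.
import cvxpy as cp

def m_step_class(X, y, edges, q_node_t, q_edge_t, lam):
    """X: (N, d); y: (N,); q_node_t: (N,); q_edge_t: dict keyed by edge."""
    N, d = X.shape
    W = cp.Variable(d)
    b = cp.Variable(N)
    margins = cp.multiply(y, X @ W + b)               # y_i * h_{it}(x_i)
    loss = cp.sum(cp.multiply(q_node_t, cp.logistic(-margins)))
    reg = lam * sum(q_edge_t[(i, j)] * cp.square(b[i] - b[j])
                    for i, j in edges)
    cp.Problem(cp.Minimize(loss + reg)).solve(solver=cp.ECOS)
    return W.value, b.value
```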

4.2 Scalability
In this subsection we discuss how to distribute computation to make our algorithm flexible enough to deal with large graphs.

4.2.1 Scalability of the E-step. Calculating the marginal posterior probabilities (2) and (3) is computationally expensive due to the marginal probabilities P(z_i = t; b) and P(z_i = t, z_j = t; b). This is because to calculate the latter two we have to marginalize over N − 1 and N − 2 latent variables, respectively. We avoid this problem by using a block MCMC sampling technique to compute P(z_i = t; b) and P(z_i = t, z_j = t; b). The algorithm is given in Algorithm 2.

The algorithm uses a preprocessing step to partition the graph into c disjoint communities. It then runs an MCMC algorithm on each community/block in parallel, ignoring the edges among the blocks.

Let us comment on how the number of blocks c affects convergence of the MCMC Algorithm 2 to P(z_i = t; b) and P(z_i = t, z_j = t; b) in practice. In Figure 2 we present the relative error of the MCMC algorithm for various values of c, settings of the parameters b, and ways of grouping nodes for the caveman graph [19].


Algorithm 2 Block MCMC sampling
1: Inputs: graph with nodes V and edges E, parameters θ, number of blocks c
2: Preprocess: use a community detection algorithm, e.g., [14], to partition the graph into c disjoint communities/blocks C := {C_1, . . . , C_c}, with ⋃_{q=1}^c C_q = V and ⋂_{q=1}^c C_q = ∅.
3: parfor each community in C do
4:   Compute the node and edge marginal probabilities P(z_i = t; b) and P(z_i = t, z_j = t; b), respectively, using an MCMC sampling algorithm, taking into account only edges among nodes in the community.
5: end parfor
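Below is a hedged sketch of Algorithm 2's inner computation: a Gibbs sampler over the prior P(z; b), run block by block while ignoring cross-block edges. Sweep counts, burn-in, and data structures are our assumptions; the edge marginals P(z_i = z_j = t; b) can be accumulated analogously, and the loop over blocks is embarrassingly parallel in practice.

```python
# Hedged sketch: per-block Gibbs sampling of the prior P(z; b).
import numpy as np

def block_marginals(b, communities, adj, lam, K, sweeps=500, burn=100):
    """b: (N, K) offsets; communities: list of node sets; adj: neighbor lists."""
    rng = np.random.default_rng(0)
    marg = np.zeros((b.shape[0], K))
    for block in communities:
        z = {i: int(rng.integers(K)) for i in block}
        for s in range(sweeps):
            for i in block:
                # conditional of z_i given its within-block neighbors only
                logits = np.array([
                    -lam * sum((b[i, t] - b[j, t]) ** 2
                               for j in adj[i] if j in z and z[j] == t)
                    for t in range(K)])
                p = np.exp(logits - logits.max())
                z[i] = int(rng.choice(K, p=p / p.sum()))
            if s >= burn:
                for i in block:
                    marg[i, z[i]] += 1
    return marg / (sweeps - burn)    # estimated node marginals P(z_i = t; b)
```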

Figure 2: Accuracy of the blocked MCMC Algorithm 2. Panels: (a) node probabilities, grouped b; (b) node probabilities, random b; (c) edge probabilities, grouped b; (d) edge probabilities, random b; (e) caveman graph, three blocks; (f) caveman graph, random blocks. Details are provided in Section 4.2.1.

The caveman graph consists of three communities of ten nodes each, see Figure 2e. We consider two classes for the latent variables. We make use of two ways of grouping nodes: the first uses a community detection technique [14], see Figure 2e; the second groups nodes randomly, see Figure 2f. Moreover, we consider two settings of b. 1) The parameters b are separated in three groups that correspond to the three groups of nodes of the graph. For each group the parameters b follow a Gaussian distribution with a different mean, such that the parameters are significantly different among groups but similar within groups. Figures 2a and 2c correspond to this first setting of the parameters b. 2) The parameters b follow the same Gaussian distribution regardless of the groups of the graph. Figures 2b and 2d correspond to this second setting. The two settings correspond to the last and the first iterations of the MCEM algorithm, respectively, since we expect that close to stationary points of the likelihood function the parameters b are grouped based on the communities of the given graph, while at the first iterations the parameters b have a random setting; we discuss this in Section 5.

Figures 2a and 2c illustrate the accuracy of the block MCMC algorithm for node and edge probabilities, respectively. Notice that the accuracy of block MCMC is overall better when community detection is used compared to random assignment of nodes to communities. This is because the preprocessing helps to minimize the number of ignored edges among the blocks. On the other hand, in Figures 2b and 2d, when the parameters b are not grouped, community detection does not perform better than random assignment of nodes to communities. Moreover, notice in Figures 2b and 2d that the performance of block MCMC does not improve as the number of samples increases. This happens because the parameters b are randomly generated and not grouped; then, based on the MCMC updating rule, we get that P(z_i = t; b) ≈ 1 for t = 0 and P(z_i = t; b) ≈ 0 for t = 1.

Let us now comment briefly on the theoretical asymptotic convergence of MCEM to a stationary point of the likelihood function. The convergence theory of MCEM in [5, 13] states that if Algorithm 2 is set to work with one block, c = 1 (no partition of the graph), and the MCMC sample size increases deterministically across MCEM iterations, then MCEM converges almost surely. A consequence of blocking the latent variables for the MCMC algorithm is that asymptotic convergence of MCEM is no longer guaranteed. However, in practice MCEM is often terminated without even knowing whether the algorithm converges to an accurate solution; see for example Section 5 in [13] and references therein about arbitrary termination criteria for MCEM. Therefore, we consider that the parallelism of the block MCMC Algorithm 2 offers a trade-off between convergence and computational complexity per iteration by controlling the number of blocks, which in practice can speed up each iteration of the MCEM algorithm significantly.

4.2.2 Scalability of the M-step. In this section we discuss how to minimize the negative expected log-likelihood function of the M-step of the MCEM Algorithm 1 in a distributed manner. Following the work of [9], which applies ADMM to the network lasso method, we extend it by allowing both local and global variables on nodes in the presence of latent classes. We focus the derivation on the example of social logistic regression with latent classes, which was discussed in Subsection 3.2. To simplify notation we do not provide details for the social neural network model, but we comment at appropriate places on how ADMM can be applied to that model.

First notice that the negative expected log-likelihood function (4) can be separated into K independent problems, one for each class t. Therefore, in this section we discuss how to apply distributed ADMM to each problem. Let

    Q(θ; x, y, t) := ∑_{i∈V} q(z_i = t) log(1 + e^{−y_i h_{it}(x_i)}) + λ ∑_{(i,j)∈E} (b_{it} − b_{jt})² q(z_i = z_j = t)    (5)

be the objective function for the problem with latent class t, where θ represents the collection of parameters W and b and h_{it}(x_i) := W_t^T x_i + b_{it}. Then Q(θ; x, y) in (4) is equal to ∑_{t=1}^K Q(θ; x, y, t). Let us define q_{it} := q(z_i = t) and q_{ijt} := q(z_i = z_j = t).


To minimize (5) for a given class t using ADMM, we introduce a copy of b_{it} denoted by z_{ij}, ∀i, and a copy of W_t denoted by g_i, ∀i:

    min_{W,b} ∑_{i∈V} q_{it} log(1 + e^{−y_i h_{it}(x_i)}) + λ ∑_{(i,j)∈E} q_{ijt} (z_{ij} − z_{ji})²
    s.t. b_{it} = z_{ij}, j ∈ N(i), ∀i,
         W_t = g_i, ∀i,                                                      (6)

where N(i) denotes the adjacent nodes of node i. By introducing copies for b_{it} ∀i we dismantle the sum over edges into separable functions; additionally, by introducing copies for W_t, we dismantle the sum over the nodes for the logistic function. Then, by relaxing the constraints, we can make problem (6) separable, which opens the door to distributed computation.

We define the augmented Lagrangian below, where u and h are the dual variables and ρ_1 and ρ_2 are the penalty parameters:

    L_ρ(W_t, b_t, g, z, u, h; t) := ∑_{i∈V} { q_{it} log(1 + e^{−y_i (g_i^T x_i + b_{it})}) + (ρ_1/2)(‖h_i‖_2² + ‖W_t − g_i + h_i‖_2²) }
        + ∑_{(i,j)∈E} { λ q_{ijt} (z_{ij} − z_{ji})² + (ρ_2/2)(‖u_{ij}‖_2² + ‖u_{ji}‖_2² + ‖b_{it} − z_{ij} + u_{ij}‖_2² + ‖b_{jt} − z_{ji} + u_{ji}‖_2²) }.

The resulting ADMM algorithm is presented in Algorithm 3, where f(z_{ij}, z_{ji}) := L_ρ(W_t^{k+1}, b_t^{k+1}, g^{k+1}, (z_{ij}, z_{ji}, z^k_{(ij)^c}), u^k, h^k; t). Notice that the subproblems in Step 4 do not have closed form solutions; however, they can be solved efficiently using an iterative algorithm, since they are univariate problems which depend only on x_i and not all the data. Similarly, the subproblems in Step 5 do not have a closed form solution, but they have only d unknown variables and depend only on x_i and not all the data. Moreover, Step 6 has a closed form solution, which corresponds to solving a 2 × 2 linear system. Observe that the ADMM Algorithm 3 can be run in a distributed setting by distributing the data among processors, as the subproblems in Steps 4 and 5 depend only on x_i and not all the data.

Algorithm 3 ADMM for problem (6)
1: Initialize: k ← 0, W^k, b^k, g^k, z^k, u^k and h^k
2: repeat
3:   Set W_t^{k+1} = (1/N) ∑_{i=1}^N (g_i^k − h_i^k)
4:   b_{it}^{k+1} := argmin_{b_{it}} L_ρ(W_t^{k+1}, b_{it}, g^k, z^k, u^k, h^k; t)  ∀i ∈ V
5:   g_i^{k+1} := argmin_{g_i} L_ρ(W_t^{k+1}, b_t^{k+1}, g_i, z^k, u^k, h^k; t)  ∀i ∈ V
6:   z_{ij}^{k+1}, z_{ji}^{k+1} := argmin_{z_{ij}, z_{ji}} f(z_{ij}, z_{ji})  ∀(i, j) ∈ E
7:   Set h_i^{k+1} = h_i^k + (W_t^{k+1} − g_i^{k+1})  ∀i ∈ V
       u_{ij}^{k+1} = u_{ij}^k + (b_{it}^{k+1} − z_{ij}^{k+1})  ∀(i, j) ∈ E
       u_{ji}^{k+1} = u_{ji}^k + (b_{jt}^{k+1} − z_{ji}^{k+1})
8:   k ← k + 1
9: until termination criteria are satisfied.
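As an illustration of the closed-form z-update (the 2 × 2 linear system mentioned above), the sketch below solves, for one edge (i, j), the subproblem min over (z_{ij}, z_{ji}) of λ q_{ijt}(z_{ij} − z_{ji})² + (ρ_2/2)[(b_{it} − z_{ij} + u_{ij})² + (b_{jt} − z_{ji} + u_{ji})²]. This derivation and the helper name are ours, not the paper's code.

```python
# Hedged sketch of the closed-form z-update for one edge (i, j),
# obtained by setting the two partial derivatives to zero.
import numpy as np

def z_update(b_it, b_jt, u_ij, u_ji, q_ijt, lam, rho2):
    k = 2.0 * lam * q_ijt
    A = np.array([[k + rho2, -k],
                  [-k, k + rho2]])
    rhs = rho2 * np.array([b_it + u_ij, b_jt + u_ji])
    z_ij, z_ji = np.linalg.solve(A, rhs)   # stationarity conditions
    return z_ij, z_ji
```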

5 EXPERIMENTS
In this section we analyze the empirical performance of the proposed social model on a range of datasets. We outline practical recommendations and illustrate examples where the proposed model is most suitable.

Our implementation reproducing the results below is available at https://github.com/DanqingZ/social-DCM. The number of iterations of the Gibbs sampler in the E-step grows with the number of iterations of the MCEM algorithm. The M-step is implemented using the ECOS solver [7] embedded in CVXPY [6] for the W and b updates within the ADMM iterations.

5.1 Synthetic data
We demonstrate how having choices separable on the given graph helps to improve prediction when social information is taken into account, in the following two ways: i) varying the connectivity between different classes in the graph; ii) varying the discrepancies of choices in communities of the graph. In what follows, when we mention separability of choices on the graph we mean that the decisions y_i are clustered based on communities in the graph. By separable in the feature space we mean that there exist linear hyperplanes which separate the data x_i accurately.

5.1.1 Varying connectivity between classes. We consider N = 300 individuals; each node corresponds to one feature vector x_i, with d = 10 and y_i ∈ {−1, 1} ∀i. 150 feature vectors x_i ∈ R^d are generated using a Gaussian distribution where each component of x_i follows G(μ = 10, σ = 0.1), and the remaining 150 feature vectors have components which follow G(μ = 30, σ = 0.1). We randomly split the individuals into three communities with 100 individuals each. We assume that there are two classes, i.e., K = 2, and that the individuals in two out of three communities are in the first class, i.e., t = 1, while the individuals in the remaining community are in the second class, i.e., t = 2. The communities are randomly assigned to classes. All individuals in the first class have y_i = 1, while all individuals in the second class have y_i = −1. Notice that the x_i vectors are assigned to communities regardless of their Gaussian distribution, and y_i is set based on the classes, which correspond to communities. Therefore, the choices y_i are separable on the graph but not separable in the feature space. We set the probability that two individuals in the same community are connected to 0.2, the probability that two individuals in the same class but not in the same community are connected to 0.01, and the probability that two individuals in different classes are connected to β, which is the parameter that controls the connectivity between classes. A data-generation sketch is given below.
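The sketch follows our reading of this setup; the helper name, seed, and the fixed community-to-class assignment are assumptions (the paper assigns communities to classes randomly).

```python
# Hedged sketch of the synthetic data-generating process of Section 5.1.1.
import numpy as np

def make_synthetic(beta, seed=0):
    rng = np.random.default_rng(seed)
    N, d = 300, 10
    X = np.vstack([rng.normal(10, 0.1, (150, d)),      # first Gaussian blob
                   rng.normal(30, 0.1, (150, d))])     # second Gaussian blob
    comm = rng.permutation(np.repeat([0, 1, 2], 100))  # three communities
    cls = np.where(comm < 2, 1, 2)    # two communities -> class 1, one -> class 2
    y = np.where(cls == 1, 1, -1)     # choices follow the classes
    edges = []
    for i in range(N):
        for j in range(i + 1, N):
            if comm[i] == comm[j]:
                p = 0.2               # same community
            elif cls[i] == cls[j]:
                p = 0.01              # same class, different community
            else:
                p = beta              # different classes
            if rng.random() < p:
                edges.append((i, j))
    return X, y, comm, cls, edges
```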

Figure 3 shows the graph structure for β = 10^{−4}, β = 10^{−2}, and β = 10^{−1}. Notice that the larger β is, the more edges there are among the communities that belong to different classes. Table 1 shows the prediction results of four models as a function of β. Since x_i and y_i do not change as β changes, the predictions of logistic regression and logistic regression with latent classes remain constant at 62%. These models perform poorly because the choices y_i are not separable given the feature vectors x_i alone. Observe in Table 1 that when β is as small as 10^{−4}, which means that individuals in different classes are very unlikely to be connected (see Figure 3a), the prediction accuracy of the proposed social models is above 80%.


Figure 3: Graph structure for (a) β = 10^{−4}, (b) β = 10^{−2}, (c) β = 10^{−1}. The nodes with square shape and yellow color correspond to class t = 1. The nodes with triangle shape and turquoise color correspond to class t = 2. The larger β, the more edges among nodes with different classes.

On the other hand, when β becomes larger, the prediction accuracy of the social models declines. But as long as β < 0.1, the proposed social model with latent classes (see Subsection 3.2) performs better than the logistic regression and logistic regression with latent class models. When β = 0.1 the social models have the same prediction performance as the logistic regression and latent class models. This is because the classes are not clearly separable on the graph, see Figure 3c.

Table 1: Prediction results on a randomly chosen test set of 50 individuals when β is varied, i.e., the connectivity between classes. For all models, the regularization parameter λ which corresponds to the best prediction result out of a range of parameters is chosen.

model \ β               10^{−4}   10^{−3}   5 × 10^{−3}   10^{−2}   10^{−1}
logistic reg.           62%       62%       62%           62%       62%
log. reg. lat. class    62%       62%       62%           62%       62%
social no lat. class    80%       62%       62%           62%       62%
social with lat. class  88%       82%       64%           62%       62%

5.1.2 Varying choice preference parameters. An ideal scenario for the proposed social models is when classes correspond to communities of the given graph and the choices y_i are clustered according to the classes. However, choices y_i might be misplaced in wrong classes. We study how the preference difference between classes affects the performance of the proposed model.

For this experiment individual feature vectors x_i are generated from a Gaussian distribution for a set of N = 200 individuals; each component of x_i follows G(μ = 1, σ = 0.1). We randomly split this set of individuals into two equal parts, each part representing a class where individuals share the same parameters w. Let us assume the w in the first class is w_1 and the w in the second class is w_2 = −w_1. In our experiments we vary the parameter w_1. For each individual, b_i satisfies b_i ∼ G(0, 0.1). For the graph setting, we set the probability that people in the same class are connected to 0.2 and the probability that people in different classes are connected to 10^{−4}. This way, we make sure classes correspond to communities.

Based on this data generation process, by tuning ‖w_1‖ we are able to get full control of the preference difference among individuals in the two classes.

Figure 4: Three synthetic examples, (a) ‖w‖_2 = 0, (b) ‖w‖_2 = 4, (c) ‖w‖_2 = 6, showing the influence of the parameter w on class preference. The nodes with square shape and yellow color correspond to choice y_i = 1. The nodes with triangle shape and turquoise color correspond to choice y_i = −1.

When ‖w_1‖ becomes larger, the preference difference becomes larger as well. As we can see in Figure 4, when ‖w_1‖ becomes larger, more individuals in class one have y_i = 1 (i.e., yellow squares) and more individuals in class two have y_i = −1 (i.e., turquoise triangles). When w_1 = 0, in both classes around half of the individuals have y_i = 1 and the other half y_i = −1, which means there is no preference difference between the two classes. Prediction results for this experiment are shown in Table 2.

Table 2: Prediction results on a randomly chosen test set of 50 individuals when ‖w‖_2 is varied. For all models, the regularization parameter λ which corresponds to the best prediction result out of a range of parameters is chosen.

model \ ‖w‖_2           10     5      3     2     1     0
logistic reg.           48%    44%    42%   52%   58%   52%
log. reg. lat. class    48%    48%    54%   54%   42%   36%
social with lat. class  100%   100%   94%   86%   68%   36%

5.2 Adolescent smoking
This example uses a dataset collected by [2]. This research program, known as the Teenage Friends and Lifestyle Study, conducted a longitudinal survey of friendships and the emergence of the smoking habit (among other deviant behaviours) in teenage students across multiple schools in Glasgow, Scotland.

5.2.1 Dataset. The social graphs of 160 students (shown in Figure 5) within the same age range of 13-15 years are constructed following surveyed evidence of reciprocal friendship, with an edge placed between individuals i and j if individual i and individual j named each other friends. We included five variables in the feature vector x_i: age; gender; money, indicating how much pocket money the student had per month, in British pounds; romantic, indicating whether the student is in a romantic relationship; and family smoking, indicating whether there were family members smoking at home. Notice that the feature vectors x_i and the edges of the graph differ between timestamps. The response variable y_i represents the stated choice of whether a student smokes tobacco (y_i = 1) or not (y_i = −1).


Figure 5: Social graphs of student friendships and smoking behaviors within the two-year period of the study. (a) Start of the study (Feb 1995): 127 non-smokers (blue), 23 smokers (red), 10 unobserved (yellow), 422 edges. (b) End of the study (Jan 1997): 98 non-smokers (blue), 39 smokers (red), 23 unobserved (yellow), 339 edges.

5.2.2 Missing labels. In practice one might fully observe the graph structure, but not be able to observe the choices y_i of every individual. Let us denote by V the set of nodes for which one observes the choices y_i, and by V̄ the set of nodes for which the choices are not revealed. One approach to model specification in this case would be to remove the nodes V̄ and their edges from the graph, disregarding their data x_i. However, here we empirically illustrate that it is preferable to keep the nodes V̄, their edges and their coefficients b_i in the model, and simply set their unrevealed choices to y_i = 0, ∀i ∈ V̄. The impact of this specification on parameter estimation can be illustrated as follows. For example, for social logistic regression with K = 1 the maximum log-likelihood problem reduces to

    θ* := argmin_θ ∑_{i∈V} log(1 + e^{−y_i h_i(x_i)}) + λ ∑_{(i,j)∈E} (b_i − b_j)²,

where the first sum ignores the nodes in V̄, but the second sum maintains their edges and coefficients b_i.

Maintaining the nodes with b_i ∀i ∈ V̄ and their edges in the graph corresponds to additional regularization among nodes, since two nodes that are connected through a node with no observed choice are indirectly forced to have similar coefficients b_i. This is particularly important in applications where there are strong reasons to believe that social influence is a significant factor driving the choices of the individuals. To provide further intuition, consider a thought experiment of comparing the structure of the social graph before and after removing the nodes in V̄ (shown yellow in Figure 5) and their edges. Our experiments suggest that maintaining the density of the graph helps avoid overfitting and results in estimating models with more accurate out-of-sample predictions. A sketch of this masking is given below.
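The sketch assumes the y_i = 0 convention described above: unobserved nodes contribute a constant log 2 to the loss, so only their offsets b_i participate, through the edge penalty. Names and shapes are assumptions.

```python
# Minimal sketch of the missing-label masking for K = 1.
import numpy as np

def masked_loss(W, b, X, y_obs, edges, lam):
    """y_obs: (N,) in {-1, 0, +1}, with 0 marking a missing label."""
    h = X @ W + b
    nll = np.sum(np.log1p(np.exp(-y_obs * h)))   # y_i = 0 gives log 2
    reg = lam * sum((b[i] - b[j]) ** 2 for i, j in edges)
    return nll + reg
```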

5.2.3 Model comparison. We measured predictions on this dataset and report here 3-fold cross validation results of four models: i) logistic regression; ii) latent class logistic regression; iii) social logistic regression, see Subsection 3.1; iv) social latent class logistic regression, see Subsection 3.2. The cross-validation process treats the removed nodes as nodes with missing labels, as described above. We consider K = 2 latent classes in this experiment. The prediction performance is shown in Tables 3 and 4.

Table 3: Adolescent smoking prediction, February 1995

model                                     accuracy
logistic regression                       81.1%
latent class logistic regression          78.9%
social logistic regression                80.0%
social latent class logistic regression   82.2%

Table 4: Adolescent smoking prediction, January 1997

model                                     accuracy
logistic regression                       68.5%
latent class logistic regression          72.1%
social logistic regression                65.5%
social latent class logistic regression   77.1%

The high performance of the logistic regression and latent class models for the beginning of the study (Table 3) indicates that the smoking preferences are largely defined by the objective factors. Moreover, the parameters in the underlying classes of the latent class model do not differ much. The social latent class model performs equally well. This difference grows significantly larger when the confounding variable of smoking in the family is removed from the feature list.

Furthermore, by the end of the study (January 1997, Table 4) one can see a significantly higher predictive accuracy of the social latent class logistic regression model. It may indicate that smoking within a certain group of teenagers has become a social norm (indicated by the difference in the offset parameters b_{it}), or that the response y_i to independent factors x_i within that group differs from the others, as reflected by differences in W_t.

Notice the significant decrease of prediction accuracy for the non-social models between Tables 3 and 4. This is because at the beginning of the study (February 1995) less than 15% of teenagers smoked, while at the end of the study (January 1997) about 25% of teenagers smoked. This difference is likely due to social norms developed among teenagers, which are captured by the graph and therefore missed by the non-social models.

5.2.4 Parameter visualization. We now illustrate how the specification of the proposed model allows in-depth exploration of the parameters to assist in drawing conclusions of this type. To that end, we explore the parameters b_{it} and the class membership probabilities across the regularization path of the hyper-parameter λ.

The estimated class membership (probability of being in a given latent class) on the graph is shown in Figure 6. A visualization of the estimated b_{it} for one latent class for several values of λ is shown in Figure 7. When λ = 0.1, a smoking pattern begins to show across the graph, i.e., compare Figures 5b and 7b. Let us note that the value λ = 0.1 corresponds to the best prediction accuracy in our experiments for the proposed social latent class logistic regression model on the smoking data at the end of the study (January 1997). This is because the social latent class logistic regression model is able to clearly distinguish a group of socially connected individuals within which the choice preferences towards smoking are higher.

When λ = 0.01, the b_{it} are similar across the nodes; on the other hand, when λ = 10, the b_{it} are not similar across the nodes.


Figure 6: Class membership probabilities estimated for the nodes for multiple values of λ: (a) λ = 10, (b) λ = 0.1, (c) λ = 0.01. Blue, yellow, green, black and red colours correspond to probabilities of about 0, in (0, 0.5), 0.5, in (0.5, 1), and 1, respectively.

Figure 7: The values of b_{it} estimated for the nodes for multiple values of λ: (a) λ = 10, (b) λ = 0.1, (c) λ = 0.01. Lighter color corresponds to higher values.

Although this is counter-intuitive, since the graph regularization favours similar b_{it} across nodes for large values of λ, it is explained by the node and edge posterior distributions in the M-step, which also control regularization across nodes during the execution of the algorithm.

6 CONCLUSIONS AND FUTURE WORK
In this paper we introduced social graph regularization ideas into latent class discrete choice modeling. We have developed, implemented, and explored parameter estimation algorithms that allow parallel processing implementations for both the E- and M-steps, benefiting from recent advances in distributed optimization based on ADMM methods. In the experimental evaluation, we have focused on investigating the usefulness of the model in revealing and supporting hypotheses in studies where not only predictive performance (which was found to be highly competitive), but also understanding social influence is crucial. Our model can be directly applied to study social influence on revealed choices in large social graphs with rich node attributes (which are very rarely available in open access due to privacy issues).

Another promising direction for future work is as follows. With the recent work of [17], the specification we propose can accommodate decision functions of multiple layers L conditional on the latent class z_i ∈ [K], with L > 1 and K ≥ 1. Similarly to the "shallow" model specification, to incorporate social information we assume that the variables z_i are distributed based on an exponential family parametrized by the given graph, but for this extension we take into account multiple layers, introducing non-linearity in the choice dependency on x_i.

Figure 8: Graphical model representation for a deep social choice model with latent variables. In addition to the notation in Figure 1, it includes rounded rectangles which denote groups of coefficients.

In particular, we assume that

    P(z; b) ∝ ∏_{(i,j)∈E} exp(−λ ∑_{l,t=1}^{L,K} (b_{itl} − b_{jtl})² 1(z_i = z_j = t)),

where b represents the collection of coefficients b_{itl} ∀i, t, l. The graphical representation of this model is shown in Figure 8. The major difference, and an advantage, of the deep social choice model over the social latent class logistic regression model introduced in Subsection 3.2 is the non-linearity of the decision y_{it} in (1). As shown in Figure 8, the structure of the graphical model remains the same, except that the additional layers result in an increase in the number of b and W coefficients that need to be estimated using maximum likelihood and the Monte Carlo expectation-maximization algorithm. In the M-step, one has to introduce copies of W and b for each layer in order to make the objective function separable [17].

7 APPENDIX
7.1 Negative expected log-likelihood in Eq. (4)
We denote by q(z) := P(z | y, x; θ) the posterior distribution, by ∑_z the sum over all latent variables z, and θ represents the collection of parameters W and b. Let 1(z_i = t) be the indicator function, which is equal to 1 if z_i = t. We assume that the latent variables follow the distribution

    P(z; b) ∝ ∏_{(i,j)∈E} exp(−λ ∑_{t=1}^K (b_{it} − b_{jt})² 1(z_i = z_j = t)),

and

    P(y_i | x_i, z_i = t; θ) = 1 / (1 + e^{−y_i h_{it}(x_i)}),    (7)


where h_{it}(x_i) := W_t^T x_i + b_{it}. The derivation of the expected log-likelihood is shown below.

    Q(θ; x, y) := ∑_z q(z) log P(y, z | x; θ)
                = ∑_z q(z) log(P(y | x, z; θ) P(z; b))
                = ∑_z q(z) log P(y | x, z; θ) + ∑_z q(z) log P(z; b)
                = ∑_z q(z) log(∏_{i∈V} ∑_{t=1}^K P(y_i | x_i, z_i = t; θ) 1(z_i = t))
                  − λ ∑_z q(z) ∑_{(i,j)∈E} ∑_{t=1}^K (b_{it} − b_{jt})² 1(z_i = z_j = t)
                = ∑_z q(z) ∑_{i∈V} log(∑_{t=1}^K P(y_i | x_i, z_i = t; θ) 1(z_i = t))
                  − λ ∑_z q(z) ∑_{(i,j)∈E} ∑_{t=1}^K (b_{it} − b_{jt})² 1(z_i = z_j = t).

We can exchange the order of the log and the sum ∑_{t=1}^K because each node can only be in one class; thus we have

    Q(θ; x, y) = ∑_z q(z) ∑_{i∈V} ∑_{t=1}^K 1(z_i = t) log P(y_i | x_i, z_i = t; θ)
               − ∑_z q(z) ∑_{(i,j)∈E} ∑_{t=1}^K λ (b_{it} − b_{jt})² 1(z_i = z_j = t).

Let us move the summation over z inside the summation over vertices and the summation over latent classes,

    Q(θ; x, y) = ∑_{i∈V} ∑_{t=1}^K ∑_z q(z) 1(z_i = t) log P(y_i | x_i, z_i = t; θ)
               − λ ∑_{(i,j)∈E} ∑_{t=1}^K ∑_z q(z) (b_{it} − b_{jt})² 1(z_i = z_j = t),

and then we get the marginal probabilities

    Q(θ; x, y) = ∑_{i∈V} ∑_{t=1}^K log P(y_i | x_i, z_i = t; θ) ∑_z q(z) 1(z_i = t)
               − λ ∑_{(i,j)∈E} ∑_{t=1}^K (b_{it} − b_{jt})² ∑_z q(z) 1(z_i = z_j = t)
               = ∑_{i∈V} ∑_{t=1}^K q(z_i = t) log P(y_i | x_i, z_i = t; θ)
               − λ ∑_{(i,j)∈E} ∑_{t=1}^K (b_{it} − b_{jt})² q(z_i = z_j = t).

Using (7) and multiplying Q(θ; x, y) by minus one, we obtain the negative expected log-likelihood function in (4).

REFERENCES
[1] J. Abernethy, O. Chapelle, and C. Castillo. 2010. Graph regularization methods for web spam detection. Mach. Learn. 81, 2 (2010), 207–225.
[2] H. Bush, Patrick West, and Lynn Michell. 1997. The role of friendship groups in the uptake and maintenance of smoking amongst pre-adolescent and adolescent children: Distribution of frequencies. Glasgow, Scotland: Medical Research Council (1997).
[3] Nicholas A. Christakis and James H. Fowler. 2013. Social contagion theory: examining dynamic social networks and human behavior. (2013). DOI:http://dx.doi.org/10.1002/sim.5408
[4] H. Chung, B. P. Flaherty, and J. L. Schafer. 2006. Latent class logistic regression: application to marijuana use and attitudes among high school seniors. Journal of the Royal Statistical Society: Series A (Statistics in Society) 169, 4 (2006), 723–743.
[5] B. Deylon, M. Lavielle, and E. Moulines. 1999. Convergence of a stochastic approximation version of the EM algorithm. The Annals of Statistics 27, 1 (1999), 94–128.
[6] Steven Diamond and Stephen Boyd. 2016. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17, 83 (2016), 1–5.
[7] Alexander Domahidi, Eric Chu, and Stephen Boyd. 2013. ECOS: An SOCP solver for embedded systems. In Control Conference (ECC), 2013 European. IEEE, 3071–3076.
[8] Elenna Dugundji and Joan Walker. 2005. Discrete choice with social and spatial network interdependencies: An empirical example using mixed generalized extreme value models with field and panel effects. Transp. Res. Rec. 1921, 1 (2005), 70–78. DOI:http://dx.doi.org/10.3141/1921-09
[9] David Hallac, Jure Leskovec, and Stephen Boyd. 2015. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 387–396.
[10] R. A. Levine and G. Casella. 2012. Implementations of the Monte Carlo EM algorithm. Journal of Computational and Graphical Statistics 10, 3 (2012), 422–439.
[11] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. 2011. Recommender systems with social regularization. In WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining (2011), 287–296.
[12] Charles F. Manski. 1993. Identification of endogenous social effects: The reflection problem. Rev. Econ. Stud. 60, 3 (1993), 531–542. DOI:http://dx.doi.org/10.2307/2298123
[13] Ronald C. Neath and others. 2013. On convergence properties of the Monte Carlo EM algorithm. In Advances in Modern Statistical Theory and Applications: A Festschrift in Honor of Morris L. Eaton. Institute of Mathematical Statistics, 43–62.
[14] T. P. Peixoto. 2014. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E 89 (2014), 012804.
[15] K. Roeder, K. G. Lynch, and D. S. Nagin. 1999. Modeling uncertainty in latent class membership: A case study in criminology. J. Amer. Statist. Assoc. 94, 447 (1999), 766–776.
[16] C. R. Shalizi and A. C. Thomas. 2011. Homophily and contagion are generically confounded in observational social network studies. Sociol. Methods Res. 40, 2 (2011), 211–239. arXiv:1004.4704
[17] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. 2016. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning.
[18] J. L. Walker and J. Li. 2007. Latent lifestyle preferences and household location decisions. Journal of Geographical Systems 9, 1 (2007), 77–101.
[19] D. J. Watts. 1999. Networks, dynamics, and the small-world phenomenon. Amer. J. Soc. 105 (1999), 493–527.
[20] G. C. G. Wei and M. A. Tanner. 1990. A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. J. Amer. Statist. Assoc. 85, 411 (1990), 699–704.