density-based place clustering using geo-social network data · 2019. 2. 26. · 9, =∪ † † ¿...

14
1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEE Transactions on Knowledge and Data Engineering 1 Density-based Place Clustering Using Geo-Social Network Data Dingming Wu, Jieming Shi, and Nikos Mamoulis, Abstract—Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urban planning and marketing. Current spatial clustering models disregard information about the people and the time who and when are related to the clustered places. In this paper, we show how the density-based clustering paradigm can be extended to apply on places which are visited by users of a geo-social network. Our model considers spatio-temporal information and the social relationships between users who visit the clustered places. After formally defining the model and the distance measure it relies on, we provide alternatives to our model and the distance measure. We evaluate the effectiveness of our model via a case study on real data; in addition, we design two quantitative measures, called social entropy and community score to evaluate the quality of the discovered clusters. The results show that temporal-geo-social clusters have special properties and cannot be found by applying simple spatial clustering approaches and other alternatives. Index Terms—Clustering, Similarity Measurements, Algorithms. 1 I NTRODUCTION C LUSTERING is commonly used as a method for data explo- ration, characterization, and summarization. Density-based clustering [1], in particular, divides a large collection of points into densely populated regions and it is the most appropriate clustering paradigm for spatial data, which have low dimensionality [2]. Density-based clusters have arbitrary shapes and sizes and exclude objects in areas of low density (i.e., outliers). The DBSCAN model [1] finds the spatial eps-neighborhood of each point p in the dataset, which is a circular region centered at p with radius eps. If the eps-neighborhood of p is dense, meaning that it contains no less than MinPts places, p is called a core point. Dense eps- neighborhoods are put into the same cluster if they contain the cores of each other. In this paper, we investigate the extension of traditional density-based clustering for spatial locations to consider their relationship to a social network of people who visit them and the time when they were visited. In specific, we consider the places of a Geo-Social Network (GeoSN) application, which allows users to capture their geographic locations and share them in the social network, by an operation called checkin. Online social networks with this functionality include Gowalla 1 , Foursquare 2 , and Facebook Places 3 . A checkin is a triplet uid , pid , time modeling the fact that user uid visited place pid at a certain time . We define the new problem of D ensity-based C lustering P laces in G eo-S ocial Networks (DCPGS), to detect geo-social clusters in GeoSNs. DCPGS extends DBSCAN by replacing the Euclidean D. Wu is with the College of Computer Science & Engineering, Shenzhen University, China. E-mail: [email protected] J. Shi is with the Lenovo Big Data Lab, Hong Kong E-mail: [email protected] N. Mamoulis is with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: [email protected] Manuscript received 2016; revised . 1. http://gowalla.com 2. https://foursquare.com 3. https://www.facebook.com/about/location distance threshold eps for the extents of dense regions by a thresh- old , which considers both the spatial and the social distances between places. For two places p i and p j , the spatial distance is considered to be the Euclidean distance between p i and p j , while the social distance should consider the social relationships between the two sets of users U p i and U p j which have checked in p i and p j , respectively. We define and use such a social distance measure, based on the intuition that two places are socially similar if they share many common users in their checkin records or the users in these records are linked by friendship edges. Figure 1(a) illustrates the data of a GeoSN that includes eight users (u 1 u 8 ) and two places (p i and p j ). The dashed lines represent user friendships and the solid lines annotated with timestamps (t 1 t 9 ) illustrate checkins of users at places. For instance, user u 1 is a friend with u 2 and u 3 and has visited place p i at time t 3 and p j at t 5 . As Figure 1(b) shows, each place is modeled by its spatial coordinates (e.g., la i , lo i for the latitude and longitude of p i ) and the set of users that have visited it (e.g., U p i for p i ). For the spatial distance between p i and p j , we can use the Euclidean distance between la i , lo i and la j , lo j , while for the social distance component we use the set of common users in U p i and U p j (e.g., {u 1 ,u 2 }) and the users in one place’s record who are friends with visitors of the other place (e.g., u 3 U p i and u 6 U p j who are friends with each other). The intuition is that users who are friends can influence each other to visit the places included in their checkin history. The details of our distance measure are presented in Section 2. In our previous work [3], the time of the check-ins made by users is not considered during the clustering process. The social connections established between places may be either out-dated or based on distant checkins in terms of time. For instance, the place clusters discovered according to the checkins one year ago may not interest analysts that value recent relationships between places. In addition, a cluster may have low quality if the check- in times of the different places in it vary significantly. In this paper, we investigate three methods of incorporating the temporal information (i.e., when did users checkin at the various places)

Upload: others

Post on 26-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    1

    Density-based Place Clustering UsingGeo-Social Network Data

    Dingming Wu, Jieming Shi, and Nikos Mamoulis,

    Abstract—Spatial clustering deals with the unsupervised grouping of places into clusters and finds important applications in urbanplanning and marketing. Current spatial clustering models disregard information about the people and the time who and when arerelated to the clustered places. In this paper, we show how the density-based clustering paradigm can be extended to apply on placeswhich are visited by users of a geo-social network. Our model considers spatio-temporal information and the social relationshipsbetween users who visit the clustered places. After formally defining the model and the distance measure it relies on, we providealternatives to our model and the distance measure. We evaluate the effectiveness of our model via a case study on real data; inaddition, we design two quantitative measures, called social entropy and community score to evaluate the quality of the discoveredclusters. The results show that temporal-geo-social clusters have special properties and cannot be found by applying simple spatialclustering approaches and other alternatives.

    Index Terms—Clustering, Similarity Measurements, Algorithms.

    F

    1 INTRODUCTION

    C LUSTERING is commonly used as a method for data explo-ration, characterization, and summarization. Density-basedclustering [1], in particular, divides a large collection of points intodensely populated regions and it is the most appropriate clusteringparadigm for spatial data, which have low dimensionality [2].Density-based clusters have arbitrary shapes and sizes and excludeobjects in areas of low density (i.e., outliers). The DBSCAN model[1] finds the spatial eps-neighborhood of each point p in thedataset, which is a circular region centered at p with radius eps.If the eps-neighborhood of p is dense, meaning that it containsno less than MinPts places, p is called a core point. Dense eps-neighborhoods are put into the same cluster if they contain thecores of each other.

    In this paper, we investigate the extension of traditionaldensity-based clustering for spatial locations to consider theirrelationship to a social network of people who visit them and thetime when they were visited. In specific, we consider the placesof a Geo-Social Network (GeoSN) application, which allowsusers to capture their geographic locations and share them inthe social network, by an operation called checkin. Online socialnetworks with this functionality include Gowalla1, Foursquare2,and Facebook Places3. A checkin is a triplet ⟨uid ,pid , time⟩modeling the fact that user uid visited place pid at a certain time .

    We define the new problem of Density-based Clustering Placesin Geo-Social Networks (DCPGS), to detect geo-social clusters inGeoSNs. DCPGS extends DBSCAN by replacing the Euclidean

    ● D. Wu is with the College of Computer Science & Engineering, ShenzhenUniversity, China.E-mail: [email protected]

    ● J. Shi is with the Lenovo Big Data Lab, Hong KongE-mail: [email protected]

    ● N. Mamoulis is with the Department of Computer Science, The Universityof Hong Kong, Hong Kong.E-mail: [email protected]

    Manuscript received 2016; revised .1. http://gowalla.com2. https://foursquare.com3. https://www.facebook.com/about/location

    distance threshold eps for the extents of dense regions by a thresh-old �, which considers both the spatial and the social distancesbetween places. For two places pi and pj , the spatial distanceis considered to be the Euclidean distance between pi and pj ,while the social distance should consider the social relationshipsbetween the two sets of users Upi and Upj which have checked inpi and pj , respectively. We define and use such a social distancemeasure, based on the intuition that two places are socially similarif they share many common users in their checkin records or theusers in these records are linked by friendship edges. Figure 1(a)illustrates the data of a GeoSN that includes eight users (u1–u8) and two places (pi and pj). The dashed lines represent userfriendships and the solid lines annotated with timestamps (t1–t9)illustrate checkins of users at places. For instance, user u1 is afriend with u2 and u3 and has visited place pi at time t3 and pjat t5. As Figure 1(b) shows, each place is modeled by its spatialcoordinates (e.g., ⟨lai, loi⟩ for the latitude and longitude of pi)and the set of users that have visited it (e.g., Upi for pi). Forthe spatial distance between pi and pj , we can use the Euclideandistance between ⟨lai, loi⟩ and ⟨laj , loj⟩, while for the socialdistance component we use the set of common users in Upi andUpj (e.g., {u1, u2}) and the users in one place’s record who arefriends with visitors of the other place (e.g., u3 ∈ Upi and u6 ∈ Upjwho are friends with each other). The intuition is that users whoare friends can influence each other to visit the places includedin their checkin history. The details of our distance measure arepresented in Section 2.

    In our previous work [3], the time of the check-ins made byusers is not considered during the clustering process. The socialconnections established between places may be either out-datedor based on distant checkins in terms of time. For instance, theplace clusters discovered according to the checkins one year agomay not interest analysts that value recent relationships betweenplaces. In addition, a cluster may have low quality if the check-in times of the different places in it vary significantly. In thispaper, we investigate three methods of incorporating the temporalinformation (i.e., when did users checkin at the various places)

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    2

    u4u8

    u7u6

    u5u2u1

    u3

    pjpi

    t3

    t1 t2

    t4 t5t6 t7

    t8t9

    (a) A toy example

    u4u3u2u1

    u6u5u2u1

    u7

    Upi

    u8

    Upj

    pi : pj : (b) Abstraction

    Fig. 1. Example and storage structure of GeoSNs

    in the process of geo-social clustering, i.e., (1) history-frame geo-social clustering, (2) damping window, and (3) temporally con-tributing users, yielding temporal-geo-social clusters. The threemethods of incorporating temporal information result in differentclusters that satisfy different analysis needs. The history-framemethod reveals how clusters evolve over time; its result can beused to analyze relationships between places at different periods oftime. The damping window method weighs the recently checkinshigher than old ones, so it generates clusters relevant to the currenttime. The temporally contributing user method only considersthe checkins that have been made within a period of time, andignores distant checkins. Hence, the places in the same cluster arevisited by socially related users in the same periods of time, whichindicates a stronger relationship between them. We compare thetemporal-geo-social clusters and the geo-social clusters based onvisualization and social entropy evaluation. The result shows thatthe temporal information helps improving the quality of clusters.

    To the best of our knowledge, there is no previous work onclustering GeoSN places. While there exists a significant body ofresearch on analyzing and querying GeoSN data [4], [5], [6], [7],most of these works are centered around users; i.e., they studyuser behavior, user link prediction or recommendation, or theevaluation of user queries. Thus, the places and checkins are onlyregarded as some auxiliary information to facilitate user-centeredanalysis. On the other hand, GeoSNs provide a new and rich formof geographical data, affiliated with the social network graph,the analysis of which can provide new and interesting insights,compared to raw spatial data. In specific, clustering of places in aGeoSN network finds a number of interesting applications:Generalization and characterization of places. In geographicdata analysis, a common task is to define regions (especially inurban areas), which include similar places with respect to thepeople who live in them or visit them. For example, in urbanplanning, land managers are interested in identifying regionswhich have uniform (i.e., consistent) demographic statistics, e.g.,areas where elderly people prefer to visit, or people who belongto certain religious communities and have special transportationor living needs. Our DCPGS framework is especially useful forsuch spatial generalization and characterization tasks, because itcan identify geographic regions where places form dense regionsand the people who visit them are also socially connected to eachother. By considering also the temporal information in the data(i.e., when did users checkin at the various places), the discoveredclusters can be further refined and can become valuable for urbanactivity analysis, local authorities, service providers, decisionmakes, etc. For example, a certain set of places (e.g., shoppingspots) may be characterized as a cluster for only restricted timeperiods or intervals (e.g., during Saturday morning hours).Data cleaning. Intuitively, places that belong in the same geo-

    social cluster, according to our DCPGS framework, should havesimilar semantics. Therefore, our clustering results can help to-ward the cleaning of semantics (e.g., tags), which are givento places being in the same cluster (e.g., inspect tags that areinconsistent with the ones given to the majority of places inthe cluster). In addition, as already mentioned, nearby GeoSNlocations collected by user checkins could belong to the samephysical place (e.g., a large restaurant) and our clusters can helptoward identifying such cases and integrating multiple locations tothe same physical place (i.e., region), possibly with the joint helpof map-matching tools. Besides, taking the temporal dimensioninto account can also help to provide more accurate cleaning.Marketing. GeoSN places may be commercial (e.g., restaurants).The fact that two (or more) such places belong to the samegeo-social cluster indicates that there is a high likelihood that auser who likes one place would also be interested to visit theother(s). Therefore, by having knowledge of a place’s geo-socialcluster, the management of the place may initiate campaigns tousers who visited other places in the same cluster, or a set ofplaces could do collaborative promotion (e.g., a discount for userswho visit multiple places in the cluster). In addition, the user-groups that are relevant to a cluster could be relative to certaintime periods. For example, shopping places in downtown arevisited during the evening by people who have to work andcould not shop at daytime, while supermarkets and small shopsin the suburbs are usually visited by housewives in the daytime.Such geo-social-temporal clusters can be useful to marketing oradvertising companies, which may benefit from understanding the(time sensitive) shopping habits of various social groups.

    Compared to conventional density-based spatial clusters, geo-social clusters detected by DCPGS exhibit larger intra-cluster so-cial strength, as we confirm experimentally in Section 3. DCPGSalso has additional advantages. First, DCPGS uses geo-socialsplitting criteria; for example, DCPGS splits clusters, which arespatially dense, but they are separated by barriers, such as rivers orwalls, or visitors’ weak social connections. Second, DCPGS findsspatially loose clusters that include sets of places that (pairwise)are not very close to each other (and therefore violate a typicallytight spatial density threshold), but have very tight social relation-ships with each other. Such places satisfy our DCPGS criteria.On the other hand, DBSCAN is less flexible in including them inthe same cluster as loosening its spatial distance threshold wouldresult in putting everything in a single huge cluster. Third, DCPGScan discover geo-social clusters with fuzzy spatial boundaries;such clusters cannot be identified by spatial clustering, whichdefines strict spatial boundaries between clusters. Our evaluationis based on case studies and on the use of quantitative measuresthat we also propose in this paper. In addition, we demonstratethat the social distance measure we propose and use in DCPGSis more effective compared to alternative measures (based onnode-to-node proximity). Overall, the results of our evaluationindicate that the social relationships between users who visitplaces have great impact in the clustering of places and cannot beoverlooked. Furthermore, considering the temporal information inthe clustering not only reveals how the clusters evolve with time,but also helps finding groups of places having more coherent socialrelationships.

    Summing up, the contributions of this paper are as follows:

    ● We propose and formulate the problem of density-basedclustering GeoSN places.

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    3

    ● We define a simple but effective social distance measurebetween places in GeoSNs.

    ● We demonstrate the effectiveness of DCPGS by case studiesand quantitative evaluation through two quality measures thatare also devised in this paper.

    ● We study three ways of incorporating temporal informationin DCPGS and evaluate the resulting temporal-geo-socialclusters.

    The rest of the paper is organized as follows. Section 2 for-mulates the DCPGS problem, defines the social distance measurebetween places that we use, and introduces three methods of in-corporating temporal information into DCPGS. The effectivenessof our framework are analyzed in Sections 3. Related work isreviewed in Section 4. Finally, Section 5 concludes this paper anddiscusses future work. The efficient algorithms in our previouswork [3] for clustering GeoSN places and the performance evalu-ations are presented in Appendices B and C.

    2 MODEL AND DEFINITIONSOur data input includes three components: a social network, aset of places and the checkins of users to the places. The socialnetwork is an undirected graph G = (U,E), where U is the set ofall users and each edge (ui, uj) ∈ E indicates that users ui, uj ∈U are friends. Set P is the set of all places visited by users, inthe form of ⟨latitude, longitude⟩ GPS points. Thus, identifiers areassigned to places according to their distinct GPS coordinates. SetCK = {⟨ui, pk, tr⟩∣ui ∈ U and pk ∈ P} includes all checkinsgenerated by users in U . A checkin in CK is a triplet ⟨ui, pk, tr⟩modelling the fact that user ui visited place pk at a certain tr .For a place pk, the set Upk of visiting users of pk is defined byUpk = {ui∣⟨ui, pk,∗⟩ ∈ CK}, where ∗ means any time. Figure1(b) shows Upi and Upj for the two places pi and pj of the toyexample in Figure 1(a). The figure also connects the user pairsin the two sets who are linked by friendship edges in the socialnetwork. Note that user u8 does not belong to either Upi or Upj ,but connects users u4 and u7 in the social graph.

    2.1 DCPGS Model

    Our Density-based Clustering Places in Geo-Social Networks(DCPGS) model extends the model of DBSCAN [1]; foreach place pi in the GeoSN, DCPGS finds the geo-social �-neighborhood N�(pi) of pi, which includes all places pjsuch that Dgs(pi, pj) ≤ �, DS(pi, pj) ≤ τ , and E(pi, pj) ≤maxD . For two places pi, pj , E(pi, pj) is the Euclidean dis-tance, DS(pi, pj) is the social distance, and Dgs(pi, pj) =f(DS(pi, pj),E(pi, pj)) is the geo-social distance, defined asa function of E(pi, pj) and DS(pi, pj). Parameter � is geo-social distance threshold, while τ and maxD are two sanityconstraints for the social and the spatial distances between places,respectively. We will give detailed definitions for all above dis-tance functions and parameters later on. If the geo-social �-neighborhood of a place pi contains at least MinPts places, thenpi is a core place. In DCPGS, w.r.t. � and MinPts ; a place pi isdirectly density-reachable from a place pj if pi ∈ N�(pj) andN�(pj) ≥ MinPts ; a place pi is density reachable from a placepj if there is a chain of places p1,⋯, pn, p1 = pi, pn = pj suchthat pk−1 is directly density-reachable from pk (1 < k ≤ n); aplace pi is density connected to a place pj if there is a place pk,such that both pi and pj are density-reachable from pk. A cluster

    C in DCPGS w.r.t. � and MinPts is a non-empty subset of placessatisfying the following conditions: 1) ∀pi, pj : if pi ∈ C and pjis density-reachable from pi then pj ∈ C. 2) ∀pi, pj ∈ C: pi isdensity-connected to pj . Outliers are the places that do not belongto any cluster.Parameters. � and MinPts are the main parameters of DCPGS.MinPts (i.e., the minimum number of places in the neighborhoodof a core point) is set as in the original DBSCAN model (see [1]);a typical value is 5. � takes a value between 0 and 1, because, as weexplain later on, we defineDgs(pi, pj) to take values in this range.Since the geo-social distance Dgs(pi, pj) is a function of a spatialand a social distance, τ and maxD constrain these individualdistances to avoid the following two cases that negatively affectthe quality of geo-social clusters.

    ● The geo-social distance between two places pi and pj couldbe less than � if they are extremely close to each other inspace, but have no social connection at all. This may leadto putting places close to each other spatially, but having nosocial relationship, into the same cluster.

    ● The geo-social distance between two places pi and pj couldbe less than � if they have very small social distance, but theyare extremely far from each other spatially. This may leadto putting places with close social distances, but large spatialdistances, into the same cluster.

    Constraints τ and maxD are defined for quality control and canbe set by experts or according to the analyst’s experience. Weexperimentally study how clustering quality is affected by the twoconstraints and � in Section 3.Distance Functions. The social distance DS(pi, pj) takes asinputs the sets of users Upi and Upj who have visited pi and pj ,respectively, and returns a value between 0 and 1. In Section 2.2,we present our definition for DS(pi, pj) and alternative waysto define it based on previous work. Before defining the geo-social distance Dgs(pi, pj), we convert the Euclidean distanceE(pi, pj) into a spatial distance DP (pi, pj) =

    E(pi,pj)maxD so that

    any place pj in the geo-social �-neighborhood of pi has spatialdistance no larger than 1. Finally, Dgs(pi, pj) is defined asweighted sum of DS(pi, pj) and DP (pi, pj), i.e.,

    Dgs(pi, pj) = ω ⋅DP (pi, pj) + (1 − ω) ⋅DS(pi, pj), (1)

    where ω ∈ [0,1].

    2.1.1 Alternatives to DCPGSUsing the proposed geo-social distance, our place clustering model(DCPGS) extends density-based clustering in spatial databases.We next present two alternatives to DCPGS, which can be alsoused to cluster GeoSN places.SNN-based Clustering. The SNN clustering algorithm [8] isan improvement of DBSCAN that can find clusters of widelydiffering shapes, sizes, and densities. It first finds the k nearestneighbors of each data point and then redefines the similaritybetween pairs of points in terms of how many nearest neighborsthe two points share. Using this definition of similarity, the SNNalgorithm identifies core points and then builds clusters around thecore points as DBSCAN does.

    We extend SNN to cluster places in the geo-social network asfollows. Firstly, the neighborhood of a place p in SNN is defined asNN (p) = {q ∈ P ∧ q ≠ p∣E(p, q) ≤ maxD and DS(p, q) ≤ τ}.Then, according to SNN, the similarity between two places piand pj is defined as the cardinality of the common places in

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    4

    their neighborhood, i.e., Ssnn(pi, pj) = ∣NN (pi) ∩ NN (pj)∣.Given a user specified similarity parameter epssnn , the SNN geo-social neighborhood of a place pi is defined as Nsnn(pi) ={pj ∈ P ∣Ssnn(pi, pj) ≥ epssnn}. Place pi is a core place if∣Nsnn(pi)∣ ≥MinPtssnn , where MinPtssnn is the user specifieddensity threshold. If two core places are within each other’s SNNgeo-social neighborhood, then they are put into the same cluster.Graph-based Clustering. GeoSN places can alternatively beclustered by the use of graph clustering models. The main idea ofsuch a model is to construct a place network PN , which connectsplaces according to their social and spatial distances and thenapply an off-the-shelf community detection algorithm on PN .Specifically, given two places pi and pj , if E(pi, pj) ≤ maxDand DS(pi, pj) ≤ τ , an undirected and weighted edge with geo-social weight Wgs(pi, pj) = 1−Dgs(pi, pj) is added between piand pj . Community detection algorithms like Link Clustering [9],[10] or Metis [11] can then be applied to derive the clusters.Link Clustering constructs a dendrogram of network communities(that may overlap) in a hierarchical manner. Metis is anothermultilevel graph partition paradigm that includes three phases:graph coarsening, initial partitioning, and uncoarsening; Metisdivides a network into k non-overlapping communities. As wewill show in Section 3, these graph clustering methods are inferiorto DCPGS.

    2.2 Social Distance Between PlacesThe social distance DS(pi, pj) between pi and pj naturallydepends on the social network relationships between the sets Upiand Upj of users who visited pi and pj , respectively. DS(pi, pj)is based on the set CU ij of contributing users between two placespi and pj :

    Definition 1. (Contributing Users) Given two places pi and pjwith visiting users Upi and Upj , respectively, the set of contribut-ing users CU ij for the place pair (pi, pj) is defined as

    CU ij = {ua ∈ Upi ∣ua ∈ Upj or ∃ub ∈ Upj , (ua, ub) ∈ E}∪ {ua ∈ Upj ∣ua ∈ Upi or ∃ub ∈ Upi , (ua, ub) ∈ E} (2)

    Specifically, if a user ua has visited both pi and pj , then ua isa contributing user. Also if ua has visited place pi, ub has visitedpj , and ua and ub are friends, both ua and ub are contributingusers. Users in CU ij contribute positively (negatively) to thesocial similarity (distance) between pi and pj . Formally:

    Definition 2. (Social Distance) Given two places pi and pjwith visiting users Upi and Upj , respectively, the social distancebetween pi and pj is defined as

    DS(pi, pj) = 1 −∣CU ij ∣

    ∣Upi ∪Upj ∣(3)

    The above definition of DS(pi, pj) takes both the set sim-ilarity between sets Upi and Upj and the social relationshipsamong users in Upi and Upj into account. In addition, the distancemeasure penalizes pairs of places pi and pj which are popular(i.e., Upi and/or Upj are large) but their set of contributingusers is relatively small (see Equation 3). The reason is thatsuch place pairs are not characteristic to their (loose) socialconnections. As an example, consider places pi and pj of Figure1. To compute DS(pi, pj), we first set Upi = {u1, u2, u3, u4}and Upj = {u1, u2, u5, u6, u7}. All users in Upi and Upj arechecked one by one to obtain the contributing users between pi

    and pj . We derive CU ij = {u1, u2, u3, u5, u6}, since (i) bothu1 and u2 have visited pi and pj , (ii) user u3, who visited pi,has a friend u6 who visited pj , (iii) symmetrically, user u6, whovisited pj , has a friend u3 who visited pi, and (iv) u5 (∈ Upj )has a friend u2 having been to pi. According to Definition 2,the social distance DS(pi, pj) between pi and pj in Figure 1 is1 − ∣CU ij ∣/(∣Upi ∪Upj ∣) = 1 − 5/7 ≈ 0.2857.

    Observe that only direct friendship edges between users of Upiand Upj are considered in our social distance definition. Longernetwork paths, such as friend-of-friend relationships, are ignored(e.g., the case of users u4 and u7 in Figure 1(b) who are connectedvia user u8). According to the small world effect [12], a user ina social network can reach a large portion of other users withinonly few hops. For instance, the 90-percentile effective diameter[13] of Gowalla GeoSN, used in our experiments, is just 5.7.4 Thismeans within quite a few hops most users can reach a very largepercentage of all users. In Gowalla, only 8 users can access morethan 1% of all the users in 1 hop, while 40516 users (20.61%of all the users) can access more than 1% of all the users in 2hops and 141582 users (72.02% of all the users) can reach morethan 1% of all the users in 3 hops. The number of users who canreach more than 1% users in 2 or 3 hops increases dramaticallycompared to the percentage of those visited within 1 hop. Thus,paths longer than 1 hops are too common and cannot be consideredas (indirect) user relationships; i.e., their impact is much weakercompared to direct friendship edges. Hence, Definition 2 intro-duces a simple, but powerful social distance measure. Propertiesof DS(pi, pj) include symmetry (i.e., DS(pi, pj) = DS(pj , pi))and self-minimality (i.e., DS(pi, pj) ∈ [0,1], and DS(pi, pj) = 0for Upi = Upj ). On the other hand, DS(pi, pj) does not obeythe triangular inequality, but this does not affect our clusteringalgorithm.

    2.2.1 Alternatives to DS

    Our DCPGS model is independent of the social distance definitionbetween places (i.e., DS). As an alternative to our Definition 2,the following measures can be used. In Section 3, we evaluate theeffectiveness of these alternatives.Jaccard. Based on the Jaccard similarity J(pi, pj) = (∣Upi ∩Upj ∣)/(∣Upi ∪ Upj ∣) between the sets of visiting users for pi andpj , we define DJacS (pi, pj) = 1−J(pi, pj), which is not intuitive;it disregards the social network, assuming that two users who arefriends do not affect each other in visiting GeoSN places.SimRank. SimRank is a structural-context model for measuringthe similarity between nodes in a graph. The idea is that twonodes are equivalent if they relate to equivalent nodes. We candefine a SimRank-based social distance DsimS (pi, pj), using theMinimax version of SimRank [14].5 This measure compares eachof pi’s visiting users u

    pir with the visiting user u

    pjs of pj who is

    the most similar to upir , to compute the similarity between placess(pi, pj). The similarity between users s(upir , u

    pjs ) is computed

    in an analogous way. Specifically, DsimS (pi, pj) = 1 − s(pi, pj),where s(pi, pj) =min(spi(pi, pj), spj(pi, pj)),spi(pi, pj) =

    φ∣Upi ∣

    ∑ur∈Upi

    maxus∈Upj

    s(ur, us), where φ = 0.8 is a

    decay factor [14], s(ur, us) = min(sur(ur, us), sus(ur, us)),

    4. http://snap.stanford.edu/data/index.html5. The original SimRank measure is only meant for node-to-node similarity;

    in our case, we need a measure between Upi and Upj .

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    5

    and assuming that Pur is the set of places visited by ur,sur(ur, us) =

    φ∣Pur ∣

    ∑pi∈Pur

    maxpj∈Pus

    s(pi, pj).

    Katz. The Katz similarity measure [15] sums over all possiblepaths from user ur to us with exponential damping by length, i.e.,

    K(ur, us) =∞∑l=1βl∣paths lur,us ∣, where paths

    lur,us

    is the set of all

    length-l paths from ur to us, and damping factor β is typicallyset to 0.05. Due to the poor scalability of this measurement, inpractice only paths up to length L are considered [16]; i.e., an

    approximated Katz score Ka(ur, us) =L

    ∑l=1βl∣paths lur,us ∣ can be

    used. Accordingly, we can define a Katz-based social distancebetween places pi and pj , by averaging the normalized Katzsimilarities between all pairs of users from Upi and Upj :

    DKatzS (pi, pj) = 1 −1

    ∣Upi ∣∣Upj ∣∑

    ur∈Upi

    ∑us∈Upj

    Ka(ur, us)

    CommuteTime. The hitting time h(ur, us) from ur to us is theexpected number of steps required for a random walk starting at urto reach us. The commute time between ur and us is defined byct(ur, us) = h(ur, us) + h(us, ur). However, the commute timeis sensitive to long paths and favors nodes of high degree. Thus,the truncated commute time [17], which considers only paths oflength no longer than L, can be used to model the social distancebetween a pair of users. Finally, we can define a commute timebased social distance between places pi and pj as follows:

    DctS (pi, pj) =1

    ∣Upi ∣∣Upj ∣∑

    ur∈Upi

    ∑us∈Upj

    ctL(ur, us)

    where ctL(ur, us) is the normalized truncated commute time.

    2.3 Incorporating Temporal InformationA checkin in GeoSNs is a triplet ⟨u, p, time⟩ modeling the factthat user u visited place with point location p = ⟨x, y⟩ at acertain time . The geo-social clusters found by the DCPGS model(presented in the previous section) compute the social distancebetween places based on the social network relationships betweenthe visiting user sets of the places, while the temporal informationis disregarded. Taking the advantage of the temporal informationin the checkin data may help improving the quality of discoveredgeo-social clusters. In this section, we investigate how the tem-poral information affects the clustering result. We investigate thediscovery of temporal-geo-social clusters in GeoSNs, which arespatio-temporal regions visited by groups of socially connectedusers. In order to compute such temporal-geo-social clusters,we extend the definition of social distance between places to atemporal-social distance DTS . Using the temporal-social distanceDTS , the DCPGS model can then replace the geo-social distanceDgs by a newly defined temporal-geo-social distance as follows:

    Dtgs = ω ⋅DP (pi, pj) + (1 − ω) ⋅DTS(pi, pj) (4)

    An intuitive definition of the temporal-social distance DTS wouldbe to consider a pair of places temporal-socially close if theyshare many socially connected visiting users that have checkedin the places within a small time period. On the other hand, twoplaces are temporal-socially far from each other if they do nothave socially connected visitors within a short time interval. Thetemporal dimension captures the evolution of place visits, andthus reflects the changes of the social distance between places.Based on the above, we suggest that the following three possibledefinitions of DTS should be investigated.

    2.3.1 History-Frame Geo-Social ClusteringThis method performs geo-social clustering for each time periodseparately. The time period can be either continuous or peri-odic (i.e., calendric). For example, we can generate a differentclustering of places for each month, by only using the checkindata recorded in that month. We can also generate clusterings forworking days and weekends. The clustering results in consecutivecontinuous periods can be used to study the evolution of clustersover time. It is also possible to track which place enters or leavesa cluster at a particular month and which parts of the clustersare time-insensitive. Similarly, calendric clusters of different timeperiods can be compared; for instance, some clusters may onlyappear in holiday periods.

    2.3.2 Damping WindowRecall that the social distance (Equation 3) between two places iscalculated based on the users who checked in the two places. Thesocial distance between two places is small if their visiting usersare socially connected well. In this method, given a place p, eachuser u who has visited p is assigned an exponential decay factoruw(p) = exp−x, where x = (tc − t(u, p))/T , tc is the currenttime, t(u, p) is the last time when user u checked in place p, andT is the time range of the data (i.e., tc minus the earliest timeof the data), which is used for normalization purpose. The userswho made checkins recently are weighed high. This method favorsplace pairs to which the socially connected users have paid recentvisits. The temporal-social distance is defined asDTS(pi, pj) = 1 − ( ∑

    u∈CUij∩Upi∩Upj

    max{uw(pi), uw(pj)}

    + ∑u∈CUij∩(Upi∖Upj )

    uw(pi)

    + ∑u∈CUij∩(Upj∖Upi)

    uw(pj))/∣Upi ∪Upj ∣

    As an example, consider the social network and places pi andpj in Figure 1. Table 1 shows the visiting users’ check-in time. Letthe current time be 10 and the time range of the data be T = 10.Table 2 shows the exponential decay factors of the visiting users.For example, the exponential decay factor of user u4 for placepi is computed as uw4 (pi) = exp−(10−3)/10 = 0.5. According toDefinition 1, CUij = {u1, u2, u3, u5, u6}. Hence, the temporal-social distance between places pi and pj is DTS(pi, pj) = 1 −(0.5 + 0.82 + 0.61 + 0.67 + 0.55)/7 = 0.55.

    TABLE 1Checkin Time

    Checkin Time u1 u2 u3 u4 u5 u6 u7pi 2 8 5 3 - - -pj 3 7 - - 6 4 9

    TABLE 2Exponential Decay Factor

    exp−x u1 u2 u3 u4 u5 u6 u7pi 0.45 0.82 0.61 0.50 - - -pj 0.50 0.74 - - 0.67 0.55 0.90

    2.3.3 Temporally Contributing UsersIn this method, we consider temporally contributing users that aresocially connected users who checked in places pi and pj withina time interval θ.

    Definition 3. (Temporally Contributing Users) Let E be the edgeset of the social network. Given two places pi and pj with visiting

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    6

    user sets Upi and Upj , the contributing user set CUij is definedin Definition 1. The temporally contributing user set is a subsetof the contributing user set, i.e., TCUij ⊆ CUij . Each user inthe temporally contributing user set satisfies one of the followingconditions ∀u ∈ TCUij

    1) ∣t(u, pi) − t(u, pj)∣ ≤ θ2) ∃u′ ∈ TCUij ((u,u′) ∈ E ∧ ∣t(u, pi) − t(u′, pj)∣ ≤ θ)3) ∃u′ ∈ TCUij ((u,u′) ∈ E ∧ ∣t(u′, pi) − t(u, pj)∣ ≤ θ)

    The temporal-social distance between places pi and pj is definedas DTS(pi, pj) = 1 − ∣TCUij ∣/∣Upi ∪Upj ∣.

    3 QUALITATIVE ANALYSISThis section analyzes the quality of the geo-social clusters and thetemporal-geo-social clusters discovered by our proposed frame-work. Firstly, we compare with two extreme versions of DCPGS:PureSocialDistance applies density-based clustering by using thesocial distance DS(pi, pj) only, while DBSCAN uses only theEuclidean distance E(pi, pj). This comparison shows the appro-priateness of using both social and spatial distances in clustering.In the implementation of PureSocialDistance, we do not put placepairs with spatial distance more than 1000 meters in the samecluster; otherwise this method becomes too expensive. Secondly,we compare DCPGS with the graph clustering and SNN-basedapproaches discussed in Section 2.1.1, in order to demonstratethe suitability of the density-based clustering model for thisapplication. Thirdly, we assess the suitability of our social distancemeasure (Section 2.2) by evaluating versions of DCPGS, whichuse the alternative social distance definitions discussed in Section2.2.1. Finally, we analyze the temporal effects on the geo-socialclusters found by DCPGS. All tested methods were implementedin C++ and the experiments were performed on a 3.4 GHz quad-core machine running Ubuntu 12.04 with 16 GBytes memory.Data. We use two publicly available datasets6 from historicalgeo-social networks. Gowalla contains a social network with∣U ∣ =196,591 users and ∣E∣ =950,327 undirected friendship edges.There are ∣CK ∣ =6,442,892 checkins performed by those userson ∣P ∣ =1,280,969 places over a period from Feb. 2009 toOct. 2010. Brightkite includes a social network of ∣U ∣ =58,228users and ∣E∣ =214,078 undirected friendship edges. It contains∣CK ∣ =4,491,143 checkins on ∣P ∣ =772,783 distinct places col-lected over the period from Apr. 2008 to Oct. 2010.Default Parameter Settings. The density requirement of the clus-tering is determined by parameters MinPts and � (or DBSCAN’seps). We set MinPts = 5 for all approaches; various densitysettings can be achieved by just tuning � (or DBSCAN’s eps).For instance, a large MinPts has similar effect as a small �(or DBSCAN’s eps). By default, parameter ω in the geo-socialdistance is set to 0.5 to equally weigh the social and spatialdistances. By default, parameter τ is set to 0.7, and maxD isset to 100 meters for dataset Gowalla and 120 meters for datasetBrightkite. In the rest of the paper, the length values of the spatialdistance are in the unit of meter, and thus we omit “meter” whencontext is clear.

    3.1 Visualization-based AnalysisWe first visualize and compare the clusters found by DCPGS andalternative approaches in the area of Manhattan on the Gowalla

    6. downloaded from snap.stanford.edu/data/index.html

    dataset and in the area of Chicago on the Brightkite dataset.Figures 2(a)-(c) show the clusters by DCPGS, DBSCAN (whichdisregards the social network behind the places) and PureSocialD-istance (which disregards the spatial information). DCPGS findsgeo-social clusters with the following features.

    Geo-Social Splitting/Merging Criteria. Geo-social clusters thatare very close to each other are split correctly by DCPGS, whileDBSCAN may consider them as a single cluster due to theirspatial closeness; in other cases, clusters split by DBSCAN due torelatively low spatial density between them are merged by DCPGSbecause of their strong social ties. For example, consider regionA in Figures 2(a) and the corresponding region A′ in Figure 2(b),where DCPGS and DBSCAN detect clusters with totally differentlayouts. By tuning the parameters of DBSCAN, we are not ableto find the clusters found by DCPGS, because the densities ofthe two clusters in region A are similar and the two clusters areclose to each other. Thus, DBSCAN can only consider the placesin region A′ as either a single cluster or as several fragmentedclusters (Figure 2(b)), under different parameter settings. In certaincases, spatially dense clusters may be split by DCPGS because ofsome natural barriers, such as rivers, and walls. These barriersmake it inconvenient to travel from one side to the other, resultingin a splitting effect. As an example, in Figure 3, a cluster (regionD) found by DBSCAN is split into two DCPGS clusters (regionsD1 and D2) by the river, since the users on different river sidesare proved to have weak social connection. While it is possible forDBSCAN to find the two DCPGS clusters by reducing the value ofeps , its parameter settings in this case make some existing clustersdisappear, resulting in too many outliers.

    Spatially Loose Clusters. Some geo-social clusters detected byDCPGS in region B of Figure 2(a) are considered as outliers byDBSCAN in the corresponding region B′ of Figure 2(b). RegionB′ is spatially too sparse to satisfy the density requirement ofDBSCAN, and thus most places inside it are filtered out as out-liers. However, the users who checked in those places have strongsocial relationships. Hence, geo-social clusters are discovered inregion B by DCPGS in Figure 2(a). While it is possible forDBSCAN to discover such spatially loose clusters by reducingthe density parameters, this would result in merging too manyclusters together, making denser clusters indistinguishable.

    Fuzzy Boundary Clusters. Some DCPGS geo-social clustershave fuzzy boundaries with each other, which is reasonable inthe real world, since groups of socially connected users mayspatially overlap. On the other hand, DBSCAN produces clusterswith strict boundaries. For instance, in Figure 2(a), there is nostrict boundary between the two clusters enclosed in region C.Although PureSocialDistance, which is the other extreme method,also produces clusters with fuzzy boundaries (see Figure 2(c)), theclusters are spatially indistinguishable and they are not interesting,i.e., for the applications mentioned in the Introduction.Alternatives to DCPGS. We visually analyzed the results ofthe alternatives to DCPGS, i.e., SNN-based clustering modeland graph-based clustering models (LinkClustering and Metis),described in Section 2.1.1. LinkClustering and Metis producesimilar results; indicatively, we show the clusters produced byLinkClustering in Figure 2(d). LinkClustering produces thousandsof small clusters (average size around 3), which are typically notwell-separated spatially. Due to the sparsity of geo-social networkdata, the constructed place network contains many connectedcomponents that are disconnected with each other (e.g., the place

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    7

    B

    A

    C

    (a) DCPGS: � = 0.4, τ = 0.7,maxD = 100

    A′

    B′

    (b) DBSCAN: eps = 40 (c) PureSocialDistance: � = 0.2, τ =1, maxD = 1000

    (d) LinkClustering: τ = 0.7,maxD = 100

    A

    C

    B

    (e) Jaccard: � = 0.4, τ = 0.7,maxD = 100

    A

    C

    B

    (f) SimRank: � = 0.3, τ = 0.7,maxD = 100

    ●●

    ●●

    ●● ●●

    ●●

    ●●●●●●●

    ●●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●●

    ●●●

    ●●●●●●

    ●●

    ●●●●●●●

    ●●●

    ●●●●●

    ●●

    ●●

    ●●●

    ●●● ●●●

    ●●●● ●●

    ●●●● ●●●

    ●●●●●●●

    ● ●●●

    ●●

    ●●

    ●●

    ●●●●●●●●

    ●●

    ●●

    ●●

    ●● ●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●●●

    ●●●●●

    ● ●●

    ●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    (g) SNN: epssnn = 1, maxD = 60

    ●●

    ●●●

    ●●●●●●

    ●●

    ● ●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●●

    ●●●●●●●

    ●●●●●●●●

    ●●●●●●

    ●●●●●●●●●●●●●

    ●●●●●●●●

    ●●●●

    ●●

    ●●

    ●●●●●

    ●●●

    ●●●●●

    ●●

    ●●

    ●●●●●●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    10

    20

    30

    40

    50

    d$cluster

    (h) SNN: epssnn = 2, maxD = 60

    ●●

    ●●

    ●●

    ●●●●●●

    ●●

    ●●● ●

    ●●

    ●●

    ●●●●●●

    ●●●●●

    ●●

    ●●

    ●●●

    ●●● ●

    ●●●●●

    ● ●●●

    ●●● ●●●

    ●●

    ●●

    ●●●●●●●●

    ●●

    ●●

    ●●●

    ●●●

    ●●●●●●●●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    10

    20

    30

    40

    d$cluster

    (i) SNN: epssnn = 3, maxD = 60

    ● ●

    ●●

    ●●

    ●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    2.5

    5.0

    7.5

    10.0

    d$cluster

    (j) SNN: epssnn = 1, maxD = 120

    ●●

    ●●●

    ●●●

    ●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    5

    10

    15

    20

    d$cluster

    (k) SNN: epssnn = 3, maxD = 120

    ●●

    ●●●●

    ●●

    ●●

    ●●●●

    ●●

    ●●●●●●

    ●●

    ●●●●

    ● ●● ●

    ●●●

    ●●●●●

    ●●

    ●●

    ●●●●●●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97

    lon

    lat

    10

    20

    30d$cluster

    d$names● 0

    1

    2

    3

    4

    5

    (l) SNN: epssnn = 5, maxD = 120

    Fig. 2. Place clusters of Gowalla found in Manhattan

    network built when τ = 0.7, maxD = 100, and ω = 0.5 contains34,496 connected components with 4.3 nodes and 8.2 edges onaverage). The clusters found by Metis are fewer and larger, butalso spatially indistinguishable. Metis ignores outliers; as a result,places belong to the same cluster may have low spatial proximityand social similarity.

    Figures 2(g)–2(l) show the clusters found by the SNN-basedclustering model. Since DCPGS and SNN-based model only differin the way of measuring the closeness/similarity between places,by carefully tuning the parameters of SNN-based model, we try

    to figure out whether it can discover similar clusters to those ofDCPGS. We observe that when maxD = 60 and epssnn = 1(Figure 2(g)), the SNN-based model indeed finds similar clustersin the areas highlighted by rectangles. As epssnn increases (Fig-ures 2(h) and 2(i)), the sizes of clusters decrease. This is becauselarge epssnn puts strict constraints on the places to be a memberof a cluster. When maxD = 120 and epssnn = 1 (Figure 2(j)),a big cluster is discovered by the SNN-based model. As epssnnincreases (Figures 2(k) and 2(l)), several clusters are identifiedfrom this big cluster. Although our geo-social distance can also be

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    8

    !

    (a) DBSCAN: eps = 60

    D2

    D1

    (b) DCPGS: � = 0.4 s.t. τ = 0.7,maxD = 120

    Fig. 3. Clusters of Brightkite found by DBSCAN and DCPGS in Chicago

    used in an extension of DBSCAN (DCPGS) or in an extension ofSNN for geo-social clustering, In most cases, DCPGS discoversclusters with better social entropy than SNN-based model does(Figure 5).Alternatives to DS . Finally, we analyzed the results of DCPGS,if our DS definition (Definition 2) is replaced by the alternativesdescribed in Section 2.2.1. For DKatzS and D

    ctS , we set L = 3;

    for bigger L values, these measures become extremely expensive.We observed that DJacS , D

    KatzS , and D

    ctS produce similar results

    to each other. Indicatively, Figure 2(e) shows the clusters foundby DCPGS if DJacS is used instead of DS . All these measuresproduce small clusters and too many outliers since they givelarge distance values for most pairs of places pi and pj . The setof common users for two places in Jaccard (i.e., Upi ∩ Upj ) isexpected to be small and the decay factor of Katz dampens theeffect of long connections between Upi and Upj . The expectedCommuteTime distance between places is also high due to theeffect of normalization.

    On the other hand, DsimS produces clusters of slightly largersizes compared to DS . We observed that the probability dis-tribution of DsimS is skewed towards low values, meaning thatmany pairs of places have low bipartite minimax SimRank socialdistance, because SimRank is based on the most similar pair ofvisiting users. The clustering results of SimRank (Figure 2(f)) andDCPGS are visually similar; it is hard to tell which results arebetter based on visualization.

    3.2 Social Quality EvaluationIn this section, we design and use two measures for assessingthe social coherence between places in the discovered clusters.Based on these measures, we assess the quality of DCPGS and thealternative approaches for clustering GeoSN places.

    3.2.1 Social Entropy based EvaluationThe first measure, called social entropy, measures the socialquality of the clusters based on the network communities that theGeoSN users form. Given a social network G = (U,E), we firstpartition all the users in U into several disjoint network commu-nities. Let PC be a cluster of GeoSN places and let UPC be theset of users who visit the places in PC , i.e., UPC = ∪p∈PCUp.According to the detected network communities, the visitingusers of PC , UPC , can be divided into several disjoint sets inCPC = {C1,C2, . . . ,Cm}, such that each set Ci is a subset ofusers in UPC belonging to the same network community. We callCPC the community set of PC .

    Definition 4. (Social Entropy) Given a cluster PC , let UPC bethe set of users who visit the places in PC , i.e., UPC = ∪p∈PCUp.The social entropy of PC is then defined as:

    E = ∑Ci∈CPC

    − ∣Ci∣∣UPC ∣

    log∣Ci∣

    ∣UPC ∣

    The social entropy, analogous to the entropy used in decisiontree induction, measures the impurity of a cluster PC with respectto the participation of its users into different communities. A lowsocial entropy means that the great majority of the visitors of PCcome from the same community (i.e., low impurity), indicatingthat the places in PC have tighter social relationships betweeneach other, which is favored.

    We applied the METIS community detection algorithm [11]to divide the set of users in the GeoSN into k non-overlappingcommunities7, providing a baseline for social entropy evaluation.To avoid a comparison that is biased to parameter k, we evaluatethe social entropy of clustering results obtained by DCPGS and thecompetitors for two different values of k. One value of k is chosenbased on the following rule of the thumb [18], i.e., k =

    √N/2

    where N is the number of users in the social network. The othervalue of k is decided by Dunbar’s number that suggests humanscan only comfortably maintain 150 stable relationships and themean community size is around 150. For dataset Gowalla, thesevalues are k = 313 and k = 1310, respectively.

    In Figure 4, we use social entropy to test our method DCPGSwith alternative social distance definitions, which are denotedby the name of the social distance measure used in each case.Figures 4(a) and 4(b) show the average social entropy for DCPGS,versions of DCPGS with alternative DS (SimRank, Commute-Time, Katz, and Jaccard) and PureSocialDistance when varying �.CommuteTime, and Katz have the lowest social entropy; however,these methods produce small clusters and have too many outliersas explained in Section 3.1. Within each small cluster, the placesare only visited by few people and this explains the low entropy.Jaccard also has low social entropy for the same reason. PureSo-cialDistance has low social entropy in most cases; this indicatesthat our proposed social distance between places is effectivein putting places with close social relationships together. When� = 0.1, the social entropies of DCPGS and PureSocialDistanceare similar, both good, since only those places with very closesocial distances are clustered. When � = 1, PureSocialDistancehas no social distance constraint τ thus its entropy becomeshigher than that of DCPGS. DCPGS outperforms SimRank-basedDCPGS, meaning that our proposed social distance is better thanthe SimRank-based DsimS (pi, pj) distance. When k = 1310, theaverage social entropy of all the methods is larger than in the casewhere k = 313, since a larger number of network communitiesincreases the probability that users in a single cluster belong todiverse communities. The average cluster size increases with �,increasing the probability that the visitors of a cluster belong todifferent communities; thus, the average social entropy increases.In addition, the clustering result becomes stable at large values of�, thus the social entropy converges. By visualization, we observedthat � should be set to a value smaller than 0.5 for the clusteringresults to be interesting.

    Figures 4(c) and 4(d) show the average social entropiesof DCPGS, DCPGS based on SimRank, CommuteTime, Katz,

    7. This is different from Metis used as a competitor of DCPGS in GeoSNplace clustering (discussed in Section 2.1.1).

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    9

    and Jaccard, and the graph-based clustering methods Metis andLinkClustering, when varying the social distance constraint τ .Similar to the case when � varies, the social entropy increasesand then stabilizes as τ increases, except for the entropy of Metis,which keeps increasing due to the network partitioning methodol-ogy of Metis with the increase of τ , the constructed place networkbecomes less connected, however, due to its partitioning nature,Metis puts disconnected places in the constructed place networkinto same cluster. When τ is less than 0.5, the social entropyof CommuteTime is zero, since with these distance measures theplaces in each cluster are visited by only one person when τ < 0.5.Jaccard has low social entropy also due to the small sizes of itsclusters. For τ ≤ 0.1 SimRank-based DCPGS fails to find anyclusters, therefore the entropy is 0. After investigation, we foundthat there is no pair of places pi, pj with DsimS (pi, pj) < 0.2because of the decay factor φ. When τ = 0.2, SimRank has a lowsocial entropy, since only few (987) clusters of small size are foundcompared to the 3605 clusters discovered by DCPGS. After thepoint where the two approaches find a similar number of clusters(e.g., at τ = 0.5, SimRank finds 5880 clusters, while DCPGSfinds 6742 clusters), DCPGS has constantly lower entropy thanSimRank. In addition, DCPGS is less sensitive to τ compared toSimRank. DCPGS outperforms the two graph-based competitorsMetis and LinkClustering. As we observed by visualization, inpractice τ should be set to a value higher than 0.5, because avery tight social distance constraint creates too few and too smallgeo-social clusters.

    Figures 4(e) and 4(f) show the average social entropies of thevarious versions of DCPGS and all the competitor approacheswhen varying the spatial distance constraint maxD (eps forDBSCAN). DCPGS is superior to SimRank-based DCPGS, DB-SCAN, LinkClustering and Metis for all values of maxD (eps).In general, the social entropies of all methods are not very sensitiveto maxD . For Gowalla, a good value for maxD is around 100;large maxD values result in clusters that are spatially too loose.

    Figure 5 compares the social entropy of SNN-based model(when varying epssnn ) and DCPGS (when varying �), whereSNN−60 and SNN−120 denote SNN-based model withmaxD = 60 and maxD = 120, respectively. Parameter τ is fixedat 0.7 and MinPts is set to 5. The first row of the x-axis representsthe values of � in DCPGS, while the second row contains thevalues of epssnn in SNN-based model. The range of � in DCPGSis [0,1]. Parameter epssnn in SNN is an integer in the range of[0,+∞). Small � in DCPGS can be translated into large epssnnin SNN-based model, which indicates that the condition of a placebeing a core is that there should be enough places with small geo-social distances surrounding it. We observe that the social entropyof SNN-based model does not change much with epssnn and itis worse than DCPGS with small �, and comparable with DCPGSwith large �.

    3.2.2 Community Score based EvaluationGiven a GeoSN place cluster PC , let UPC be the set of userswho visit the places in PC , i.e., UPC = ∪p∈PCUp. Assume eachUPC is a community in the GeoSN. We adopt the eight networkcommunity multi-criterion scores surveyed in [19] to computethe community score of UPC for each cluster PC . Figure 6compares the results of DCPGS and its alternatives on Gowalla(Katz is omitted because its result is quite similar to that ofCommuteTime), in terms of the internal density and conductancescores. We group the clusters discovered by each method by size

    △ DCPGS ▽ DBSCAN ▼ PureSocialDistance☆ Jaccard × SimRank ◻ Katz● CommuteTime ∎ LinkClustering ▷ Metis

    0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.5

    1

    1.5

    2

    2.5

    ²

    Soc

    ial E

    ntro

    py

    (a) k = 3130.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.5

    1

    1.5

    2

    2.5

    ²

    Soc

    ial E

    ntro

    py

    (b) k = 1310

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    4

    τ

    Soc

    ial E

    ntro

    py

    (c) k = 3130 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

    0

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    4

    τ

    Soc

    ial E

    ntro

    py

    (d) k = 1310

    10 50 100 150 200 250 3000

    0.5

    1

    1.5

    2

    2.5

    3

    3.5

    (DBSCAN) or (others)

    Soc

    ial E

    ntro

    py

    meps maxD

    (e) k = 31310 50 100 150 200 250 3000

    0.51

    1.52

    2.53

    3.54

    (DBSCAN) or (others)

    Soc

    ial E

    ntro

    py

    mmaxDeps

    (f) k = 1310

    Fig. 4. Social entropy evaluation in Gowalla

    0.110

    0.29

    0.38

    0.47

    0.56

    0.65

    0.74

    0.83

    0.92

    1.01

    ² (epssnn)

    0.0

    0.5

    1.0

    1.5

    2.0

    2.5

    Soci

    al E

    ntro

    py

    DCPGSSNN-60SNN-120

    (a) k = 313

    0.110

    0.29

    0.38

    0.47

    0.56

    0.65

    0.74

    0.83

    0.92

    1.01

    ² (epssnn)

    0.0

    0.5

    1.0

    1.5

    2.0

    2.5

    Soci

    al E

    ntro

    py

    DCPGSSNN-60SNN-120

    (b) k = 1310

    Fig. 5. Social entropy of SNN geo-social clusters in Gowalla

    and compute and plot the average community score (i.e., internaldensity and conductance) for each cluster size group. The resultsbased on the other six criteria of [19] are similar and we omitthem due to lack of space. The internal density of UPC is definedby 1−mUPC /(∣UPC ∣(∣UPC ∣− 1)/2), where mUPC is the numberof edges, which belongs to E and whose two endpoints are bothin UPC , mUPC = ∣{(u, v)∣u ∈ UPC , v ∈ UPC , (u, v) ∈ E}∣. Con-ductance is the fraction of edges, which belongs to E, from nodesof UPC that point outside UPC , i.e., oUPC /(2mUPC + oUPC ),where oUPC = ∣{(u, v)∣u ∈ UPC , v ∉ UPC , (u, v) ∈ E}∣. Letf(UPC ) be the community score of PC , based on either internal

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    10

    density or conductance; a smaller value of f(UPC ) indicatesbetter social quality.

    0 50 100 150 200 250 300 350 4000.55

    0.6

    0.65

    0.7

    0.75

    0.8

    0.85

    0.9

    0.95

    1

    Cluster size

    Inte

    rnal

    den

    sity

    PureSocialDistanceDBSCANSimRankDCPGS

    (a) Internal Density

    0 50 100 150 200 250 300 350 4000.75

    0.8

    0.85

    0.9

    0.95

    1

    Cluster sizeCo

    nduc

    tanc

    e

    PureSocialDistanceDBSCANSimRankDCPGS

    (b) Conductance

    0 50 100 150 200 250 300 350 4000.55

    0.6

    0.65

    0.7

    0.75

    0.8

    0.85

    0.9

    0.95

    1

    Inte

    rnal

    den

    sity

    Cluster size

    MetisLinkClusteringCommuteTimeJaccardDCPGS

    (c) Internal Density

    0 50 100 150 200 250 300 350 4000.75

    0.8

    0.85

    0.9

    0.95

    1

    Con

    duct

    ance

    Cluster size

    MetisLinkClusteringCommuteTimeJaccardDCPGS

    (d) Conductance

    Fig. 6. Community score evaluation in Gowalla

    As Figures 6(a) and 6(c) show, the internal density increaseswith the cluster size. As the size of a place cluster increases, thedenominator of the internal density formula increases quadrati-cally while the number of social links between users in the cluster(i.e., the numerator) does not increase at the same pace. On theother hand, Figures 6(b) and 6(d) show that conductance initiallydecreases as the size of a cluster increases and fluctuates randomlyfor larger UPC sizes, which is in line with the observations in [19].In Figure 6, we observe that the geo-social clusters discoveredby DCPGS have better community scores (i.e., lower internaldensity and conductance scores) compared to all competitors,except PureSocialDistance. Since PureSocialDistance uses oursocial distance in clustering and disregards spatial proximity, itssocial quality is expected to be better than that of DCPGS; still, asshown in Figure 2(c), its clusters are not distinguishable spatially.DCPGS outperforms DBSCAN and SimRank-based DCPGS. Thequality gap between DCPGS and the 3 competitors in Figure 6(a)and 6(b) narrows as the size of clusters increases, since it is moredifficult for a larger UPC (usually obtained from a larger PC )to maintain a community-like structure compared to a smallerUPC [19]. This indicates that our social distance is effective infinding geo-social clusters with small or medium size. DCPGSis also generally better than the four competitors in Figures 6(c)and 6(d). CommuteTime has better community scores when thecluster size is around 50. Most community scores of Jaccard,CommuteTime, LinkClustering, and Metis concentrate at the top-left corner of Figures 6(c) and 6(d), which indicates that thesecompetitors have limited ability to discover geo-social clustersof various sizes. Furthermore, the quality gap between DCPGSand the five competitors in Figures 6(c) and 6(d) grows when thecluster size increases. We conclude that DCPGS (paired with oursocial distance measure) is the most effective method in findinggeo-social clusters with both good social quality and identifiablespatial contour.

    3.3 Temporal Effects on Geo-Social ClustersIn this section, we illustrate the discovered temporal-geo-socialclusters in the area of Manhattan on the Gowalla dataset and in thearea of Chicago on the Brightkite dataset using the three methodsintroduced in Section 2.3.History-Frame Geo-Social Clustering. Figure 7 shows thetemporal-geo-social clusters found using continuous historyframe method in periods 01/08/2009–31/01/2010, 01/02/2010–31/07/2010, and 01/08/2010-31/01/2011 in the area of Manhattan.We observe that some clusters evolve over time. For instance, thecluster in region A first expanded and then shrinked. During period01/02/2010–31/07/2010, there are multiple clusters in region B,while in period 01/08/2010-31/01/2011, those clusters are mergedinto one big cluster. In region C, before there exists no cluster,while later a new cluster appeared. Figure 8 shows the temporal-geo-social clusters found using the periodic history frames, i.e., onworking days and weekends. We observe different place clusterson working days and weekends which is expected, since normallypeople visit places related to work on working days, while visitentertainment places on weekends.

    ●●●●●

    ●●

    ●●●●●

    ●●●●●●

    ●●●

    ●●●●

    ●●

    ●●●● ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●●●●●

    ●●●

    ●●●●●

    ●●●

    ●●

    ●●

    ●●●●●

    ●●

    ●●

    ●●

    ●●

    ●●● ●●

    ●●

    ●●

    ●●●● ●●

    ●●

    ●●

    ●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    20

    40

    60

    d$cluster

    d$names● 0

    1

    2

    3

    4

    5

    (a) Working Days

    ●●●●

    ●●●

    ●●

    ●●●●●●

    ●●

    ●●●●

    ●● ●

    ●●

    ●●

    ●●

    ●●●●

    ●●

    ●●

    ● ●●●●

    ●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    10

    20

    30

    40

    d$cluster

    d$names● 0

    1

    2

    3

    4

    5

    (b) Weekends

    Fig. 8. Periodic History Frame Geo-Social Clustering in Manhattan: � =0.5, MinPts = 5, maxD = 120, τ = 0.7, ω = 0.5.

    Temporally Contributing Users. Figure 9 shows the temporal-geo-social clusters found in Manhattan when parameter θ oftemporally contributing users is set to 1 week, 1 month, and6 months, respectively. When θ is a short time interval (e.g., 1week), the result reveals that the places in the same clusters arerevisited by socially connected users within a short period of time.As θ increases, the temporal-social distances between more placesdecrease, thus as expected (1) some clusters expand, (2) multipleclusters merge (e.g., regions A and B), (3) new clusters are found(e.g., region C). For marketing and management purpose, peoplemay be interested in both the short term and long term clusters.Damping Window. In Figure 11, we show the temporal-geo-social clusters found by the damping window method in Man-hattan, where the starting time is 01/02/2009 and the currenttime is set to 31/07/2010 and 31/01/2011, respectively. To betterunderstand the effect of the damping window, Figure 10 shows theclusters found by DCPGS on the same data in the same periods asthe damping window method. We observe that both the size of theclusters and the number of clusters found by the damping windoware smaller than DCPGS. This is expected, since weighing the olddata less increases the temporal-social distances between places.Under the same parameter setting, more regions in the dampingwindow are considered as sparse. However, the advantage of thedamping window is offering the up-to-date clustering result. As we

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    11

    ●●●●

    ●●●

    ●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    5

    10

    15d$clusterA

    (a) 01/08/2009–31/01/2010

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●●●●

    ●●●●●

    ● ●

    ●●

    ●●

    ●● ●●● ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●●

    ●●●●

    ●●

    ●●●●●

    ●●●●

    ●●

    ●●●

    ●●●●

    ●●●●●●

    ●●●●●●

    ●●

    ●●●●●●●●

    ●●

    ●●●

    ●●

    ●●●●

    ● ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    20

    40

    60

    d$clusterA

    B

    C

    (b) 01/02/2010–31/07/2010

    ●●

    ●●

    ●●●●●

    ●●

    ●●

    ●●●

    ●●●

    ●●

    ●●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●●●●

    ●●●

    ●●●

    ●●●

    ●●●

    ●●

    ●●●

    ●●●●

    ●●●●

    ●●●

    ●●

    ●●●●●

    ●●●●●●

    ●●●●●●

    ●●● ●●●

    ●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●●●

    ● ●●●●

    ●●

    ●●

    ●●

    ●●●

    ●● ●

    ●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    10

    20

    30

    40

    d$clusterA

    B

    C

    (c) 01/08/2010-31/01/2011

    Fig. 7. Continuous History Frame Geo-Social Clustering in Manhattan: � = 0.5, MinPts = 5, maxD = 120, τ = 0.7, ω = 0.5.

    ●●●●●●●●●

    ●●

    ●●

    ●●

    ●●●●

    ●●●

    ●●●●

    ●●●●

    ●●●●●

    ●●●●●●●

    ●●

    ●● ●●

    ● ●

    ●●

    ● ●

    ●●●

    ●●●●●●●●

    ●●

    ●●

    ●●●●

    ●●● ●●●●

    ●●●

    ●●●●●●

    ●●●●●

    ●●

    ●●●●●

    ●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    10

    20

    30

    40

    50

    d$cluster

    AB

    C

    (a) θ = 1 week

    ●●●●●●●●●●

    ●●●

    ●●

    ●●

    ●●

    ●●●●

    ●●●

    ●●●

    ●●●

    ●●●

    ●●

    ●●●

    ●●●●

    ●●●●●●●

    ●●●●

    ●●● ●●●

    ●●

    ● ●

    ● ●

    ●●●

    ● ●

    ●●●

    ●●●●●●●●

    ●●

    ●●●●

    ●●● ●●●●

    ●●●

    ●●●●●●

    ●●●●●

    ●●●●●

    ● ●●●●

    ●●●●

    ●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    20

    40

    60d$cluster

    AB

    C

    (b) θ = 1 month

    ●●

    ●●● ●●

    ●●●

    ●●●

    ●●

    ●●

    ●●●●●

    ● ●

    ●●●

    ●●

    ●● ●

    ●●●●●●

    ●●●

    ●●●

    ●●●

    ●●●

    ●●

    ●●●

    ●●●●

    ●●●●●

    ●●● ●●●

    ● ●

    ●●

    ●●

    ● ●

    ●●●

    ● ●

    ●●●

    ●●●●●●●●

    ●●

    ●●●●

    ●●● ●●●●

    ●●

    ●●●●●●●●

    ●●●●

    ●●

    ●●

    ●●●●●

    ●●●

    ●●

    ●●

    ●●

    ●●●●●

    ●●●●●●

    ● ●●

    ●●

    ●●●

    ● ●●

    ●●

    ●●●●●

    ●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    20

    40

    60

    d$cluster

    AB

    (c) θ = 6 months

    Fig. 9. Temporally Contributing Users in Manhattan: � = 0.5, MinPts = 5, maxD = 120, τ = 0.7, ω = 0.5

    can see in Figure 11, the discovered clusters for the two periods aredifferent, which are sensitive to time. For instance, by 31/07/2010,clusters are found on Lafayette St., while by 31/01/2011, clustersare found close to Washington Square Park. Nevertheless, it isnot easy to notice these interesting up-to-date small clusters inFigure 10 because of the accumulated old data.

    In summary, the three ways of considering the temporalinformation in the geo-social clustering yield different temporal-geo-social clusters, which may serve various purposes of analysisand investigation. The history-frame method offers the evolutionof clusters over time. The damping window method generatesthe up-to-date clusters. The temporally contributing user methodshows the places that are revisited within a period of time. Moreresults from the area of Chicago on the Brightkite data set can befound in Appendix A.Social Entropy based Evaluation Figure 12 shows the socialentropy of the temporal-geo-social clusters discovered by thethree methods introduced in Section 2.3 compared with the socialentropy of the geo-social clusters found by DCPGS. Recall that alow social entropy means that the great majority of the visitors ofa place cluster come from the same community, indicating that theplaces in the cluster have tighter social relationships between eachother. We observe that the clusters found by the damping window

    ●●

    ●●●

    ●●

    ●●

    ●● ●

    ●●

    ● ●

    ●●

    ●●●●●●

    ●●●●

    ●●

    ●●

    ●●●●●●●●

    ●●

    ●●●●

    ● ●

    ●●

    ●●●●

    ●●

    ●●●●●

    ●●●● ●

    ●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    20

    40

    60d$cluster

    (a) 01/02/2009–31/07/2010

    ●●●●●●

    ●●

    ●●

    ●●●

    ●●●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ● ●

    ●●

    ●●

    ●●●●●

    ●●

    ●●●●●●●●

    ●●

    ●●●●

    ●●●

    ●●

    ● ●●

    ●● ●

    ●●●●●●●●●●● ●

    ●●

    ●●●●

    ●●●●● ●●●

    ●●

    ●●

    ●●●

    ●● ●●●

    ●●●●●●●

    ●●

    ●●●●●●●●●

    ●●●

    ●●●

    ●●

    ●●●

    ●●

    ●●

    ●●

    ●●

    ●●

    ●●●●

    ●●

    ● ●●●

    ●●●●●●●●●●●●●●●

    ●●

    ●●

    ●●●●●●

    ●●

    ●●●●●●●●

    ●●

    ●●

    ●●●

    ●●

    ●●●●

    ● ●

    ●●

    ● ●●

    ●●●

    ●●

    ●●●●

    ●●

    ●●●●●

    ●●●●

    ●●

    ●●

    ●●● ●

    ●●●

    ●●●●●

    ●●

    ●●

    ●●●●●●●●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    20

    40

    60

    d$cluster

    (b) 01/02/2009-31/01/2011

    Fig. 10. DCPGS in Manhattan: � = 0.5, MinPts = 5, maxD = 120, τ =0.7, ω = 0.5

    method (DCPGSDW-Exp) and the temporally contributing usermethod (DCPGSTT) have better (lower) social entropy, comparedto the result found by DCPGS that does not consider the tem-porally information. Furthermore, the improvement achieved bydamping window method is larger than that of the temporallycontributing user method. However, the social entropy of the

  • 1041-4347 (c) 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

    This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2017.2782256, IEEETransactions on Knowledge and Data Engineering

    12

    ● ●●●

    ●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    2.5

    5.0

    7.5

    d$cluster

    (a) 01/02/2009–31/07/2010

    ●●●

    ●●●

    40.72

    40.73

    40.74

    40.75

    −74.01 −74.00 −73.99 −73.98 −73.97lon

    lat

    d$names● 0

    1

    2

    3

    4

    5

    2.5

    5.0

    7.5

    10.0d$cluster

    (b) 01/02/2009-31/01/2011

    Fig. 11. Damping Window in Manhattan: � = 0.5, MinPts = 5, maxD =120, τ = 0.7, ω = 0.5

    clusters found by the history frame method (DCPGSHF) is worsethan the result of DCPGS. DCPGSHF in fact splits the wholedata into several sub-datasets based on time frames and performsclustering on these sub-datasets. The social entropies of theclusters from these sub-datasets should not necessarily be smallerthan the clusters from the whole data. DCPGSTT and DCPGSDW-Exp require more intra temporal closeness among places withinthe temporal-geo-social cluster. The smaller entropies of thesetwo methods indicates that temporal constraints can enhance thesocial connections within a cluster. According to the analysisin previously sections, the result of the community score basedevaluation is consistent with the result of the social entropy basedevaluation, and thus is omitted.

    4 RELATED WORKOur clustering problem is related to various research topics,including traditional spatial clustering, using mobility data toanalyze places, clustering using spatial and non-spatial attributes,studying the relationship between spatial and social attributes,community detection, and other work on GeoSNs.Spatial Clustering. Spatial clustering algorithms, surveyed in[20], are divided into three categories: partitioning, hierarchicaland density-based clustering. Partitioning methods, including k-means, k-medoids, and CLARANS [21], are good at findingspherical-shaped clusters in s