
Community Detection Using Geosocial Data and Geographical Point-Set Distances

Ryan de Vera, Qui Pham, Juhyun Kim

August 9, 2013

Abstract

The 2011 and 2012 UCLA REU Social Networks groups examined gang activity in Hollenbeck and attempted to predict gang affiliations from the geosocial information on LAPD Field Interview cards using the spectral clustering algorithm. In the present work, we continue this approach and aim to achieve a better clustering by using a more meaningful distance measure and by modeling the geography and natural boundaries of the network. The temporal component of the data and the effects of gang injunctions are discussed as well. Lastly, we compare the algorithmically generated communities to our ground truth to measure the quality of the clusters with three different metrics: purity score, z-Rand score, and normalized mutual information.

1 Introduction

Our research revolves around data from Hollenbeck, a policing district in East Los Angeles that is known for high gang activity and violence. There are about 31 known gangs in the area, which explains the prevalence of violence and gang rivalry in Hollenbeck. Not surprisingly, this unique area has drawn the interest of criminologists and anthropologists, not to mention the Los Angeles Police Department. For the previous two years, the UCLA REU Social Networks groups have worked on this gang data. In 2011, the REU group analyzed the community structure of one year of data that was cleaned by hand [8]. The 2012 Social Networks group designed an automatic algorithm that cleaned up the inconsistent, noisy data [7].

From 2000 to 2011, field interview (FI) cards were collected by LAPD officers. Whenever a police officer had an interaction with members of the public, the individuals involved were required to fill out a card, which has about 60 fields. These entries contain useful information for our research, including stop location, gang affiliation, gang territory, and time of stop. In the data, people are represented by their geographic and social information, namely the geographical coordinates of their stop locations and who they were stopped with (social connections). This year we focused on a subset of the data from previous years by eliminating individuals without social connections, i.e. those who were only ever stopped alone. Using the social and geographic data, we aim to predict the gang affiliations of individuals. In the process, we have used different methods of clustering and community detection and compared the results. In addition, we are interested in the temporal component of the data. Since the data spans more than ten years, connections between some individuals may be more or less relevant depending on how far apart in time the stops occurred. In particular, we look at gang injunctions. A gang injunction is a court order that prohibits members of a named gang from being seen together in defined areas. If multiple gangs are named in an injunction, interactions between members of the different gangs under the injunction are also forbidden. After consulting criminologists specializing in the region, we expect gang activity to be affected by the enforcement of an injunction. Many of the ideas and methods developed during this project have been designed to be unsupervised so that they can be easily adapted to a wide range of applications and sociological environments.

1.1 Motivation

Community detection is applicable to a large number of fields, including computer science, engineering, network science, criminology, and anthropology. We aim to create an algorithm that automatically detects community structure given sparsely observed geographic and social information.

1.2 Setting

Figure 1: Map of Hollenbeck with 31 Gang Territories

Hollenbeck, located east of downtown Los Angeles, is bounded by the Los Angeles River in the west, the Pasadena Freeway in the north, Route 710 in the northeast, and the city of Vernon in the south. It is separated into multiple regions by Interstates 5 and 10, Routes 60 and 101, and the railroad along Mission Road. These boundaries limit the interaction between different gangs and between those in Pasadena and East Los Angeles.

1.3 Data

Our data is based on the Field Interview cards, which were collected from 1998 to 2011 in the Hollenbeck area. The data contains 34303 cards, 8894 of which have known gang affiliations. Because some cards had missing or zero coordinates, we deemed only 8093 cards usable. These 8093 cards contain information on 1820 gang members, each of whom is given a unique ID, a gang affiliation, a set of geographical coordinates of his or her non-criminal stop locations, and a set of times associated with those stops. Unlike previous years' groups, we eliminated individuals without social connections, i.e. those who were only stopped alone.


Figure 2 shows the stop locations and true gang affiliations for the 1820 individuals that we take as the ground truth in our analysis. These 1820 unique individuals were identified by the Soft-TFIDF with tensor products algorithm from [7]. Each dot represents a stop of an individual from 2000 to 2012. These locations may not be unique to one individual; a single data point may represent multiple people and/or multiple points in time. Each dot is color coded by the known gang affiliation among the 31 gangs in Hollenbeck. Although some points of the same color are bundled together, the gangs are not linearly separable on the basis of their members' activity locations. Therefore, we apply spectral clustering to see whether we can predict the gang affiliations of individuals in the Hollenbeck district.

Figure 2: Ground Truth for Hollenbeck Data

2 Methodology

Since k-means [2], one of the most popular clustering algorithms, cannot separate clusters that are not linearly separable, we employed the spectral clustering algorithm.

2.1 Spectral Clustering

Following the work of the 2011 REU and 2012 REU, the first method we applied is spectral clustering. Recently, spectral clustering [6] has been favored over other clustering methods such as k-means due to its simplicity and its ability to separate data that is not linearly separable [3]. We use the normalized spectral clustering algorithm [4] to cluster the data points:


Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Adjacency matrix $A \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct

1. Compute $D = (d_{ij})$, where $d_{ij} = \sum_{l=1}^{n} a_{il}$ if $i = j$ and $d_{ij} = 0$ if $i \neq j$

2. Compute $L = I - D^{-1/2} A D^{-1/2}$

3. Compute the $k$ eigenvectors $v_1, \dots, v_k$ associated with the $k$ smallest eigenvalues of $L$

4. Compute $U = (u_{ij})$, where $u_{ij} = v_{ij} / \left( \sum_{l=1}^{k} v_{il}^2 \right)^{1/2}$

5. Cluster $y_i = (u_{ij})_{j=1,\dots,k}$, $i = 1, \dots, n$, into $G_1, \dots, G_k$ with k-means or k-medoids

Output: Clusters $C_1, \dots, C_k$ with $C_i = \{ j \mid y_j \in G_i \}$
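The steps of the algorithm can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the MATLAB code used in the project: the function names, the toy adjacency matrix, and the deterministic farthest-point initialization of k-means (chosen here only so that the example is reproducible) are all hypothetical.

```python
import numpy as np

def spectral_embedding(A, k):
    """Steps 1-4: rows of the k smallest eigenvectors of L, scaled to unit length."""
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)                      # diagonal of D (assumed positive)
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)               # eigh returns eigenvalues in ascending order
    V = eigvecs[:, :k]
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def kmeans(Y, k, n_iter=100):
    """Step 5: Lloyd iterations with farthest-point initialization (an assumption)."""
    centers = [Y[0]]
    for _ in range(1, k):
        nearest = np.min([np.linalg.norm(Y - c, axis=1) for c in centers], axis=0)
        centers.append(Y[nearest.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(Y[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([Y[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels

def spectral_clustering(A, k):
    return kmeans(spectral_embedding(A, k), k)

# Toy graph: two triangles joined by a single weak edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.01
labels = spectral_clustering(A, 2)
```

On this toy graph the two triangles should end up in different clusters. With random initialization, as used in the experiments, k-means would be repeated several times.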

3 Measure of Similarity

For spectral clustering, we first construct a similarity graph with adjacency matrix $A \in \mathbb{R}^{n \times n}$. The similarity matrix is

$$A = (a_{ij}) = \alpha S + (1 - \alpha) G,$$

where α ∈ [0, 1] is a parameter that controls the relative weight of the social and geographic information; see [8] for details. The closer α is to 0, the more the geographic component contributes, while α = 1 means that only the social component is used.
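As a small illustration of how α trades off the two components, the following sketch forms the combined matrix for a hypothetical pair of individuals. The function name and the numbers in S and G are assumptions made up for the example.

```python
import numpy as np

def combined_similarity(S, G, alpha):
    """A = alpha * S + (1 - alpha) * G for alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * np.asarray(S, dtype=float) + (1.0 - alpha) * np.asarray(G, dtype=float)

# Hypothetical pair: never stopped together, but geographically close.
S = np.array([[1.0, 0.0], [0.0, 1.0]])
G = np.array([[1.0, 0.8], [0.8, 1.0]])
A = combined_similarity(S, G, alpha=0.9)
```

With α = 0.9 the off-diagonal similarity is carried entirely by the geographic term, scaled by 1 − α.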

3.1 Social Matrix

The social matrix is a matrix whose entries represent the social connection between individuals. We create the frequency matrix $F = (f_{ij})$, where $f_{ij} = |O_i \cap O_j|$ counts the number of times gang members i and j were stopped together. We then create two models for the social matrix $S = (s_{ij})$ to capture the social similarity between gang members:

Binary Model Following the work of the Social Networks REU groups from 2011 [8] and 2012 [7], we first used the following binary model:

$$s_{ij} = \begin{cases} 1 & f_{ij} > 0 \\ 0 & f_{ij} = 0. \end{cases}$$

If i and j were stopped together at least once, the entry $s_{ij}$ of the social matrix is 1; otherwise it is 0. Although the binary model requires minimal computation and makes the social matrix S symmetric, it lacks differentiating power because it ignores valuable information such as the magnitude of the frequencies $f_{ij}$. For example, if individuals 1 and 2 were stopped together 20 times while individuals 1 and 3 were stopped together only once, the difference is not reflected in the social matrix, since $s_{12} = s_{13} = 1$.


Logarithmic Model To create a new social similarity function that captures the magnitude of the frequencies $f_{ij}$ and is resilient to outliers, we require the function s to satisfy the following conditions:

1. Mapping the set of natural numbers to [0, 1], i.e. $s : \mathbb{N} \to [0, 1]$

2. Being non-decreasing, i.e. $\forall f_i, f_j \in \mathbb{N},\ f_j \ge f_i \Rightarrow s(f_j) \ge s(f_i)$

3. Flattening out, i.e. $\forall f \in \mathbb{N},\ s(f + 2) - s(f + 1) \le s(f + 1) - s(f)$

4. Reaching the maximum value after a certain threshold t, i.e. $\forall f \in \mathbb{N},\ f > t \Rightarrow s(f) = 1$.

Figure 3: Social Similarity Function

The following logarithmic model satisfies the required conditions:

$$s_{ij} = \begin{cases} \dfrac{\ln(f_{ij}) + 1}{\ln(\max_{i,j} f_{ij}) + 1} & f_{ij} < t \\ 1 & f_{ij} \ge t. \end{cases}$$

We choose t = 5 for our computation.
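The two social models can be compared on a small frequency matrix. The sketch below is illustrative only: the function names are hypothetical, and setting the similarity to 0 for pairs that were never stopped together (f = 0, where the logarithm is undefined) is an assumption carried over from the binary model rather than something the text specifies.

```python
import numpy as np

def binary_social(F):
    """s_ij = 1 if the pair was stopped together at least once, else 0."""
    return (np.asarray(F) > 0).astype(float)

def logarithmic_social(F, t=5):
    """Logarithmic model with threshold t; entries with f = 0 are set to 0 (assumption)."""
    F = np.asarray(F, dtype=float)
    S = np.zeros_like(F)
    below = (F > 0) & (F < t)
    S[below] = (np.log(F[below]) + 1.0) / (np.log(F.max()) + 1.0)
    S[F >= t] = 1.0
    return S

# Individuals 1 and 2 were stopped together 20 times, 1 and 3 only once.
F = np.array([[0, 20, 1],
              [20, 0, 0],
              [1, 0, 0]])
S_bin = binary_social(F)
S_log = logarithmic_social(F, t=5)
```

The binary model assigns the same similarity to both pairs, while the logarithmic model separates them: the pair stopped 20 times reaches the maximum value and the pair stopped once receives 1 / (ln 20 + 1).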

3.2 Geographical Matrix

The Social Networks REU groups of 2011 [8] and 2012 [7] use the L2 distance between the arithmetic means of all geographical coordinates of two people to measure the geographical connection between the two individuals:

$$d(O_i, O_j) = d_{L_2}\left( \frac{\sum_{(x_k, y_k) \in O_i} (x_k, y_k)}{|O_i|},\ \frac{\sum_{(x_l, y_l) \in O_j} (x_l, y_l)}{|O_j|} \right).$$


Figure 4: Eigenvectors 2, 3, 4, 5, 6, and 7 based on average locations. 31 clusters and 31 eigenvectors were chosen to represent the 31 gangs in the eigenspace.

After averaging k-means runs, the following result was obtained [see Figure 5]. These results are consistent with what was found in the work from 2011 and 2012.

Figure 5: Clustering Result with Average Location (α = 0.9, purity = 0.4879, z-Rand = 209.1098)

Though the distance function requires only simple computation, it suffers from serious limitations, as indicated by the purity score above:

1. Lacking differentiating power, i.e. the function cannot distinguish individuals whose geographical coordinates have equal arithmetic means, e.g. O1 = {−20, 20}, O2 = {−3, 1, 2}, O3 = {0};

2. Being vulnerable to outliers, e.g. considering O2 = {−10} more geographically similar to O3 = {0} than O1 = {−55, −3, 0, 1, 2} is to O3;

3. Assigning improper geographical similarity by not utilizing the information stored in the multiple geographical coordinates of individuals, e.g. considering O2 = {−3, 3} more geographically similar to O3 = {0} than O1 = {1, 1, 1, 1} is to O3; and

4. Ignoring native geographical information, such as boundaries, railroads and freeways, and impassable terrain.
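The first three limitations can be checked numerically on the one-dimensional examples given in the list. The helper name below is hypothetical; the sketch simply evaluates the mean-location distance on those sets.

```python
import numpy as np

def mean_location_distance(Oi, Oj):
    """L2 distance between the arithmetic means of two sets of stop locations."""
    Oi = np.asarray(Oi, dtype=float).reshape(len(Oi), -1)
    Oj = np.asarray(Oj, dtype=float).reshape(len(Oj), -1)
    return float(np.linalg.norm(Oi.mean(axis=0) - Oj.mean(axis=0)))

origin = [0]

# Limitation 1: different sets with identical means are indistinguishable.
same_mean = mean_location_distance([-20, 20], [-3, 1, 2])

# Limitation 2: a single outlier (-55) pushes a set that mostly sits near 0
# farther from {0} than the lone point {-10}.
with_outlier = mean_location_distance([-55, -3, 0, 1, 2], origin)
lone_point = mean_location_distance([-10], origin)

# Limitation 3: {-3, 3} averages to 0 and so appears closer to {0}
# than {1, 1, 1, 1}, although every stop in the latter is nearer.
spread_out = mean_location_distance([-3, 3], origin)
tight = mean_location_distance([1, 1, 1, 1], origin)
```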

3.2.1 Point-Set Distances

To devise a distance function with good differentiating power, we need not only to utilize the multiple geographical coordinates of individuals, instead of aggregating them, but also to preserve the symmetry of the geographical matrix G. Therefore, we try multiple directed distances between sets and symmetrizing functions as defined in the work of Dubuisson and Jain [1]:

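Directed distances

$$d(a, B) = \min_{b \in B} d(a, b)$$
$$d_1(A, B) = \min_{a \in A} d(a, B)$$
$$d_2(A, B) = {}^{50}K^{th}_{a \in A}\, d(a, B)$$
$$d_3(A, B) = {}^{75}K^{th}_{a \in A}\, d(a, B)$$
$$d_4(A, B) = {}^{90}K^{th}_{a \in A}\, d(a, B)$$
$$d_5(A, B) = \max_{a \in A} d(a, B)$$
$$d_6(A, B) = \frac{1}{|A|} \sum_{a \in A} d(a, B)$$

where ${}^{x}K^{th}_{a \in A}$ is the K-th ranked distance such that K/|A| = x%

Symmetrizing functions

$$f_1(d(A, B), d(B, A)) = \min(d(A, B), d(B, A))$$
$$f_2(d(A, B), d(B, A)) = \max(d(A, B), d(B, A))$$
$$f_3(d(A, B), d(B, A)) = \frac{d(A, B) + d(B, A)}{2}$$
$$f_4(d(A, B), d(B, A)) = \frac{|A|\, d(A, B) + |B|\, d(B, A)}{|A| + |B|}$$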

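We define the point-set distances $d_{S_{ij}}(A, B) = f_i(d_j(A, B), d_j(B, A))$. Note that the normal Hausdorff distance is

$$d_H = d_{S_{25}} = \max\left( \max_{a \in A} d(a, B),\ \max_{b \in B} d(b, A) \right)$$

and the Modified Hausdorff distance [1] is

$$d_{MH} = d_{S_{26}} = \max\left( \frac{1}{|A|} \sum_{a \in A} d(a, B),\ \frac{1}{|B|} \sum_{b \in B} d(b, A) \right).$$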
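The 24 point-set distances can be generated from one small routine. The sketch below is an illustration under assumptions: the function names are hypothetical, the underlying point distance is Euclidean, and the K-th ranked distance rounds K up to the next integer when x% of |A| is not whole.

```python
import numpy as np

def as_points(O):
    """Turn a list of scalars or coordinate tuples into an (n, dim) array."""
    return np.asarray(O, dtype=float).reshape(len(O), -1)

def point_to_set(A, B):
    """d(a, B) = min over b of d(a, b), for every a in A (Euclidean d assumed)."""
    A, B = as_points(A), as_points(B)
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min(axis=1)

def kth_ranked(values, percent):
    """K-th smallest value with K/|A| = percent%, K rounded up (assumption)."""
    K = max(1, (percent * len(values) + 99) // 100)
    return np.sort(values)[K - 1]

def directed(j, A, B):
    """Directed distance d_j(A, B), j = 1..6."""
    d = point_to_set(A, B)
    if j == 1:
        return d.min()
    if j in (2, 3, 4):
        return kth_ranked(d, {2: 50, 3: 75, 4: 90}[j])
    if j == 5:
        return d.max()
    if j == 6:
        return d.mean()
    raise ValueError("j must be in 1..6")

def symmetrize(i, dab, dba, na, nb):
    """Symmetrizing function f_i, i = 1..4."""
    if i == 1:
        return min(dab, dba)
    if i == 2:
        return max(dab, dba)
    if i == 3:
        return (dab + dba) / 2.0
    if i == 4:
        return (na * dab + nb * dba) / (na + nb)
    raise ValueError("i must be in 1..4")

def point_set_distance(A, B, i, j):
    """d_{S_ij}(A, B) = f_i(d_j(A, B), d_j(B, A))."""
    return float(symmetrize(i, directed(j, A, B), directed(j, B, A), len(A), len(B)))

A, B = [0, 1], [0, 3]
hausdorff = point_set_distance(A, B, i=2, j=5)            # max of the two directed maxima
modified_hausdorff = point_set_distance(A, B, i=2, j=6)   # max of the two directed means
```

For these two sets the directed nearest-point distances are {0, 1} from A to B and {0, 2} from B to A, so the Hausdorff distance is 2 and the Modified Hausdorff distance is 1.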

3.2.2 Geographic Distance

In real life, the distance an individual must travel from one point to another is not the L2 distance between the two points, since the straight path between them is usually blocked by houses, freeways, railroads, and impassable terrain. Figure 6 shows that the highways of Hollenbeck, represented as red lines, are relatively impervious. Figure 7 is the result of adding more geographical features to Figure 6. The blue dotted line represents a train track that runs along Mission Road; the two small regions within region 1 are Rose Hill Park and Ascot Hill Park, respectively. We therefore constructed a geographic network that helps us define distances between different regions [refer to Figure 8].

Figure 6: Map of Hollenbeck with regions divided

Figure 7: Map of Hollenbeck with some railroads and parks included

Figure 8: The Geographic Boundary Network

The geographical distance, which measures the true distance that an individual at $x_i$ actually has to travel to reach another point $x_j$, is defined as the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup I, E)$, where $\Omega = \{x : x \in O_i \text{ for some } i\}$ is the set of all geographical coordinates of individuals in the data set, I is the set of all geographical coordinates of street intersections in the Hollenbeck area, and E is the set of edges between points in Ω ∪ I, weighted by the Lp distance between the points.

Since the street intersections are hard to obtain and overwhelming in number, and since searching for shortest paths between 5368 geographical coordinates on such an enormous graph poses a serious computational problem, we settle for an approximation: the geographical distance between two points $x_i$ and $x_j$ is taken to be the length of the shortest path between $x_i$ and $x_j$ on an undirected graph $G = (\Omega \cup P, E)$, where P is the set of geographical coordinates of passages from one region of the Hollenbeck area to another, and Ω and E are defined as above.

There are several passages that allow entrance or exit between regions. After examining them on Google Maps, we found about 24 passages that lead to different regions. To keep things simple, we first selected one passage among the multiple passages that lie on the border of two different regions. If two points are situated in the same region, we use the standard Lp distance as before. If two points are located in different regions, we calculate the sum of the distance from $x_i$ to the passage and the distance from $x_j$ to the passage.
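The simplified rule in the last paragraph (Lp distance within a region, a detour through the selected passage across regions) can be sketched as follows. The function names, the region labels, and the dictionary keyed by pairs of regions are hypothetical data structures chosen for the example; the sketch only covers pairs of regions that share a selected passage and raises a KeyError otherwise.

```python
import numpy as np

def lp_distance(x, y, p=2):
    """Standard Lp distance between two points."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def passage_distance(xi, region_i, xj, region_j, passages, p=2):
    """Approximate geographical distance between xi and xj.

    passages maps frozenset({r, s}) to the coordinates of the single
    passage selected on the border between regions r and s.
    """
    if region_i == region_j:
        return lp_distance(xi, xj, p)
    gate = passages[frozenset((region_i, region_j))]
    return lp_distance(xi, gate, p) + lp_distance(gate, xj, p)

# Hypothetical layout: regions 1 and 2 are connected by one passage at (1, 1).
passages = {frozenset((1, 2)): (1.0, 1.0)}
same_region = passage_distance((0.0, 0.0), 1, (2.0, 0.0), 1, passages)
across = passage_distance((0.0, 0.0), 1, (2.0, 0.0), 2, passages)
```

In this layout the straight-line distance is 2, while the route through the passage has length 2√2, so two stops on opposite sides of a barrier are treated as farther apart than their coordinates suggest.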

3.2.3 Geographical Similarity Measure

We combine the point-set distances and the geographical distance to create a geographical similarity measure for the geographical matrix G. We use different values of p to calculate the Lp distance when computing the edge values for the geographical distance $d_G$ between geographical coordinates, and then use the geographical distances to calculate $d(a, B) = \min_{b \in B} d(a, b)$ when computing the point-set distances. We define the geographical matrix $G = (g_{ij})$ with

$$g_{ij} = \exp\left( -\frac{d_{S_{kl}}^2(O_i, O_j)}{\sigma_i \sigma_j} \right).$$

The scale $\sigma_i = d_{S_{kl}}(O_i, O_K)$, where $O_K$ is the K-th nearest neighbor of the i-th individual $O_i$, controls the width of the similarity neighborhood of the i-th individual $O_i$.
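Given a matrix of pairwise point-set distances, the locally scaled similarity can be computed in a few lines. This is a sketch rather than the C++ routine used in the project; the function name is hypothetical, and it assumes a zero diagonal, more than K individuals, and a strictly positive K-th neighbor distance for every individual.

```python
import numpy as np

def locally_scaled_similarity(D, K=7):
    """g_ij = exp(-d_ij^2 / (sigma_i * sigma_j)), sigma_i = distance to the K-th nearest neighbor."""
    D = np.asarray(D, dtype=float)
    # Each sorted row starts with the zero self-distance, so column K holds
    # the distance to the K-th nearest neighbor.
    sigma = np.sort(D, axis=1)[:, K]
    return np.exp(-D ** 2 / np.outer(sigma, sigma))

# Three individuals located at 0, 1 and 3 on a line; K = 1 for this tiny example.
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])
G = locally_scaled_similarity(D, K=1)
```

Because each pair is scaled by both of its own neighborhood widths, the resulting matrix stays symmetric even though the widths differ between individuals.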

4 Temporal Information and Gang Injunctions

Now we take the temporal component of the data into account. Since the data spans more than ten years, social connections between individuals may be drawn inappropriately. Figure 9 plots the stop locations of individuals from 2001 to 2011; each color represents a different year.

Figure 9: Stop Locations Over Time (2001-2011). This plot displays the stop locations of unique individuals over the years 2001-2011. We look at these stop locations for the years 2006 and 2007 to see if our clusters change because of the gangs placed under injunction in these years.

To further extend our analysis, we analyze two gang injunctions, which prohibit alleged gang members and their associates from doing certain things within a defined area or neighborhood. There have been seven gang injunctions in total in the Hollenbeck area. Out of those seven, we only look at four injunctions, which were issued in 2006 and 2007. The injunction issued in 2006 was for the gang White Fence, and the injunctions issued in 2007 were for the gangs Clover, Eastlake and Lincoln Heights. We would like to see the effects of these court orders over the two years on our clustering results.

When looking at the clusters produced for these two years, we see that the purity of the clusters for 2007 is higher than the purity of the clusters for 2006 [refer to Figure 10]. This may be attributed to the gang injunctions that were issued. Under an injunction, members of the named gangs cannot be seen with other members of that gang in that territory, which limits the freedom of the members under injunction. One hypothesis for the increased purity is that rivals and allies of these gangs do not want an injunction issued against them, which could have reduced gang activity and the number of members being seen together. This hypothesis would also explain the higher purity for α = 0.7 as opposed to the α = 0.9 that was consistently best over all years. If gang members are indeed being seen together less, we have to rely more on the geography of their stop locations to provide us with information. The z-Rand scores obtained for these two years, shown in Figure 11, also suggest that the clusters produced are of good quality.

Figure 10: Purity scores before and after injunction in 2006

Figure 11: Z-Rand scores before and after injunction in 2007

Figures 10 and 11 use α = 0.9 in the similarity equation A = αS + (1 − α)G of Section 3. To determine this value, we looked at the purity and z-Rand scores resulting from similarity matrices with different α from 0 to 1 in increments of 0.05. Interestingly, the highest purity and z-Rand scores were observed between 0.9 and 0.95. The fact that the highest purity and z-Rand scores were obtained at the same α suggests that our clustering is quite meaningful.

Figure 12: Z-Rand Score for α

Figure 13: Purity Score for α

4.1 Optimizing α and σ

As mentioned before, the value of α dictates the relative contribution of the geographical and social information. Because the best performing α, namely 0.9, is so close to 1, which corresponds to purely social information, one might conjecture that geographic information is relatively insignificant. However, the graphs above display a large drop in both purity and z-Rand scores when α = 1. In other words, giving more weight to social information may produce a better clustering, but the social component alone is not sufficient.

Next, we look at the scaling parameter σ of the geographic matrix, which we defined as

$$\sigma_i = d(x_i, x_K),$$

where $x_K$ is the K-th nearest neighbor of the point $x_i$. We chose K = 7 because in general K = 7 produced the best purity and z-Rand scores. Figure 14 compares the z-Rand scores of the similarity matrix from the 2006 data for different α and σ values, while Figure 15 examines the purity scores. In the figures, α is incremented by 0.05. Although the optimal σ value changes every year, we found that it usually ranges between 0 and 15, depending on the size of the data. Larger σ's were also explored, up to 200 nearest neighbors. Results past 25 nearest neighbors yielded a negative change. The negative change occurs when the chosen nearest neighbor is too far away, which makes the distance between the points uninformative and produces a Gaussian neighborhood that includes many points that may not be similar at all.


Figure 14: Z-Rand Score for Social Information (α) vs. σ

Figure 15: Purity Score for Social Information (α) vs. σ

The following figure compares z-Rand scores of the similarity matrix of the data over all years.

Figure 16: Z-Rand score for Social Information (α) vs. σ

Figure 17: Purity score for Social Information (α) vs. σ

5 Results

5.1 Evaluation of Clustering

The three measures that we primarily use to evaluate the quality of our clusterings are purity, z-Rand score, and NMI.

5.1.1 Purity Score

The purity score is obtained by assigning each cluster to the class that appears most frequently in the cluster, counting the number of correctly assigned points, and then dividing by N, the total number of points. In our case, purity measures the percentage of correctly classified individuals. Formally, purity is written as

$$\mathrm{Purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|,$$

where Ω = {ω_k} is the set of clusters and C = {c_j} is the set of classes. Bad clusterings give a purity score close to 0, while a perfect clustering yields a purity of 1 [?].
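As a worked illustration of the definition, the following sketch computes purity from two label lists. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    n = len(clusters)
    correct = 0
    for k in set(clusters):
        members = [c for w, c in zip(clusters, classes) if w == k]
        correct += Counter(members).most_common(1)[0][1]   # size of the majority class
    return correct / n

# Six individuals, two clusters, two gangs "a" and "b".
clusters = [0, 0, 0, 1, 1, 1]
gangs = ["a", "a", "b", "b", "b", "a"]
score = purity(clusters, gangs)
```

Here each cluster contains two members of its majority gang, so 4 of the 6 individuals count as correctly assigned. Putting every individual in a separate cluster would give a purity of 1, which is the bias toward many clusters that motivates the additional metrics.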


5.1.2 Normalized Mutual Information

Although the purity score is useful, it tends to be high when the number of clusters is large. Hence, we also use normalized mutual information (NMI), one of the more unforgiving metrics. It is unforgiving in the sense that it tells us not only how much information the two clusterings share but also how random the clusters produced by our method are. Formally, NMI is defined as

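$$\mathrm{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}$$

where

$$I(\Omega; C) = \sum_k \sum_j P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k) P(c_j)} = \sum_k \sum_j \frac{|\omega_k \cap c_j|}{N} \log \frac{N |\omega_k \cap c_j|}{|\omega_k| |c_j|}$$

$$H(\Omega) = -\sum_k P(\omega_k) \log P(\omega_k) = -\sum_k \frac{|\omega_k|}{N} \log \frac{|\omega_k|}{N}.$$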

Here I(Ω;C) is the mutual information, which measures how much the two assignments agree; I(Ω;C) = 0 implies that the clustering is random with respect to the classes. H, on the other hand, denotes entropy. By normalizing the mutual information by [H(Ω) + H(C)]/2, we avoid the bias of the purity score toward large numbers of clusters, since entropy increases with the number of clusters. The normalization also keeps NMI between 0 and 1 [?].
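The NMI formulas translate directly into code that works on counts. The sketch below uses hypothetical function names and assumes that at least one of the two labelings has more than one group, so that the denominator is non-zero.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_k (|w_k| / N) * log(|w_k| / N)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """I = sum_kj (|w_k and c_j| / N) * log(N * |w_k and c_j| / (|w_k| * |c_j|))."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((n_kj / n) * math.log(n * n_kj / (count_a[k] * count_b[j]))
               for (k, j), n_kj in joint.items())

def nmi(a, b):
    """Mutual information normalized by the mean of the two entropies."""
    return mutual_information(a, b) / ((entropy(a) + entropy(b)) / 2.0)

identical = nmi([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, labels renamed
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # partitions share no information
```

The first pair describes the same partition under different label names and attains the maximum value, while the second pair is statistically independent and scores zero.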

5.1.3 z-Rand Score

The third evaluation metric that we use is the z-Rand score. To define it, we first introduce the pair-counting quantity w11, the number of pairs of individuals that belong both to the same cluster produced by k-means and to the same gang according to the FI card entries. The z-Rand score is then the number of standard deviations by which w11 is removed from its mean value under a hypergeometric distribution [7]. In contrast to the purity score, the z-Rand score measures correctly identified pairs. Intuitively, the z-Rand score tells us how far from random our clustering result is: the smaller the z-Rand score, the more random and less meaningful our clusters are, while a larger z-Rand score indicates that our clustering result is farther from random and therefore more meaningful.
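A possible implementation of this score is sketched below. The text defers the details to [7]; the closed-form mean and variance used here are the ones commonly quoted for pair counting under random relabeling with fixed group sizes, and should be read as an assumption rather than as the exact routine used in the project. The function name is hypothetical, and the formula requires at least four individuals.

```python
import math
from collections import Counter

def z_rand(clusters, classes):
    """Standardized count of pairs grouped together in both partitions (needs n >= 4)."""
    n = len(clusters)
    M = n * (n - 1) // 2                                   # total number of pairs

    def pairs(counts):
        return sum(c * (c - 1) // 2 for c in counts)

    sizes_1 = list(Counter(clusters).values())
    sizes_2 = list(Counter(classes).values())
    M1, M2 = pairs(sizes_1), pairs(sizes_2)                # same-group pairs in each partition
    w11 = pairs(Counter(zip(clusters, classes)).values())  # pairs together in both

    C1 = n * (n * n - 3 * n - 2) - 8 * (n + 1) * M1 + 4 * sum(c ** 3 for c in sizes_1)
    C2 = n * (n * n - 3 * n - 2) - 8 * (n + 1) * M2 + 4 * sum(c ** 3 for c in sizes_2)
    a1, a2 = (4 * M1 - 2 * M) ** 2, (4 * M2 - 2 * M) ** 2
    variance = (M / 16.0
                - a1 * a2 / (256.0 * M * M)
                + C1 * C2 / (16.0 * n * (n - 1) * (n - 2))
                + (a1 - 4 * C1 - 4 * M) * (a2 - 4 * C2 - 4 * M)
                / (64.0 * n * (n - 1) * (n - 2) * (n - 3)))
    return (w11 - M1 * M2 / M) / math.sqrt(variance)

agree = z_rand([0, 0, 1, 1], [0, 0, 1, 1])      # identical partitions
disagree = z_rand([0, 0, 1, 1], [0, 1, 0, 1])   # no pair kept together
```

For four individuals split into two pairs, the expected value of w11 is 2/3 with variance 8/9, which can be confirmed by listing the three possible pairings. Identical partitions therefore score √2 and fully crossed partitions score −1/√2.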

5.2 Comparison of Point-Set Distances

Observation

• Modified Hausdorff distance yields higher purity scores than Normal Hausdorff distance

• Social information complements geographical information regardless of the social and geographical similarity functions

• For the same directed distance dj , the descending order of purity scores is f1 > f3 > f4 > f2

• For the same symmetrizing function fi, the descending order of purity scores is d1, d6 > d2, d3, d4, d5


Figure 18: Comparison of Symmetrizing Functions with Binary Model

Figure 19: Comparison of Symmetrizing Functions with Logarithmic Model


Figure 20: Comparison of Directed Distances with Binary Model

Figure 21: Comparison of Directed Distances with Logarithmic Model


Figure 22: Comparison of 4 Best Point-Set Distances

Figure 23: Comparison of Symmetrizing Functions

Directed    Symmetrizing function
distance    f1      f2      f3      f4
d1          0.6181  0.6142  0.6181  0.6172
d2          0.6206  0.5803  0.5875  0.5825
d3          0.6121  0.5774  0.5880  0.5829
d4          0.6189  0.5774  0.5930  0.5816
d5          0.6151  0.5795  0.5854  0.5812
d6          0.6189  0.6032  0.6168  0.6104

Table 1: Maximum purity scores for 24 point-set distances and binary model

Directed    Symmetrizing function
distance    f1      f2      f3      f4
d1          0.6210  0.6193  0.6176  0.6206
d2          0.6117  0.5757  0.5837  0.5727
d3          0.6189  0.5651  0.5795  0.5774
d4          0.6155  0.5740  0.5850  0.5884
d5          0.6168  0.5791  0.5901  0.5812
d6          0.6202  0.5922  0.6219  0.6087

Table 2: Maximum purity scores for 24 point-set distances and logarithmic model


5.3 Comparison of Lp Distances

Observation:

• Purity scores for p < 1 are significantly lower than ones for p ≥ 1

• Purity scores for p ∈ {1, 2, 3} are almost the same

• Purity scores for α = 1 are the lowest

– This implies the dominance of the geographical information over the social information due to the sparsity of the latter

• For the binary model, the purity score increases with α, reaching the maximum around 0.92

• For the logarithmic model, the purity score reaches the maximum around 0.63

• The logarithmic model does not yield higher maximum purity score than the binary model

Figure 24: Comparison of Lp Distances for Hausdorff Distance

5.4 Geographical Distance

The geographical matrix is still built from the arithmetic means of the geographical coordinates of individuals, but using the geographical distance $d_G$ instead of the L2 distance. The following are the results obtained from this geographic matrix; refer to Figures 25 and 26. The geographic distance created at this point has only a few points defined as passages. With only a few passages, the geographic distance may still represent the geography poorly, and therefore the distances produced between people may not be correct. This will be reflected in our results.

Figure 25: Eigenvectors 2, 3, 4, 5, 6, and 7 based on Geographic Distances between individuals' average locations.

Figure 26: Clustering Result with Geographic Distance [Average Location] (α = 0.9, Purity = 0.4763, z-Rand = 181.79)

Incorporating the geographic distance runs contrary to our expectations. Purity drops from 0.4879 to 0.4783 and the z-Rand score drops from 209.1098 to 181.79. The small drop in both metrics may be due to the limited number of passages that were included in the geographic distance. This result could also arise because we are using each individual's average location, which does not reflect the geography well either. Instead of using individuals' average locations and representing each individual as one point, we will use the set of their stop locations and compute distances between individuals' point sets.


5.5 Hausdorff Distance

Instead of using average locations, we use the Hausdorff distance, which measures how far two sets are from each other. In this case, it is defined by

$$d_H(O_i, O_j) = \max\left\{ \sup_{x \in O_i} \inf_{y \in O_j} d_{L_p}(x, y),\ \sup_{y \in O_j} \inf_{x \in O_i} d_{L_p}(x, y) \right\}.$$

Thereby, we got the following eigenvectors [Figure 27].

Figure 27: Eigenvector 2, 3, 4, 5, 6, and 7 based on Hausdorff Distance

In Figure 27, these eigenvectors display distinct hotspots that were not as distinct before. Eigenvector 4 shows the structure of the Big Hazard gang, represented by the green data points in the middle.

Figure 28: Clustering Result with Hausdorff Distance (α = 0.9, Purity = 0.6304, z-Rand = 402.65)


Adopting the Hausdorff distance resulted in a significant boost in our purity score. Instead of using a person's average location, the Hausdorff distance allows us to compare the sets of stop locations of two individuals. The average location does not take into account the true locations at which an individual was stopped, whereas the Hausdorff distance computes a distance directly between the sets of stop locations of the individuals.

5.6 Geographical Hausdorff Distance

The information in the geographic matrix should better reflect the geography and add to the social information. For the next experiment, we therefore replaced the Euclidean distance with the geographic distance $d_G$, so that the distance used for the geographic matrix is defined by

$$d_H(O_i, O_j) = \max\left\{ \sup_{x \in O_i} \inf_{y \in O_j} d_G(x, y),\ \sup_{y \in O_j} \inf_{x \in O_i} d_G(x, y) \right\}.$$

However, this brought down our purity score slightly. This small drop may again be due to the fact that we use a limited number of passages when computing the geographic distance between different regions. Figure 29 shows the eigenvectors based on the Geographic Hausdorff Distance, and Figure 30 shows the clustering result.

Figure 29: Eigenvector 2, 3, 4, 5, 6, and 7 based on Geographic Hausdorff Distance.


Figure 30: Clustering Result with Geographic Hausdorff Distance (α = 0.9, Purity = 0.6265, z-Rand = 373.58)

5.7 Modified Hausdorff Distance

The best result was produced using the geographic matrix built with the Modified Hausdorff distance.

Figure 31: Eigenvector 2, 3, 4, 5, 6, and 7 based on Geographic Hausdorff Distance.


Figure 32: Clustering Result with Geographic Hausdorff Distance (α = 0.9, Purity = 0.74121, z-Rand = 490.406)

α      NMI              Purity Score     Z-Rand Score
0      0.5503±0.0076    0.6275±0.0121    372.8460±25.7794
0.05   0.5517±0.0089    0.6341±0.0122    380.0820±22.5715
0.1    0.5517±0.0083    0.6269±0.0121    388.6803±24.0807
0.15   0.5522±0.0078    0.6308±0.0120    384.8915±24.8959
0.2    0.5557±0.0075    0.6330±0.0112    388.9322±23.4806
0.25   0.5555±0.0070    0.6478±0.0100    415.8372±25.0097
0.3    0.5583±0.0084    0.6462±0.0139    412.9919±24.2205
0.35   0.5578±0.0076    0.6429±0.0123    401.4752±28.3065
0.4    0.5609±0.0077    0.6495±0.0111    416.2250±28.8774
0.45   0.5741±0.0104    0.6533±0.0131    413.5694±30.1350
0.5    0.5684±0.0076    0.6604±0.0111    424.8999±25.7996
0.55   0.5847±0.0095    0.6648±0.0117    436.5042±19.2890
0.6    0.5903±0.0097    0.6780±0.0134    441.8239±24.3245
0.65   0.6035±0.0090    0.6934±0.0127    451.2703±20.4333
0.7    0.6184±0.0094    0.6857±0.0113    445.1817±19.0138
0.75   0.6300±0.0078    0.7060±0.0127    450.5551±17.4410
0.8    0.6477±0.0078    0.7159±0.0108    460.9598±17.6098
0.85   0.6677±0.0082    0.7225±0.0110    476.1493±21.3530
0.9    0.7018±0.0128    0.7445±0.0168    490.9406±26.2097
0.95   0.7147±0.0102    0.7484±0.0166    482.7018±21.1127
1      0.4233±0.0153    0.4681±0.0176    47.2456±4.9858

To obtain the results above, we use spectral clustering with the affinity matrix created from our binary social adjacency matrix and the geographic matrix with the Modified Hausdorff distance. For the local scaling of σ, the nearest neighbor K = 5 was used. We choose 31 eigenvectors of the graph Laplacian created from this affinity matrix (because of the 31 active gangs in the Hollenbeck district). Because the centroids are initialized randomly, we run k-means 50 times on these eigenvectors and take the maximum over all runs.

6 Conclusions

In this paper, we have applied the spectral clustering algorithm to an LAPD FI card data set concerning gang members in the policing division of Hollenbeck. In contrast to the REU groups of previous years, we used a different distance metric to compute the similarity between individuals and their stop locations. Instead of using the average location of individuals, we used the set of their stop locations and a Modified Hausdorff distance to compute the similarity between the sets. Based on sets of stop locations and social connections, which were determined by the other individuals who were present at each stop, we clustered all individuals into groups, which we interpret as gang affiliations. We showed that sufficient social connections and a newly defined geographic matrix based on the similarity of sets of stop locations lead to a clustering which is approximately 75% pure compared to the ground truth gang affiliations. While giving more weight to the social component produced the optimal result, it became apparent that geography plays a huge role in our clustering algorithm, since purity drops significantly with social information alone. We also found that using each individual's set of stop locations, as opposed to an average location, improves the clustering results. In this paper, we also investigated clustering for the years 2006 and 2007. The gang White Fence was issued an injunction in 2006 and the gangs Clover, Eastlake and Lincoln Heights were issued an injunction in 2007. Sociologically, we would expect an injunction to hinder activities between the gangs under injunction and both their allies and rivals, possibly leading to an improvement in clustering results. Clusters for 2006 were about 77% pure while the clusters for 2007 were about 82% pure. This may suggest that injunctions have an impact on our clusters. This area will be investigated further in future studies.

Future studies will also investigate different methods, including the multislice method of [5] and the graph p-Laplacian spectral clustering method of [4]. Future studies will also further investigate how temporal data affects clustering.

7 MATLAB and C++ codes

The code for the project was written in both MATLAB and C++. The set-up of the affinity matrix was written in MATLAB because of the vectorization over the parameters and the easy storage of matrices in cell arrays. Depending on the discretization of α, say n + 1 values, and of σ, say m + 1 values, a cell array of size (n + 1) × (m + 1) was initialized. The code for the local scaling of σ was written in C++, was provided by Zelnik-Manor and Perona, and can be found at http://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html. The code for the set-up of the graph Laplacian and for spectral clustering was written in MATLAB.

All distances used to compute the geographic affinity matrices were written in MATLAB. For heavy computations, our algorithms were run in parallel on the Hoffman2 cluster at UCLA.


8 Acknowledgments

We would like to express our greatest gratitude to our mentors, Dr. Blake Hunter and Dr. Theodore Kolokolnikov, who have helped and supported us throughout our project. We are grateful to the UCLA REU group from last year (Anna Ma, Daniel Moyer, Brandon Schneiderman, and Ryan de Vera) for their assistance with our research. We also thank Matthew Valasik, UCI criminologist, for useful advice on gang activities. Last but not least, we would like to thank Dr. Andrea Bertozzi, who organized the REU summer program and made all of this possible.


References

[1] M-P Dubuisson and Anil K Jain. A modified Hausdorff distance for object matching. In Pattern Recognition, 1994. Vol. 1-Conference A: Computer Vision & Image Processing., Proceedings of the 12th IAPR International Conference on, volume 1, pages 566–568. IEEE, 1994.

[2] J. A. Hartigan and M. A. Wong. A k-means clustering algorithm. JSTOR: Applied Statistics, 28(1):100–108, 1979.

[3] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[4] Dijun Luo, Heng Huang, Chris Ding, and Feiping Nie. On the eigenvectors of p-Laplacian. Machine Learning, 81(1):37–51, 2010.

[5] Peter J Mucha, Thomas Richardson, Kevin Macon, Mason A Porter, and Jukka-Pekka Onnela. Community structure in time-dependent, multiscale, and multiplex networks. Science, 328(5980):876–878, 2010.

[6] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[7] Y. van Gennip, B. Hunter, A. Ma, D. Moyer, R. de Vera, and A. L. Bertozzi. Record matching incomplete and noisy data. To appear, 2013.

[8] Yves van Gennip, Blake Hunter, Raymond Ahn, Peter Elliott, Kyle Luh, Megan Halvorson, Shannon Reid, Matthew Valasik, James Wo, George E Tita, et al. Community detection using spectral clustering on sparse geosocial data. SIAM Journal on Applied Mathematics, 73(1):67–83, 2013.

