Community Discovery in Social Network

Yunming Ye

Department of Computer Science

Shenzhen Graduate School

Harbin Institute of Technology

Agenda

Introduction to Social Network and Community Discovery

Classical Community Discovery Algorithms

Hot Research Issues

Introduction to Social Network and Community Discovery

Studies on Networks

Lots of "networked" data:

Technological networks: power grid, road networks

Biological networks: food web, protein networks

Social networks: collaboration networks, friendships

Language networks: semantic networks

Studies on Networks

Social networks: QQ, Kaixin, Renren, Facebook, email, Twitter, co-citation, blogs

Community

A property that seems to be common to many networks is community structure.

Community: The division of network nodes into groups within which the network connections are dense, but between which they are sparser.

Subjectivity of Community Definition

Each component is a community

A densely-knit community

Definition of a community can be subjective.

(unsupervised learning)

Community Detection

Community detection: find the community structure of a social network.

Community detection is important:

Identifying modules and their boundaries allows for a classification of vertices, according to their structural position in the modules.

Community Detection

Public opinion monitoring

Commodity recommendation

Network optimization

Network security

Epidemic monitoring

Classical Community Discovery Algorithms

Clustering based on Vertex Similarity

Apply k-means or similarity-based clustering to nodes.

Vertex similarity is defined in terms of the similarity of the vertices' neighborhoods.

Structural equivalence: two nodes are structurally equivalent iff they connect to the same set of actors.

Structural equivalence is too restrictive for practical use.

Nodes 1 and 3 are structurally equivalent; so are nodes 5 and 6.

Vertex Similarity

Jaccard Similarity

Cosine similarity
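Both measures compare the neighborhoods N_i and N_j of two vertices. In their usual neighborhood form (an assumption here, since the slide's formulas are not reproduced in this text), Jaccard(v_i, v_j) = |N_i ∩ N_j| / |N_i ∪ N_j| and Cosine(v_i, v_j) = |N_i ∩ N_j| / sqrt(|N_i| · |N_j|). A minimal Python sketch follows; the 9-node graph `GRAPH` and the function names are hypothetical, chosen only for illustration.

```python
from math import sqrt

# Hypothetical toy graph (node -> set of neighbours), assumed for illustration.
GRAPH = {
    1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4},
    4: {1, 3, 5, 6}, 5: {4, 6, 7, 8}, 6: {4, 5, 7, 8},
    7: {5, 6, 8, 9}, 8: {5, 6, 7}, 9: {7},
}


def jaccard(graph, u, v):
    """Shared neighbours divided by all neighbours of either node."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def cosine(graph, u, v):
    """Shared neighbours divided by the geometric mean of the two degrees."""
    if not graph[u] or not graph[v]:
        return 0.0
    return len(graph[u] & graph[v]) / sqrt(len(graph[u]) * len(graph[v]))


if __name__ == "__main__":
    print(jaccard(GRAPH, 4, 6), cosine(GRAPH, 4, 6))
    print(jaccard(GRAPH, 5, 6), cosine(GRAPH, 5, 6))
```

In this toy graph, nodes 4 and 6 share only node 5 as a neighbor, so their similarity is low, while nodes 5 and 6 share three of their four neighbors and score much higher. The resulting pairwise similarities can then be fed to k-means or any similarity-based clustering.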

Linkage Clustering

Illustration of three cluster-to-cluster dissimilarity criteria. R and S are two clusters, and N_R and N_S are their sizes. r_i ∈ R and s_j ∈ S are the i-th and j-th objects in clusters R and S, respectively.
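The slide's figure is not reproduced in this text, so the sketch below assumes the three criteria are the usual single, complete, and average linkage. The function names and the 1-D example points are hypothetical.

```python
from itertools import product


def pairwise(R, S, dist):
    """All N_R * N_S distances between an object of R and an object of S."""
    return [dist(r, s) for r, s in product(R, S)]


def single_link(R, S, dist):
    """Distance between the closest pair (r_i, s_j)."""
    return min(pairwise(R, S, dist))


def complete_link(R, S, dist):
    """Distance between the farthest pair (r_i, s_j)."""
    return max(pairwise(R, S, dist))


def average_link(R, S, dist):
    """Mean of all pairwise distances, i.e. the sum divided by N_R * N_S."""
    return sum(pairwise(R, S, dist)) / (len(R) * len(S))


if __name__ == "__main__":
    R, S = [0.0, 1.0], [3.0, 5.0]          # hypothetical 1-D clusters
    d = lambda a, b: abs(a - b)
    print(single_link(R, S, d), complete_link(R, S, d), average_link(R, S, d))
```

For vertices, `dist` could be one minus a neighborhood similarity; that choice is also an assumption rather than something the slides specify.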

Greedy on Similarity

Merge the pair whose distance is minimum (i.e., the most similar pair).

The number of partitions found during the procedure is n, each with a different number of clusters, from n to 1.

At each iteration step, one needs to compute the variation ΔQ of modularity produced by merging any two communities of the current partition, so that one can choose the best merger.

CNM algorithm

Clauset, Newman, and Moore (CNM algorithm)

Finding community structure in very large networks. A. Clauset, M. E. J. Newman, C. Moore. Physical Review E, 2004. Cited 351 times.

The idea of CNM is based on the greedy optimization of the quantity known as modularity

CNM is an agglomerative hierarchical method.

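Modularity Maximization

Modularity measures the strength of a community partition by taking into account the degree distribution.

Given a network with m edges, the expected number of edges between two nodes with degrees d_i and d_j is

$$\frac{d_i d_j}{2m}$$

Strength of a community C:

$$\sum_{i \in C,\, j \in C} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$$

Modularity of a partition into communities C_1, ..., C_k:

$$Q = \frac{1}{2m} \sum_{\ell=1}^{k} \sum_{i \in C_\ell,\, j \in C_\ell} \left( A_{ij} - \frac{d_i d_j}{2m} \right)$$

where A_{ij} is 1 if nodes i and j are connected and 0 otherwise.

A larger value indicates a better community structure.

Given the degree distribution, the expected number of edges between nodes 1 and 2 is 3 × 2 / (2 × 14) = 3/14.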
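The arithmetic above can be checked with a short Python sketch. It assumes the standard definition Q = (1/2m) Σ_C Σ_{i,j ∈ C} (A_ij − d_i d_j / 2m), with A_ij = 1 when i and j are connected. The edge list `EDGES` is a hypothetical 9-node graph, picked only because it matches the quoted figures (14 edges, degrees 3 and 2 for nodes 1 and 2); it is not taken from the slides.

```python
from fractions import Fraction

# Hypothetical edge list, assumed for illustration (m = 14 edges).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def degrees(edges):
    """Degree of every node that appears in the edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


def expected_edges(edges, i, j):
    """Expected number of edges between i and j: d_i * d_j / (2m)."""
    deg = degrees(edges)
    return Fraction(deg[i] * deg[j], 2 * len(edges))


def modularity(edges, communities):
    """Q summed over ordered node pairs inside each community."""
    m, deg = len(edges), degrees(edges)
    adjacent = {frozenset(e) for e in edges}
    total = Fraction(0)
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1 if frozenset((i, j)) in adjacent else 0
                total += a_ij - Fraction(deg[i] * deg[j], 2 * m)
    return total / (2 * m)


if __name__ == "__main__":
    print(expected_edges(EDGES, 1, 2))
    print(modularity(EDGES, [{1, 2, 3, 4}, {5, 6, 7, 8, 9}]))
    print(modularity(EDGES, [set(range(1, 10))]))
```

Exact fractions are used so the result can be compared with the hand calculation. Placing every node in a single community gives Q = 0, which is why a partition is only interesting when its Q is clearly positive.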

CNM

We view every single node as a community initially.

We repeatedly join together the two communities whose amalgamation produces the largest increase in Q.

For a network of n vertices, after n − 1 such joins we are left with a single community and the algorithm stops.

The entire process can be represented as a tree whose leaves are the vertices of the original network and whose internal nodes correspond to the joins.
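The joining procedure can be sketched in Python as follows. This is a deliberately naive version that recomputes Q for every candidate merger; it does not use the data structures that make the actual CNM algorithm efficient on very large networks. The function names and the example graph (two triangles joined by one edge) are hypothetical.

```python
from itertools import combinations


def modularity(edges, communities):
    """Q = sum over communities of e_c/m - (d_c/2m)^2."""
    m = len(edges)
    label = {v: k for k, c in enumerate(communities) for v in c}
    inside = [0] * len(communities)   # edges with both ends in community k
    degree = [0] * len(communities)   # total degree of community k
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            inside[label[u]] += 1
    return sum(e / m - (d / (2 * m)) ** 2 for e, d in zip(inside, degree))


def greedy_merge(edges):
    """Return the n partitions visited, each paired with its modularity."""
    nodes = sorted({v for e in edges for v in e})
    communities = [frozenset([v]) for v in nodes]   # one node per community
    history = [(list(communities), modularity(edges, communities))]
    while len(communities) > 1:
        best = None
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, merged)
            if best is None or q > best[1]:
                best = (merged, q)                  # largest increase in Q
        communities = best[0]
        history.append((list(communities), best[1]))
    return history


if __name__ == "__main__":
    # Hypothetical graph: two triangles connected by the edge (2, 3).
    edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
    partition, q = max(greedy_merge(edges), key=lambda step: step[1])
    print(sorted(sorted(c) for c in partition), round(q, 3))
```

The returned history plays the role of the tree of joins: reading it from first to last gives the n partitions from n communities down to one, and the partition with the largest Q is the natural place to cut.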

CNM

The dendrogram represents a hierarchical decomposition of the network into communities at all levels.

CNM algorithm

It is observed that merging communities of unbalanced sizes has a great impact on the computational efficiency of CNM.

Results

Girvan and Newman Method

Among the hierarchical methods, the algorithm of Girvan and Newman (Girvan & Newman 2002) presents an important improvement.

Community structure in social and biological networks. M. Girvan, M. E. J. Newman. Proceedings of the National Academy of Sciences, 2002. Cited 1302 times.

GN method is a divisive hierarchical method.

Edge Betweenness

The strength of a tie can be measured by edge betweenness

Edge betweenness: the number of shortest paths that pass along the edge.

The edge betweenness of e(1, 2) is 4 (= 6/2 + 1): all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to pass through either e(1, 2) or e(2, 3), and e(1, 2) is itself the shortest path between 1 and 2.

Edge Betweenness

Girvan and Newman use a metric called edge betweenness, a measure that favors edges that lie between communities and disfavors those that lie inside communities.

Edge Betweenness

Define the edge betweenness of an edge as the number of shortest paths between pairs of vertices that run along it.

If there is more than one shortest path between a pair of vertices, each path is given equal weight such that the total weight of all the paths is unity.

If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges.

Edge Betweenness

Thus, the edges connecting communities will have high edge betweenness.

By removing these edges, we separate groups from one another and so reveal the underlying community structure of the graph.

Procedure

The algorithm is stated as follows:

1. Calculate the betweenness for all edges in the network.

2. Remove the edge with the highest betweenness.

3. Recalculate betweennesses for all edges affected by the removal.

4. Repeat from step 2 until no edges remain.
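The four steps above can be sketched in Python as follows. Betweenness is computed with a breadth-first accumulation in the style of Brandes's algorithm, which is a choice made for this sketch. For simplicity the sketch recomputes betweenness for all remaining edges after each removal, not only the affected ones. The edge list `EDGES` is a hypothetical 9-node graph assumed for illustration, and all function names are likewise hypothetical.

```python
from collections import deque

# Hypothetical 9-node graph, assumed for illustration.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_betweenness(adj):
    """Shortest-path count per edge; tied paths share a total weight of 1."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for source in adj:
        dist, order = {source: 0}, []
        sigma = {v: 0.0 for v in adj}      # number of shortest paths
        sigma[source] = 1.0
        preds = {v: [] for v in adj}
        queue = deque([source])
        while queue:                       # BFS from the source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):          # push path weight back to source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return {e: s / 2.0 for e, s in score.items()}  # each pair was seen twice


def components(adj):
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(frozenset(part))
    return parts


def girvan_newman(edges):
    """Return the successive partitions produced as edges are removed."""
    adj = build_adjacency(edges)
    splits = [components(adj)]
    while any(adj.values()):                       # step 4: until no edges
        score = edge_betweenness(adj)              # steps 1 and 3
        u, v = max(score, key=score.get)           # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        parts = components(adj)
        if len(parts) > len(splits[-1]):
            splits.append(parts)
    return splits


if __name__ == "__main__":
    print(edge_betweenness(build_adjacency(EDGES))[frozenset((1, 2))])
    print(sorted(sorted(c) for c in girvan_newman(EDGES)[1]))
```

On this assumed graph the betweenness of e(1, 2) comes out as 4, and the first split separates {1, 2, 3, 4} from {5, 6, 7, 8, 9}. The list of splits is the divisive counterpart of a dendrogram, ending with every node in its own community.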

Divisive clustering based on edge betweenness

After removing e(4, 5), the betweenness of e(4, 6) becomes 20, which is the highest.

After removing e(4, 6), the edge e(7, 9) has the highest betweenness value, 4, and should be removed.

Initial betweenness value

Idea: progressively remove the edges with the highest betweenness.

Procedure

Hot Directions

Discovery of Overlapping Communities

Incremental algorithm

Topic-sensitive Community Discovery

Local Community Discovery

Community Discovery in Multi-relational Network

Q&A

Thanks!
