Community Discovery in Social Network
Yunming Ye
Department of Computer Science
Shenzhen Graduate School
Harbin Institute of Technology
Agenda
Introduction to Social Network and Community Discovery
Classical Community Discovery Algorithms
Hot Research Issues
Introduction to Social Network and Community Discovery
Studies on Networks
Lots of “networked” data!
Technological networks: power grid, road networks
Biological networks: food web, protein networks
Social networks: collaboration networks, friendships
Language networks: semantic networks
Studies on Networks
Social networks: QQ, Kaixin, Renren, Facebook, Email, Twitter, Co-citation, Blog
Community
A property that seems to be common to many networks is community structure.
Community: the division of network nodes into groups within which the network connections are dense, but between which they are sparser.
Subjectivity of Community Definition
Each component is a community
A densely-knit community
Definition of a community can be subjective.
(unsupervised learning)
Community Detection
Community detection: find the community structure of the social network.
Community detection is important: identifying modules and their boundaries allows for a classification of vertices according to their structural position in the modules.
Community Detection
Public opinion monitoring
Commodity recommendation
Network optimization
Network security
Epidemic monitoring
Classical Community Discovery Algorithms
Clustering based on Vertex Similarity
Apply k-means or similarity-based clustering to nodes.
Vertex similarity is defined in terms of the similarity of their neighborhoods.
Structural equivalence: two nodes are structurally equivalent iff they are connected to the same set of actors.
Structural equivalence is too restrictive for practical use.
Nodes 1 and 3 are structurally equivalent; so are nodes 5 and 6.
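The check for structural equivalence can be sketched in a few lines. The 9-node edge list below is an assumption: the slide's figure is not part of this transcript, so the list was chosen only to agree with the statements quoted here. Ignoring the two nodes themselves when comparing neighbor sets is also an assumption about how the definition is applied.

```python
# Hypothetical edge list (an assumption; the original figure is not available).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    """Map every node to the set of its neighbors (undirected graph)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def structurally_equivalent(adj, u, v):
    """True if u and v link to the same other nodes (u and v themselves ignored)."""
    return adj[u] - {v} == adj[v] - {u}


ADJ = build_adjacency(EDGES)
print(structurally_equivalent(ADJ, 1, 3))
print(structurally_equivalent(ADJ, 5, 6))
print(structurally_equivalent(ADJ, 1, 2))
```

On this assumed graph the first two checks hold and the third does not, which illustrates how rarely exact equivalence occurs even in a tiny network.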
Vertex Similarity
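Jaccard similarity: $\mathrm{Jaccard}(v_i, v_j) = |N_i \cap N_j| / |N_i \cup N_j|$, where $N_i$ is the set of neighbors of $v_i$
Cosine similarity: $\mathrm{Cosine}(v_i, v_j) = |N_i \cap N_j| / \sqrt{|N_i| \cdot |N_j|}$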
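A minimal sketch of both measures, assuming each is computed on the neighbor sets of the two vertices (intersection over union for Jaccard, intersection over the geometric mean of the set sizes for cosine). The neighbor sets used below are hypothetical and serve only as an illustration.

```python
import math


def jaccard_similarity(n_u, n_v):
    """|N_u & N_v| / |N_u | N_v|; defined as 0.0 when both sets are empty."""
    union = n_u | n_v
    return len(n_u & n_v) / len(union) if union else 0.0


def cosine_similarity(n_u, n_v):
    """|N_u & N_v| / sqrt(|N_u| * |N_v|); defined as 0.0 if either set is empty."""
    if not n_u or not n_v:
        return 0.0
    return len(n_u & n_v) / math.sqrt(len(n_u) * len(n_v))


# Hypothetical neighbor sets of two vertices (an assumption for illustration).
NEIGHBORS = {4: {1, 3, 5, 6}, 6: {4, 5, 7, 8}}

print(jaccard_similarity(NEIGHBORS[4], NEIGHBORS[6]))  # one shared neighbor out of seven
print(cosine_similarity(NEIGHBORS[4], NEIGHBORS[6]))   # one shared neighbor over sqrt(4 * 4)
```

Unlike structural equivalence, both measures are graded, so nodes with partly overlapping neighborhoods still receive a nonzero similarity.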
Linkage Clustering
The illustration of three cluster-to-cluster dissimilarity criteria. R and S are two clusters, and $N_R$ and $N_S$ are the sizes of these two clusters. $r_i \in R$ and $s_j \in S$ are the i-th and j-th objects in clusters R and S, respectively.
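The transcript does not name the three criteria. Assuming they are the usual single, complete, and average linkage, a sketch looks as follows; the function names and the 1-D example data are illustrative, not taken from the slides.

```python
from itertools import product


def pairwise_distances(R, S, dist):
    """Distances between every object r_i in R and every object s_j in S."""
    return [dist(r, s) for r, s in product(R, S)]


def single_linkage(R, S, dist):
    """Dissimilarity of the closest pair."""
    return min(pairwise_distances(R, S, dist))


def complete_linkage(R, S, dist):
    """Dissimilarity of the farthest pair."""
    return max(pairwise_distances(R, S, dist))


def average_linkage(R, S, dist):
    """Mean dissimilarity over all N_R * N_S pairs."""
    return sum(pairwise_distances(R, S, dist)) / (len(R) * len(S))


# Hypothetical 1-D clusters with absolute difference as the distance.
R, S = [0, 1], [3, 5]
absolute_difference = lambda a, b: abs(a - b)
print(single_linkage(R, S, absolute_difference))
print(complete_linkage(R, S, absolute_difference))
print(average_linkage(R, S, absolute_difference))
```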
Greedy on Similarity
Merge the pair of clusters whose distance is minimal (i.e. the most similar pair).
The number of partitions found during the procedure is n, each with a different number of clusters, from n to 1.
At each iteration step, one needs to compute the variation ΔQ of modularity given by the merger of any two communities of the current partition, so that one can choose the best merger.
CNM algorithm
Clauset, Newman, and Moore (CNM algorithm)
A. Clauset, M. E. J. Newman, C. Moore. Finding community structure in very large networks. Physical Review E, 2004. Cited 351 times.
The idea of CNM is based on the greedy optimization of the quantity known as modularity.
CNM is an agglomerative hierarchical method.
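Modularity Maximization
Modularity measures the strength of a community partition by taking into account the degree distribution.
Given a network with m edges, the expected number of edges between two nodes with degrees $d_i$ and $d_j$ is $d_i d_j / (2m)$.
Strength of a community $C$: $\sum_{i \in C,\, j \in C} \left( A_{ij} - d_i d_j / (2m) \right)$, where $A_{ij}$ is 1 if nodes i and j are connected and 0 otherwise.
Modularity: $Q = \frac{1}{2m} \sum_{C} \sum_{i \in C,\, j \in C} \left( A_{ij} - d_i d_j / (2m) \right)$
A larger value indicates a better community structure.
Given the degree distribution, the expected number of edges between nodes 1 and 2 is 3*2 / (2*14) = 3/14.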
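A minimal sketch of these quantities, assuming the standard definition of modularity: for each community, sum A_ij - d_i d_j / (2m) over all ordered node pairs inside it, then divide the total by 2m. The two-triangle test graph is hypothetical.

```python
from collections import defaultdict


def expected_edges(d_i, d_j, m):
    """Expected number of edges between two nodes of degree d_i and d_j."""
    return d_i * d_j / (2 * m)


def modularity(edges, communities):
    """Q = 1/(2m) * sum over communities of sum_{i,j in C} (A_ij - d_i*d_j/(2m))."""
    m = len(edges)
    degree = defaultdict(int)
    linked = set()  # ordered pairs (i, j) with A_ij = 1
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        linked.add((u, v))
        linked.add((v, u))
    total = 0.0
    for community in communities:
        for i in community:
            for j in community:
                a_ij = 1 if (i, j) in linked else 0
                total += a_ij - expected_edges(degree[i], degree[j], m)
    return total / (2 * m)


# Hypothetical graph: two triangles joined by the single edge (3, 4).
TWO_TRIANGLES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

print(expected_edges(3, 2, 14))                              # the 3/14 example above
print(modularity(TWO_TRIANGLES, [{1, 2, 3}, {4, 5, 6}]))     # the natural split
print(modularity(TWO_TRIANGLES, [{1, 2, 3, 4, 5, 6}]))       # everything in one community
```

Splitting along the bridge scores higher than keeping all nodes together, which is the behavior the measure is meant to reward. The double loop is quadratic in community size and is written for clarity, not speed.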
CNM
We view every single node as a community initially.
We repeatedly join together the two communities whose amalgamation produces the largest increase in Q.
For a network of n vertices, after n − 1 such joins we are left with a single community and the algorithm stops.
The entire process can be represented as a tree whose leaves are the vertices of the original network and whose internal nodes correspond to the joins.
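The joining process can be sketched naively as below. This is not the CNM implementation itself: the published algorithm relies on dedicated data structures for speed, while this sketch recomputes everything from scratch and is only meant to show the greedy rule. The function name, the restriction to pairs of communities that share at least one edge, and the test graph are assumptions.

```python
from itertools import combinations


def greedy_modularity_merge(edges):
    """Return [(Q, partition), ...] for every level, from n communities down to 1."""
    m = len(edges)
    nodes = sorted({x for e in edges for x in e})
    degree = {v: 0 for v in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def q_of(parts):
        # Q = sum over communities of (internal edges / m) - (degree sum / 2m)^2
        q = 0.0
        for c in parts:
            internal = sum(1 for u, v in edges if u in c and v in c)
            q += internal / m - (sum(degree[x] for x in c) / (2 * m)) ** 2
        return q

    communities = [{v} for v in nodes]  # every node starts as its own community
    history = [(q_of(communities), [set(c) for c in communities])]
    while len(communities) > 1:
        best = None
        for a, b in combinations(range(len(communities)), 2):
            ca, cb = communities[a], communities[b]
            between = sum(1 for u, v in edges
                          if (u in ca and v in cb) or (u in cb and v in ca))
            if between == 0:
                continue  # only communities joined by an edge are candidates
            d_a = sum(degree[x] for x in ca)
            d_b = sum(degree[x] for x in cb)
            delta_q = between / m - d_a * d_b / (2 * m * m)
            if best is None or delta_q > best[0]:
                best = (delta_q, a, b)
        if best is None:
            break  # disconnected graph: nothing left to join
        _, a, b = best
        merged = communities[a] | communities[b]
        communities = [c for k, c in enumerate(communities) if k not in (a, b)]
        communities.append(merged)
        history.append((q_of(communities), [set(c) for c in communities]))
    return history


# Hypothetical graph: two triangles joined by the single edge (3, 4).
TWO_TRIANGLES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
HISTORY = greedy_modularity_merge(TWO_TRIANGLES)
BEST_Q, BEST_PARTITION = max(HISTORY, key=lambda level: level[0])
print(BEST_Q, BEST_PARTITION)
```

The returned list corresponds to the levels of the tree, one per number of communities; picking the level with the largest Q selects where to cut it.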
CNM
The dendrogram represents a hierarchical decomposition of the network into communities at all levels.
CNM algorithm
It is observed that merging communities of unbalanced sizes has a great impact on the computational efficiency of CNM.
Results
Girvan and Newman Method
Among the hierarchical methods, the algorithm of Girvan and Newman (Girvan & Newman 2002) presents an important improvement.
M. Girvan, M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 2002. Cited 1302 times.
The GN method is a divisive hierarchical method.
Edge Betweenness
The strength of a tie can be measured by edge betweenness.
Edge betweenness: the number of shortest paths that pass along the edge.
The edge betweenness of e(1, 2) is 4 (= 6/2 + 1): all the shortest paths from 2 to {4, 5, 6, 7, 8, 9} have to pass through either e(1, 2) or e(2, 3), and e(1, 2) is itself the shortest path between 1 and 2.
Edge Betweenness
Girvan and Newman use a metric called edge betweenness, a measure that favors edges that lie between communities and disfavors those that lie inside communities.
Edge Betweenness
Define the edge betweenness of an edge as the number of shortest paths between pairs of vertices that run along it.
If there is more than one shortest path between a pair of vertices, each path is given equal weight such that the total weight of all the paths is unity.
If a network contains communities or groups that are only loosely connected by a few inter-group edges, then all shortest paths between different communities must go along one of these few edges.
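The definition above, including the equal weighting of multiple shortest paths, can be computed with a breadth-first search from every vertex followed by a backward accumulation (the approach usually attributed to Brandes, used here as an assumption about how one would implement it). The 9-node edge list is hypothetical: the slide's figure is not in this transcript, and the list was chosen so that it reproduces the value 4 quoted for e(1, 2).

```python
from collections import deque, defaultdict

# Hypothetical edge list (an assumption; the original figure is not available).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_betweenness(adj):
    """Betweenness of every edge of an unweighted, undirected graph."""
    bet = defaultdict(float)
    for s in adj:
        dist = {s: 0}
        sigma = defaultdict(float)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)   # predecessors on shortest paths from s
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):   # farthest vertices first
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[tuple(sorted((v, w)))] += share
                delta[v] += share
    # Every vertex pair was counted once from each endpoint, so halve the totals.
    return {edge: value / 2.0 for edge, value in bet.items()}


BETWEENNESS = edge_betweenness(build_adjacency(EDGES))
print(BETWEENNESS[(1, 2)])  # should reproduce the value quoted for e(1, 2)
print(max(BETWEENNESS.values()))
```

On this assumed graph the largest values fall on the edges that bridge the two dense groups, in line with the argument about inter-group edges.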
Edge Betweenness
Thus, the edges connecting communities will have high edge betweenness.
By removing these edges, we separate groups from one another and so reveal the underlying community structure of the graph.
Procedure
The algorithm is stated as follows:
1. Calculate the betweenness for all edges in the network.
2. Remove the edge with the highest betweenness.
3. Recalculate betweennesses for all edges affected by the removal.
4. Repeat from step 2 until no edges remain.
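The four steps can be sketched as follows. Two simplifications are assumptions of this sketch: betweenness is recomputed for every edge after each removal instead of only the affected ones, and ties are broken by taking the smallest edge in sorted order. The 9-node edge list is hypothetical, chosen to be consistent with the numbers quoted in this transcript.

```python
from collections import deque, defaultdict

# Hypothetical edge list (an assumption; the original figure is not available).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def edge_betweenness(adj):
    """Shortest-path edge betweenness via BFS from every vertex."""
    bet = defaultdict(float)
    for s in adj:
        dist, order, queue = {s: 0}, [], deque([s])
        sigma, preds = defaultdict(float), defaultdict(list)
        sigma[s] = 1.0
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[tuple(sorted((v, w)))] += share
                delta[v] += share
    return {edge: value / 2.0 for edge, value in bet.items()}


def components(adj):
    """Connected components as a list of node sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        parts.append(comp)
    return parts


def girvan_newman(edges):
    """Return (edges in removal order, partitions recorded at each split)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    removed, splits = [], []
    n_parts = len(components(adj))
    while any(adj[v] for v in adj):               # step 4: until no edges remain
        bet = edge_betweenness(adj)               # steps 1 and 3 (full recomputation)
        u, v = max(sorted(bet), key=bet.get)      # step 2: highest betweenness
        adj[u].discard(v)
        adj[v].discard(u)
        removed.append((u, v))
        parts = components(adj)
        if len(parts) > n_parts:                  # the removal split a community
            n_parts = len(parts)
            splits.append(parts)
    return removed, splits


REMOVED, SPLITS = girvan_newman(EDGES)
print(REMOVED[:3])
print(SPLITS[0])
```

Each entry of the recorded splits is one level of the divisive hierarchy, so the list can be read from top to bottom like a dendrogram.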
Divisive clustering based on edge betweenness
After removing e(4,5), the betweenness of e(4,6) becomes 20, which is the highest.
After removing e(4,6), the edge e(7,9) has the highest betweenness value, 4, and should be removed.
Initial betweenness value
Idea: progressively remove the edges with the highest betweenness.
Hot Directions
Discovery of Overlapping Communities
Incremental algorithm
Topic-sensitive Community Discovery
Local Community Discovery
Community Discovery in Multi-relational Network
Q&A
Thanks!