Probabilistic Models for Discovering E-Communities
Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha
The Pennsylvania State University
WWW 2006
Outline
Introduction
Related Work
Community-User-Topic Models
Semantic Community Discovery
Experiments
Conclusion
Social Network Analysis (SNA)
SNA is an established field in sociology
The goal of SNA
– Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web
The community graph structure
– How social actors gather into groups that are intra-group close and inter-group loose
– An important characteristic of all SNs
Discovering Community from Email Corpora
Typically the SN is constructed by measuring the intensity of contacts between email users
– An edge indicates that communication between two users exceeds a certain frequency threshold
– Problematic in some scenarios
A spammer in the email system sends out a lot of messages
The lack of semantic interpretation
Proposed Method
The inner community property within SNs is examined by analyzing semantic information such as emails
A generative Bayesian network is used to model the generation of communication in an SN
Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model
Related Work: Document Content Characterization
Several factors, either observable or latent, are modeled as variables in the generative Bayesian network
Topic-Word model
– Documents are considered as a mixture of topics
– Each topic corresponds to a multinomial distribution over words
– Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]
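The Topic-Word mixture described above can be written compactly in standard LDA notation (this rendering is mine, not copied from the slides):

```latex
% Each document d mixes T topics; each topic z is a multinomial over words.
P(w \mid d) \;=\; \sum_{z=1}^{T} P(w \mid z)\, P(z \mid d),
\qquad
\theta_d \sim \mathrm{Dirichlet}(\alpha), \quad
\phi_z \sim \mathrm{Dirichlet}(\beta)
```

Here \(\theta_d\) is the per-document topic distribution \(P(z \mid d)\) and \(\phi_z\) the per-topic word distribution \(P(w \mid z)\).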
Related Work (2)
Author-Word model
– The author x is chosen randomly from a_d [A. McCallum, 1999]
Author-Topic model
– Involves both the author and the topic
– Performs well for document content characterization [M. Steyvers et al., 2004]
Community-User-Topic Models (CUT)
Communication document
– A document carrier of communication
Basic idea
– The issuing of a communication document indicates the activities of, and is also conditioned on, the community structure within an SN
– The community is considered as an extra latent variable in the Bayesian network, in addition to the author and topic variables
CUT1: Modeling Community with Users (1)
Assume an SN community is no more than a group of users
– Similar to the assumption made in a topology-based method
– Treat each community as a multinomial distribution over users
CUT1: Modeling Community with Users (2)
Compute the posterior probability P(c, u, z|w) by computing P(c, u, z, w)
A possible side-effect of CUT1 is that it relaxes the community's impact on the generated topics
CUT2: Modeling Community with Topics (1)
An SN community consists of a set of topics
CUT2 differs from CUT1 in strengthening the relation between community and topic
CUT2: Modeling Community with Topics (2)
Similarly, compute P(c, u, z|w) by computing P(c, u, z, w)
A possible side-effect of CUT2 is that it might lead to loose ties between community and users
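By analogy, a factorization consistent with the CUT2 description (a community is a multinomial over topics, with both the user and the word conditioned on the topic) would be the following; again, the exact form is given by the paper's graphical model:

```latex
P(c, u, z, w) \;=\; P(c)\, P(z \mid c)\, P(u \mid z)\, P(w \mid z)
```

Tying the topic directly to the community (c → z) strengthens that relation, while the user now touches the community only through the topic, which explains the looser community-user ties noted above.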
Practical Algorithm: Gibbs Sampling
Gibbs sampling is an algorithm to approximate the joint distribution of multiple variables by drawing a sequence of samples
Gibbs sampling is a Markov chain Monte Carlo algorithm and usually applies when the conditional probability distribution of each variable can be evaluated
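As a minimal illustration of the idea (a toy two-variable discrete distribution, not the CUT sampler itself), the sketch below redraws each variable in turn from its conditional given the current value of the other; the empirical sample frequencies approximate the joint:

```python
import random

# Toy target: a joint distribution over (x, y), each in {0, 1}.
# Gibbs sampling only needs the conditionals P(x|y) and P(y|x).
JOINT = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def cond_x_given_y(y):
    """P(x = 1 | y), computed from the joint."""
    p0, p1 = JOINT[(0, y)], JOINT[(1, y)]
    return p1 / (p0 + p1)

def cond_y_given_x(x):
    """P(y = 1 | x), computed from the joint."""
    p0, p1 = JOINT[(x, 0)], JOINT[(x, 1)]
    return p1 / (p0 + p1)

def gibbs(n_samples, burn_in=500, seed=0):
    rng = random.Random(seed)
    x, y = 0, 0
    samples = []
    for i in range(burn_in + n_samples):
        # Resample each variable from its full conditional in turn.
        x = 1 if rng.random() < cond_x_given_y(y) else 0
        y = 1 if rng.random() < cond_y_given_x(x) else 0
        if i >= burn_in:  # discard the burn-in prefix of the chain
            samples.append((x, y))
    return samples

samples = gibbs(20000)
est = {k: samples.count(k) / len(samples) for k in JOINT}
```

In the CUT models the same scheme resamples the community, user, and topic assignment of each word token from its conditional given all other assignments.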
Gibbs Sampling for CUT
Estimation of the Conditional Probability
Estimating P(c_i, u_i, z_i|w_i) for CUT1 and CUT2
CUT1: (sampling equation shown on the slide)
CUT2: (sampling equation shown on the slide)
EnF-Gibbs: Gibbs Sampling with Entropy Filtering
• Non-informative words are ignored after A iterations
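The filtering step can be sketched as follows; the exact criterion and bookkeeping in EnF-Gibbs may differ. Here a word whose sampled topic assignments are spread nearly uniformly across topics (high entropy) is treated as non-informative and dropped from further sampling:

```python
import math
from collections import Counter

def word_entropy(word, assignments):
    """Entropy of a word's topic assignments. High entropy means the
    word is spread evenly over topics, i.e. carries little topical
    information (a stopword-like pattern)."""
    counts = Counter(t for w, t in assignments if w == word)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_words(words, assignments, threshold=1.0):
    """Return the words whose assignment entropy exceeds the threshold.
    The threshold value here is illustrative, not the paper's."""
    return {w for w in words if word_entropy(w, assignments) > threshold}

# Hypothetical (word, topic) assignments after A Gibbs iterations:
# "energy" always lands in topic 0, "the" is spread over four topics.
assignments = [("energy", 0), ("energy", 0), ("energy", 0),
               ("the", 0), ("the", 1), ("the", 2), ("the", 3)]
stop = filter_words({"energy", "the"}, assignments)
```

After such a pass, subsequent iterations skip the filtered words, which is where the efficiency gain over plain Gibbs sampling comes from.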
Experiment Setup
Data: Enron email dataset
– Made public by the Federal Energy Regulatory Commission
Fix the number of communities C at 6 and the number of topics T at 20
The smoothing hyper-parameters α, β and γ were set to 5/T, 0.01 and 0.1 respectively
Experiment Result-1
Table 1: Topics discovered by CUT1
Table 2: Abbreviations
Experiment Result-2
Fig: Communities/topics of an employee
Experiment Result-3
Fig: A community discovered by CUT2
Experiment Result-4
D..steffes = vice president of Enron in charge of government affairs
Cara.semperger = a senior analyst
Mike.grigsby = a marketing manager
Rick.buy = chief risk management officer
Experiment Result-5
Similarity between two clustering results:
Fig: Community similarity comparisons
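The transcript omits the similarity formula from this slide. One common way to compare two clusterings, sketched below, is Rand-index-style pairwise agreement; the paper's exact measure may differ:

```python
from itertools import combinations

def pairwise_similarity(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree: both
    place the pair in the same cluster, or both place it in different
    clusters. Invariant to relabeling the clusters."""
    items = range(len(labels_a))
    pairs = list(combinations(items, 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
    return agree / len(pairs)
```

For example, two clusterings that differ only by a permutation of cluster ids score 1.0, since every pair is grouped identically.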
Experiment Result-6
Fig: Efficiency of EnF-Gibbs
Conclusion and Future Work
Two versions of Community-User-Topic models are presented for community discovery in SNs
EnF-Gibbs sampling is introduced by extending Gibbs sampling with entropy filtering
Experiments show that the proposed method effectively tags communities with topic semantics
It would be interesting to explore the predictive performance of these models on new communications between strange social actors in SNs
Illustration of Dirichlet Distribution
Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α = (6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).