Probabilistic Models for Discovering E-Communities

Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha
The Pennsylvania State University
WWW 2006


Page 1: Probabilistic Models for Discovering E-Communities

Probabilistic Models for Discovering E-Communities

Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha

The Pennsylvania State University

WWW 2006

Page 2: Probabilistic Models for Discovering E-Communities

Outline

Introduction

Related Work

Community-User-Topic Models

Semantic Community Discovery

Experiments

Conclusion

Page 4: Probabilistic Models for Discovering E-Communities

Social Network Analysis (SNA)

SNA is an established field in sociology

The goal of SNA
– Discovering interpersonal relationships based on various modes of information carriers, such as emails and the Web

The community graph structure
– How social actors gather into groups such that they are intra-group close and inter-group loose
– An important characteristic of all SNs

Page 5: Probabilistic Models for Discovering E-Communities

Discovering Community from Email Corpora

Typically the SN is constructed by measuring the intensity of contacts between email users.
– An edge indicates that the communication frequency between two users is higher than a certain threshold
– Problematic in some scenarios
   A spammer in the email system sends out a lot of messages
   The lack of semantic interpretation
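The threshold-based construction described on this slide can be sketched as follows. This is a minimal illustration in Python, not the procedure of any particular system: the `(sender, recipient)` input format, the user names, and the threshold value of 2 are all hypothetical.

```python
from collections import Counter

def build_contact_graph(messages, threshold):
    """Return the undirected edges whose message count meets the threshold.

    messages: iterable of (sender, recipient) pairs (hypothetical input format).
    threshold: minimum number of messages exchanged for an edge to exist.
    """
    counts = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # Treat communication as undirected: (a, b) and (b, a) share a key.
            counts[frozenset((sender, recipient))] += 1
    return {pair for pair, n in counts.items() if n >= threshold}

# Hypothetical toy log: alice and bob talk often; a spammer mails everyone.
messages = [
    ("alice", "bob"), ("bob", "alice"), ("alice", "bob"),
    ("spammer", "alice"), ("spammer", "alice"),
    ("spammer", "bob"), ("spammer", "bob"),
    ("carol", "alice"),
]
edges = build_contact_graph(messages, threshold=2)
```

In this toy log the spammer ends up connected to both users purely through message volume, while the single genuine message from carol is dropped. That is the failure mode the slide points out: frequency alone carries no semantic information.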

Page 6: Probabilistic Models for Discovering E-Communities

Proposed Method

The inner community properties within SNs are examined by analyzing semantic information such as emails

A generative Bayesian network is used to model the generation of communication in an SN

Similarity among social actors is modeled as a hidden layer in the proposed probabilistic model

Page 8: Probabilistic Models for Discovering E-Communities

Related Work: Document Content Characterization

Several factors, either observable or latent, are modeled as variables in the generative Bayesian network

Topic-Word model
– Documents are considered as a mixture of topics
– Each topic corresponds to a multinomial distribution over words
– Latent Dirichlet Allocation (LDA) [D. Blei et al., 2003]
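The generative story behind the topic-word model can be made concrete with a short sketch. It assumes the usual LDA-style process (draw a topic mixture for the document, then for each word draw a topic and then a word from that topic). The two-topic, four-word vocabulary and the prior `alpha` are hypothetical values chosen only for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, topic_word, alpha, rng):
    """Generate one document under a topic-word (LDA-style) model.

    topic_word: list of per-topic word distributions (each sums to 1).
    alpha: Dirichlet prior over the document's topic mixture.
    """
    theta = sample_dirichlet(alpha, rng)  # the document's mixture of topics
    topic_ids = range(len(theta))
    vocab_ids = range(len(topic_word[0]))
    topics, words = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]           # pick a topic
        w = rng.choices(vocab_ids, weights=topic_word[z])[0]   # pick a word from it
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(0)
# Hypothetical model: topic 0 only emits words 0-1, topic 1 only emits words 2-3.
topic_word = [[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]]
theta, topics, words = generate_document(10, topic_word, alpha=[0.5, 0.5], rng=rng)
```

Inference runs this story in reverse: given only `words`, estimate the hidden `topics` and the distributions that produced them.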

Page 9: Probabilistic Models for Discovering E-Communities

Related Work (2)

Author-Word model
– The author x is chosen randomly from a_d [A. McCallum, 1999]

Author-Topic model
– Involves both the author and the topic
– Performs well for document content characterization [M. Steyvers et al., 2004]

Page 11: Probabilistic Models for Discovering E-Communities

Community-User-Topic Models (CUT)

Communication document
– A document carrier of communication

Basic idea
– The issue of a communication document indicates the activities of, and is also conditioned on, the community structure within an SN
– The community is considered as an extra latent variable in the Bayesian network, in addition to the author and topic variables

Page 12: Probabilistic Models for Discovering E-Communities

CUT1: Modeling Community with Users (1)

Assume an SN community is no more than a group of users
– Similar to that assumed in a topology-based method
– Treat each community as a multinomial distribution over users
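One way to read CUT1 is as a chain in which a community emits a user, the user emits a topic, and the topic emits a word. The sketch below illustrates that reading only. The chain c → u → z → w is an assumption inferred from the bullet points (the slide's model diagram did not survive in this transcript), and every probability table here is a hypothetical toy value.

```python
import random
from itertools import product

# Hypothetical toy tables: 2 communities, 2 users, 2 topics, 3 words.
p_c = [0.5, 0.5]                      # P(c)
p_u_given_c = [[0.9, 0.1],            # P(u | c): community = multinomial over users
               [0.2, 0.8]]
p_z_given_u = [[0.7, 0.3],            # P(z | u): each user mixes topics
               [0.1, 0.9]]
p_w_given_z = [[0.6, 0.3, 0.1],       # P(w | z): each topic is a multinomial over words
               [0.1, 0.2, 0.7]]

def sample_cut1(rng):
    """Sample (c, u, z, w) under the ASSUMED chain c -> u -> z -> w."""
    c = rng.choices(range(len(p_c)), weights=p_c)[0]
    u = rng.choices(range(len(p_u_given_c[c])), weights=p_u_given_c[c])[0]
    z = rng.choices(range(len(p_z_given_u[u])), weights=p_z_given_u[u])[0]
    w = rng.choices(range(len(p_w_given_z[z])), weights=p_w_given_z[z])[0]
    return c, u, z, w

def joint_cut1(c, u, z, w):
    """Joint probability under the same assumed factorization."""
    return p_c[c] * p_u_given_c[c][u] * p_z_given_u[u][z] * p_w_given_z[z][w]

rng = random.Random(1)
draw = sample_cut1(rng)
total = sum(joint_cut1(c, u, z, w)
            for c, u, z, w in product(range(2), range(2), range(2), range(3)))
```

Under this reading the community influences the topic only through the user, so its effect on the generated topics is indirect.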

Page 13: Probabilistic Models for Discovering E-Communities

CUT1: Modeling Community with Users (2)

Compute the posterior probability P(c, u, z | w) by computing P(c, u, z, w)

A possible side-effect of CUT1 is that it relaxes the community’s impact on the generated topics
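The step from the joint distribution to the posterior is Bayes' rule. Written out, with the sum ranging over all community, user, and topic assignments (the primed symbols are introduced here only as summation variables):

```latex
P(c, u, z \mid w) \;=\; \frac{P(c, u, z, w)}{\sum_{c', u', z'} P(c', u', z', w)}
```

Because the denominator does not depend on (c, u, z), the posterior is proportional to the joint, which is why evaluating P(c, u, z, w) is sufficient.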

Page 14: Probabilistic Models for Discovering E-Communities

CUT2: Modeling Community with Topics (1)

An SN community consists of a set of topics

CUT2 differs from CUT1 in strengthening the relation between community and topic

Page 15: Probabilistic Models for Discovering E-Communities

CUT2: Modeling Community with Topics (2)

Similarly, compute P(c, u, z | w) by computing P(c, u, z, w)

A possible side-effect of CUT2 is that it might lead to loose ties between community and users

Page 17: Probabilistic Models for Discovering E-Communities

Practical Algorithm: Gibbs Sampling

Gibbs sampling is an algorithm to approximate the joint distribution of multiple variables by drawing a sequence of samples

Gibbs sampling is a Markov chain Monte Carlo algorithm and usually applies when the conditional probability distribution of each variable can be evaluated
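A small, self-contained example may help show what "drawing a sequence of samples" from conditionals looks like. The sketch below is a textbook illustration rather than the sampler used for CUT. It targets a standard bivariate normal with correlation `rho`, where each full conditional is a one-dimensional normal that is easy to sample. The values of `rho`, the burn-in length, and the sample count are arbitrary choices.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is itself normal:
        x | y ~ N(rho * y, 1 - rho**2)
        y | x ~ N(rho * x, 1 - rho**2)
    """
    sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    samples = []
    for step in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # resample x given the current y
        y = rng.gauss(rho * x, sd)   # resample y given the new x
        if step >= burn_in:          # discard early draws taken before mixing
            samples.append((x, y))
    return samples

rng = random.Random(42)
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000, burn_in=1000, rng=rng)
# The average of x * y over the samples approximates E[xy], which equals rho here.
mean_xy = sum(x * y for x, y in samples) / len(samples)
```

The joint distribution is never sampled directly. The chain only ever draws from one conditional at a time, and the collected pairs approximate the joint.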

Page 18: Probabilistic Models for Discovering E-Communities

Gibbs Sampling for CUT

Page 19: Probabilistic Models for Discovering E-Communities

Estimation of the Conditional Probability

Estimating P(c_i, u_i, z_i | w_i) for CUT1 and CUT2

CUT1:

CUT2:

Page 20: Probabilistic Models for Discovering E-Communities

EnF-Gibbs: Gibbs Sampling with Entropy Filtering

• Non-informative words are ignored after A iterations
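The slide does not say how "non-informative" is measured, so the following is only one plausible sketch of an entropy filter. It assumes a word is non-informative when its assignments are spread almost evenly across topics, i.e. when the entropy of its empirical topic distribution is high. The counts, the word list, and the cutoff `max_entropy` are hypothetical.

```python
import math

def topic_entropy(topic_counts):
    """Shannon entropy (in nats) of a word's empirical topic distribution."""
    total = sum(topic_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in topic_counts:
        if n > 0:
            p = n / total
            entropy -= p * math.log(p)
    return entropy

def informative_words(word_topic_counts, max_entropy):
    """Keep only words whose topic distribution is concentrated enough."""
    return {word for word, counts in word_topic_counts.items()
            if topic_entropy(counts) <= max_entropy}

# Hypothetical topic-assignment counts collected after the first A sweeps.
word_topic_counts = {
    "pipeline": [40, 1, 1],    # concentrated on one topic
    "the":      [30, 31, 29],  # spread evenly, close to the maximum entropy log(3)
    "trading":  [2, 50, 3],
}
kept = informative_words(word_topic_counts, max_entropy=0.8)
```

Later sweeps would then skip the filtered words, which reduces the work done per iteration.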

Page 22: Probabilistic Models for Discovering E-Communities

Experiment Setup

Data: Enron email dataset
– Made public by the Federal Energy Regulatory Commission

Fix the number of communities C at 6 and the number of topics T at 20

The smoothing hyper-parameters α, β and γ were set at 5/T, 0.01 and 0.1 respectively
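For reference, the settings above translate into the following constants. The variable names are illustrative only, and the slide does not state which distribution each hyper-parameter smooths, so none is assumed here.

```python
# Experiment settings as listed on the slide (names are illustrative).
NUM_COMMUNITIES = 6        # C
NUM_TOPICS = 20            # T

ALPHA = 5 / NUM_TOPICS     # alpha = 5/T
BETA = 0.01                # beta
GAMMA = 0.1                # gamma
```

With T = 20, the rule α = 5/T gives α = 0.25.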

Page 23: Probabilistic Models for Discovering E-Communities

Experiment Result-1

Table 1: Topics discovered by CUT1

Table 2: Abbreviations

Page 24: Probabilistic Models for Discovering E-Communities

Experiment Result-2

Fig: Communities/topics of an employee

Page 25: Probabilistic Models for Discovering E-Communities

Experiment Result-3

Fig: A community discovered by CUT2

Page 26: Probabilistic Models for Discovering E-Communities

Experiment Result-4

D..steffes = vice president of Enron in charge of government affairs

Cara.semperger = a senior analyst

Mike.grigsby = a marketing manager

Rick.buy = chief risk management officer

Page 27: Probabilistic Models for Discovering E-Communities

Experiment Result-5

Similarity between two clustering results:

Fig: Community similarity comparisons

Page 28: Probabilistic Models for Discovering E-Communities

Experiment Result-6

Fig: Efficiency of EnF-Gibbs

Page 30: Probabilistic Models for Discovering E-Communities

Conclusion and Future Work

Two versions of Community-User-Topic models are presented for community discovery in SNs.

EnF-Gibbs sampling is introduced by extending Gibbs sampling with entropy filtering

Experiments show that the proposed method effectively tags communities with topic semantics

It would be interesting to explore the predictive performance of these models on new communications between strange social actors in SNs

Page 31: Probabilistic Models for Discovering E-Communities

Illustration of Dirichlet Distribution

Several images of the probability density of the Dirichlet distribution when K=3 for various parameter vectors α. Clockwise from top left: α=(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4).
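Since the images themselves are not part of this transcript, the sketch below shows how the same densities could be evaluated numerically. It uses the standard Dirichlet density formula and, as a simplification, returns zero for points that are not strictly inside the simplex. That is adequate for the parameter vectors listed here, where every component of α exceeds 1.

```python
import math

def dirichlet_pdf(x, alpha):
    """Density of Dirichlet(alpha) at a point x strictly inside the simplex."""
    if any(xi <= 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        return 0.0  # simplification: treat boundary and off-simplex points as zero
    # Work in log space for numerical stability.
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x))
    return math.exp(log_norm + log_kernel)

def dirichlet_mean(alpha):
    """Mean of Dirichlet(alpha): each component is alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]

# The four parameter vectors from the slide, with the density at each mean.
for alpha in [(6, 2, 2), (3, 7, 5), (6, 2, 6), (2, 3, 4)]:
    mean = dirichlet_mean(alpha)
    print(alpha, [round(m, 3) for m in mean], round(dirichlet_pdf(mean, alpha), 3))
```

Larger components of α pull the mass toward the corresponding corner of the simplex, and a larger sum of α concentrates the density more tightly around the mean.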