Caimei Lu et al. (KDD 2010) Presented by Anson Liang


Page 1:

Caimei Lu et al. (KDD 2010)

Presented by Anson Liang

Page 2:

Documents: articles, web pages, etc.

Users

Tags: e.g. economy, science, facebook, twitter …

Users add/annotate tags to documents.

Page 3:

Social bookmarking
◦ E.g. Delicious.com

Page 4:

Users

Web pages

Tags

Page 5:

Movie, action, inspiration

Movie, action, violent

Different users have different perspectives!

Page 6:

Objectives
◦ Simulate the generation process of social annotations
◦ Automatically generate tags for documents based on user perspective

Page 7:

Topic Analysis using Generative Models
◦ Latent Dirichlet Allocation (LDA) model: discover topics for a given document

Documents, Words, Topics (many-to-many relationships):
A document has many words. A document may belong to many topics. A word may belong to many topics. A topic may be associated with many words.

Page 8:

LDA – Plate notation

α is the parameter of the uniform Dirichlet prior on the per-document topic distributions. β is the parameter of the uniform Dirichlet prior on the per-topic word distribution. θ_i is the topic distribution for document i, z_ij is the topic for the j-th word in document i, and w_ij is the specific word. The w_ij are the only observable variables; the other variables are latent. φ is a K×V (V is the dimension of the vocabulary) Markov matrix, each row of which denotes the word distribution of a topic.

Ref: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
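The generative story above can be written out as a short simulation. The following is a minimal toy sketch in NumPy (not the authors' code), with made-up sizes for K, V and the document length; the variable names mirror the plate notation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, n_docs, doc_len = 5, 50, 3, 20     # topics, vocabulary size, documents, words per doc (toy sizes)
alpha, beta = 0.1, 0.1                   # symmetric Dirichlet hyperparameters

# phi: K x V matrix, each row is the word distribution of one topic
phi = rng.dirichlet(np.full(V, beta), size=K)

docs = []
for i in range(n_docs):
    theta_i = rng.dirichlet(np.full(K, alpha))   # per-document topic distribution
    words = []
    for j in range(doc_len):
        z_ij = rng.choice(K, p=theta_i)          # topic of the j-th word in document i
        w_ij = rng.choice(V, p=phi[z_ij])        # the observed word
        words.append(w_ij)
    docs.append(words)
```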

Page 9:

Conditionally-independent LDA (Ramage et al.)
◦ Tags are generated from the topics (the same source as the words)

Page 10:

Correlated LDA (Bundschus et al.)
1. Generates topics associated with the words for a document
2. Generates tags from the topics associated with the words
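A rough toy sketch of this two-step coupling, assuming a Corr-LDA-style construction in which each tag reuses a topic already assigned to one of the document's words (illustrative only, not Bundschus et al.'s implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, T = 5, 50, 30                              # topics, word vocabulary, tag vocabulary (toy sizes)
phi_w = rng.dirichlet(np.full(V, 0.1), size=K)   # topic-word distributions
phi_t = rng.dirichlet(np.full(T, 0.1), size=K)   # topic-tag distributions
theta = rng.dirichlet(np.full(K, 0.1))           # topic distribution of one document

# Step 1: generate the topic assignments and words of the document
z = rng.choice(K, p=theta, size=20)              # topic of each word token
words = [rng.choice(V, p=phi_w[zi]) for zi in z]

# Step 2: generate tags only from topics already associated with the words
tags = []
for _ in range(5):
    z_tag = rng.choice(z)                        # reuse one of the word-level topic assignments
    tags.append(rng.choice(T, p=phi_t[z_tag]))
```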

Page 11:

Problem
◦ These models assume words and tags are generated from the same source

However, words are generated by the authors, and tags are generated by users (readers).

User information is missing from the tag generation process!

Page 12:

Three classification schemas of social tags

Page 13:

Overview
◦ Two factors act on the tag generation process:
1. The topics of the document
2. The perspective adopted by the user

Page 14:

Model Formulation

The model combines standard LDA with a per-tag switch variable X (0/1):
◦ X = 1: the tag is generated from the topics learned from the words
◦ X = 0: the tag is generated from user perspectives

Distributions involved: document-topic distribution, user-perspective distribution, topic-word distribution, topic-tag distribution, perspective-tag distribution, and the probability that each tag is generated from topics (the switch).
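To make the switch concrete, here is a hedged toy sketch of how one tag might be generated under this formulation. The names lam, theta_d, theta_u, phi_t and psi are stand-ins for the distributions listed above, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
K, P, T = 5, 4, 30                                # topics, user perspectives, tag vocabulary (toy sizes)

theta_d = rng.dirichlet(np.full(K, 0.1))          # document-topic distribution
theta_u = rng.dirichlet(np.full(P, 0.1))          # user-perspective distribution
phi_t   = rng.dirichlet(np.full(T, 0.1), size=K)  # topic-tag distributions
psi     = rng.dirichlet(np.full(T, 0.1), size=P)  # perspective-tag distributions
lam     = 0.6                                     # probability that a tag comes from the topics (switch)

def generate_tag():
    x = rng.binomial(1, lam)                      # switch: 1 = topics, 0 = user perspectives
    if x == 1:
        z = rng.choice(K, p=theta_d)              # topic learned from the words
        return rng.choice(T, p=phi_t[z])
    p = rng.choice(P, p=theta_u)                  # perspective adopted by the user
    return rng.choice(T, p=psi[p])

tags = [generate_tag() for _ in range(5)]
```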

Page 15:

Document-topic distribution

User-perspective distribution

Topic-word distribution

Topic-tag distribution

Perspective-tag distribution

Page 16:

Probability that each tag is generated from topics

Page 17:

Parameter Estimation
1. document-topic distribution θ(d)
2. topic-word distribution φ(w)
3. topic-tag distribution φ(t)
4. user-perspective distribution θ(u)
5. perspective-tag distribution ψ
6. binomial distribution λ

Gibbs Sampling!
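As a reminder of what collapsed Gibbs sampling looks like, below is a minimal sketch of the standard LDA update for a single word's topic assignment (count tables are assumed to be maintained by the caller). The paper derives analogous sampling equations for the tag-side variables (the switch x and the topic/perspective assignments), which are not reproduced here.

```python
import numpy as np

def resample_topic(d, w, z_old, n_dk, n_kw, n_k, alpha, beta, rng):
    """Collapsed Gibbs update for one word token in plain LDA.
    n_dk[d, k]: topic counts per document; n_kw[k, w]: word counts per topic;
    n_k[k]: total tokens per topic."""
    K, V = n_kw.shape
    # Remove the token's current assignment from the count tables
    n_dk[d, z_old] -= 1; n_kw[z_old, w] -= 1; n_k[z_old] -= 1
    # Full conditional: p(z = k | rest) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    z_new = rng.choice(K, p=p / p.sum())
    # Add the token back under the newly sampled topic
    n_dk[d, z_new] += 1; n_kw[z_new, w] += 1; n_k[z_new] += 1
    return z_new
```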

Page 18:

Parameter Estimation

Page 19:

Parameter Estimation

Page 20:

Parameter Estimation

Page 21:

Parameter Estimation

Page 22:

Datasets & Setup
◦ A social tagging dataset from delicious.com after refining:
 41190 documents
 4414 users
 28740 unique tags
 129908 unique words
◦ 90% for training, 10% for testing

Page 23:

Evaluation Criterion
◦ Perplexity
 Reflects the ability of the model to predict tags for new unseen documents
 A lower perplexity score -> better generalization performance
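For reference, perplexity is usually computed as the exponentiated negative average log-likelihood of the held-out tags. A small sketch under that standard definition (the exact likelihood decomposition used in the paper may differ):

```python
import numpy as np

def tag_perplexity(log_probs):
    """log_probs: log p(tag | model) for each held-out tag occurrence.
    Perplexity = exp(-mean(log p)); lower is better."""
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.exp(-log_probs.mean()))

# Example: three held-out tags predicted with probabilities 0.1, 0.05 and 0.2
print(tag_perplexity(np.log([0.1, 0.05, 0.2])))
```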

Page 24:

Number of topics vs. number of perspectives

Page 25:

Tag Perplexity

Page 26:

Discovered Topics
◦ Observed high semantic correlations among the top words and tags for each topic

Page 27:

Discovered Perspectives
◦ Perspectives are more complicated than topics
◦ The correlation among the tags assigned to each perspective is not as obvious as it is for topics

Page 28:

The generation sources of tags
◦ Tags are generated from the topics learned from the words
◦ Tags are generated from user perspectives

Page 29:

Apply the results of the model for:
1. Tag recommendation
2. Personalized information retrieval

Page 30:

Topic-perspective LDA model
◦ Simulates the tag generation process in a more meaningful way
◦ Tags are generated from both the topics in the document and the user perspective
◦ May be applied to computer vision tagging problems

Page 31:

Thank you!