A Statistical Comparison of Tag and Query Logs
Mark J. Carman, Robert Gwadera, Fabio Crestani, and Mark Baillie
SIGIR 2009
June 4, 2010
Hyunwoo Kim
Contents
– Introduction
– Building a Dataset
– Are the Distributions Similar?
– Investigating Website Content
– Conclusion
Introduction
tags
Questions
1. Are queries and tags similar across URLs?
2. Can tag data be used to approximate user queries to a search engine?
3. Can query logs be used to suggest new tags for a particular webpage?
4. For what types of websites is the correlation between the term distributions for queries and tags the highest?
5. Which of the distributions, tags or queries, is most closely related to the content of the clicked websites?
Building a Dataset
AOL query log
– Sizable
– Recent (2006)
– English queries
– Available to academic researchers
– 657,426 users
– A period of 3 months from March to May, 2006
Delicious tags
– Collaborative tagging system
Final dataset: 4145 complete URLs
– Google query, stemming, pruning
Are the Distributions Similar?
Example URL: http://www.nytimes.com (tags)
Kullback-Leibler divergence
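The slide names the Kullback-Leibler divergence without showing how it would be applied to term counts. The following is a minimal sketch, assuming the query terms and tags for one URL are available as plain lists of stemmed terms. The helper names (`term_distribution`, `kl_divergence`), the toy term lists, and the add-one smoothing are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def term_distribution(terms, vocabulary, alpha=1.0):
    """Turn a list of terms into a smoothed probability distribution.

    Additive smoothing (alpha) is an assumption made here so that every
    term in the shared vocabulary has non-zero probability; without it
    the KL divergence is undefined when a term is missing on one side.
    """
    counts = Counter(terms)
    total = sum(counts[t] for t in vocabulary) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in bits; p and q must share the same keys, with q[t] > 0."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


# Hypothetical term lists for a single URL (not data from the study).
query_terms = ["news", "new", "york", "times", "news"]
tag_terms = ["news", "newspaper", "daily", "news", "media"]

vocabulary = sorted(set(query_terms) | set(tag_terms))
p_query = term_distribution(query_terms, vocabulary)
p_tag = term_distribution(tag_terms, vocabulary)

print("KL(query || tag) =", kl_divergence(p_query, p_tag))
print("KL(tag || query) =", kl_divergence(p_tag, p_query))
```

In general the two directions give different values, because the KL divergence is not symmetric.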
Jensen-Shannon divergence
– Symmetric measure
Overlap coefficient
– Vq : query logs
– Vr : tags
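A minimal sketch of the two measures named above. It assumes the usual definitions: the Jensen-Shannon divergence as the average KL divergence of each distribution from their midpoint, and the overlap coefficient as the size of the vocabulary intersection divided by the size of the smaller vocabulary. The slide does not spell out either formula, so both definitions, the function names, and the toy inputs are assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; q[t] must be > 0 wherever p[t] > 0."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, between 0 and 1)."""
    terms = set(p) | set(q)
    p_full = {t: p.get(t, 0.0) for t in terms}
    q_full = {t: q.get(t, 0.0) for t in terms}
    # Midpoint distribution; non-zero wherever p or q is non-zero,
    # so no smoothing is needed here.
    m = {t: 0.5 * (p_full[t] + q_full[t]) for t in terms}
    return 0.5 * kl_divergence(p_full, m) + 0.5 * kl_divergence(q_full, m)


def overlap_coefficient(vq, vr):
    """|Vq ∩ Vr| / min(|Vq|, |Vr|) for a query vocabulary Vq and tag vocabulary Vr."""
    vq, vr = set(vq), set(vr)
    if not vq or not vr:
        return 0.0
    return len(vq & vr) / min(len(vq), len(vr))


# Hypothetical distributions and vocabularies (not data from the study).
p_query = {"news": 0.5, "york": 0.25, "times": 0.25}
p_tag = {"news": 0.5, "media": 0.5}

print("JS divergence:", js_divergence(p_query, p_tag))
print("Overlap coefficient:", overlap_coefficient(p_query, p_tag))
```

Unlike the KL divergence, the Jensen-Shannon divergence stays finite when a term appears in only one of the two distributions, which is convenient when the query and tag vocabularies overlap only partially.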
Open Directory Project
Investigating Website Content
Conclusion
Similarity between query terms and tags
– Vocabularies contain a large amount of overlap
– Term frequency distributions are correlated
– Similarity is not dependent on the topic area
Queries are more similar to content than tags are
Queries and tags are more similar to one another than to content
Future work
– Models for automatically removing noise from the tag and query logs
– Techniques for predicting useful tags from query distributions
– Techniques for the effective use of tag data to improve different forms of Web search
Thank you