Optimizing Web Search Using Social Annotations
Shenghua Bao, Xiaoyuan Wu, Guirong Xue, Yong Yu
Shanghai JiaoTong University
Ben Fei, Zhong Su
IBM China Research Lab
WWW 2007
Introduction (1/3)
Two general aspects of improving web search
– Ordering the web pages according to the query-document similarity
  Ex: anchor text generation, search log mining, etc.
– Ordering the web pages according to their qualities (static ranking)
  Ex: PageRank, HITS, etc.
Introduction (2/3)
Social annotation service (= social bookmarking)
– Developed for web users to organize and share their favorite web pages online by social annotations
– Emergent useful information that has been explored for folksonomy, visualization, semantic web, etc.
– Ex: Delicious
Introduction (3/3)
Utilizing social annotations for better web search from the two aspects:
– Similarity ranking
  The annotations provided by web users are usually good summaries (new metadata) of the corresponding web pages
  The annotation data may be sparse and incomplete
  SocialSimRank (SSR) algorithm
– Static ranking
  The number of annotations assigned to a page indicates its popularity and implies its quality in some sense
  Different annotations may have different weights in indicating the popularity of web pages
  SocialPageRank (SPR) algorithm
Search with Social Annotation
Web page annotators provide cleaner data for users' browsing
Similar or closely related annotations are usually given to the same web pages
Similarity Ranking between the Query and Social Annotations
Term-Matching (TM) based similarity ranking
– suffers from the synonymy problem
  q = {q_1, q_2, …, q_n}, A(p) = {a_1, a_2, …, a_m}
  $sim_{TM}(q, p) = \frac{|q \cap A(p)|}{|A(p)|}$
Social Similarity Ranking (SSR)
  Observation 1: Similar (semantically-related) annotations are usually assigned to similar (semantically-related) web pages by users with common interests. In the social annotation environment, the similarity among annotations in various forms can further be identified by the common web pages they annotated.
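The term-matching similarity above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sim_tm`, the example tags, and the choice to return 0 for an unannotated page are all assumptions.

```python
def sim_tm(query_terms, page_annotations):
    """Term-matching similarity: |q ∩ A(p)| / |A(p)|."""
    q = set(query_terms)
    a = set(page_annotations)
    if not a:
        return 0.0  # assumption: an unannotated page gets similarity 0
    return len(q & a) / len(a)


# Hypothetical page annotated with three tags; one query term matches.
score = sim_tm(["linux", "kernel"], ["gnome", "linux", "ubuntu"])

# Synonymy problem: a related query term that is not an exact match scores 0.
missed = sim_tm(["ubuntu"], ["linux"])
```

The second call shows why exact matching is brittle: a page tagged only "linux" is invisible to the query "ubuntu", however related the two terms are.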
Illustration of SocialSimRank
A(a)={ubuntu}, A(b)={linux,ubuntu}, A(c)={gnome,linux,ubuntu}
P(ubuntu)={a,b,c}, P(linux)={b,c}, P(gnome)={c}
M_AP(ubuntu, a)=1, M_AP(linux, b)=1, M_AP(gnome, c)=2
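To make the toy example concrete, here is a rough SimRank-style propagation over the annotation-page bipartite graph. It is a sketch under stated assumptions rather than the SocialSimRank algorithm itself: the update rule is plain bipartite SimRank, the annotation counts in M_AP are ignored, the damping constants `c_a` and `c_p` are hypothetical, and every page is assumed to have at least one annotation.

```python
from itertools import product


def social_simrank_sketch(page_tags, c_a=0.6, c_p=0.6, iterations=5):
    """SimRank-style similarity between annotations and between pages.

    Unweighted simplification: annotation counts (M_AP) are ignored.
    """
    pages = sorted(page_tags)
    tags = sorted({t for ts in page_tags.values() for t in ts})
    tag_pages = {t: [p for p in pages if t in page_tags[p]] for t in tags}

    # Start from identity: every item is similar only to itself.
    s_a = {(x, y): float(x == y) for x, y in product(tags, repeat=2)}
    s_p = {(x, y): float(x == y) for x, y in product(pages, repeat=2)}

    for _ in range(iterations):
        # Annotations are similar if the pages they annotate are similar.
        new_a = {}
        for x, y in product(tags, repeat=2):
            if x == y:
                new_a[(x, y)] = 1.0
                continue
            total = sum(s_p[(p, q)] for p in tag_pages[x] for q in tag_pages[y])
            new_a[(x, y)] = c_a * total / (len(tag_pages[x]) * len(tag_pages[y]))
        # Pages are similar if their annotations are similar.
        new_p = {}
        for x, y in product(pages, repeat=2):
            if x == y:
                new_p[(x, y)] = 1.0
                continue
            tx, ty = sorted(page_tags[x]), sorted(page_tags[y])
            total = sum(new_a[(s, t)] for s in tx for t in ty)
            new_p[(x, y)] = c_p * total / (len(tx) * len(ty))
        s_a, s_p = new_a, new_p
    return s_a, s_p


# The three pages and their annotations from the illustration.
page_tags = {
    "a": {"ubuntu"},
    "b": {"linux", "ubuntu"},
    "c": {"gnome", "linux", "ubuntu"},
}
s_a, s_p = social_simrank_sketch(page_tags, iterations=1)
```

After one round, "linux" and "ubuntu" already receive a non-zero similarity because they share pages b and c, which is the effect Observation 1 describes.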
$sim_{SSR}(q, p) = \sum_{i=1}^{n} \sum_{j=1}^{m} S_A(q_i, A_j(p))$
where A_j(p) denotes the j-th annotation a_j of page p.
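The double sum can be read as "add up the annotation similarity of every (query term, page annotation) pair". A minimal sketch follows, assuming the similarities are stored in a dictionary keyed by term pairs; the dictionary layout, the fallback for missing pairs, and the similarity value 0.2 are illustrative assumptions.

```python
def sim_ssr(query_terms, page_annotations, s_a):
    """Sum S_A over all (query term, page annotation) pairs."""
    return sum(
        # Assumption: a missing pair counts as 1 if identical, otherwise 0.
        s_a.get((q, a), 1.0 if q == a else 0.0)
        for q in query_terms
        for a in page_annotations
    )


# Hypothetical similarity scores, for illustration only.
s_a = {("linux", "ubuntu"): 0.2, ("ubuntu", "linux"): 0.2}

# Unlike exact term matching, a related annotation now contributes.
score = sim_ssr(["ubuntu"], ["linux"], s_a)
```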
Page Quality Estimation Using Social Annotations
Observation 2: High quality web pages are usually popularly annotated. Popular web pages, up-to-date web users and hot social annotations usually have the following relations: 1) popular web pages are bookmarked by many up-to-date users and annotated by hot annotations; 2) up-to-date users like to bookmark popular pages and use hot annotations; 3) hot annotations are used to annotate popular web pages and used by up-to-date users.
Notations
M_PU: N_P × N_U association matrix between pages and users
M_UA: N_U × N_A association matrix between users and annotations
M_AP: N_A × N_P association matrix between annotations and pages
P_0: vector of randomly initialized SocialPageRank scores
SocialPageRank Algorithm
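One way the mutual reinforcement among pages, users, and annotations could be iterated is sketched below, using the matrices M_PU, M_UA, and M_AP from the notation slide. Treat it as an illustration of the idea, not a transcription of the algorithm: the order of propagation, the per-round normalization, the uniform (rather than random) starting vector, the fixed iteration count, and the toy matrices are all assumptions.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(v):
    total = sum(v)  # assumes at least one positive entry
    return [x / total for x in v]


def social_pagerank_sketch(m_pu, m_ua, m_ap, iterations=50):
    n_pages = len(m_pu)
    p = [1.0 / n_pages] * n_pages  # uniform start instead of a random P_0
    for _ in range(iterations):
        u = mat_vec(transpose(m_pu), p)    # pages -> users
        a = mat_vec(transpose(m_ua), u)    # users -> annotations
        p2 = mat_vec(transpose(m_ap), a)   # annotations -> pages
        a2 = mat_vec(m_ap, p2)             # pages -> annotations
        u2 = mat_vec(m_ua, a2)             # annotations -> users
        p = normalize(mat_vec(m_pu, u2))   # users -> pages
    return p


# Hypothetical data: 2 pages, 2 users, 2 annotations.
m_pu = [[1, 1], [0, 1]]  # page 0 is bookmarked by both users
m_ua = [[1, 0], [1, 1]]
m_ap = [[2, 0], [1, 1]]
scores = social_pagerank_sketch(m_pu, m_ua, m_ap)
```

In this toy setup page 0, which is bookmarked by both users, ends up with the higher score.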
Illustration of Quality Transition in the SPR Algorithm
Dynamic Ranking with Social Information
Dynamic ranking method
– RankSVM
Features
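RankSVM learns a scoring function from pairwise preferences, so training data is usually turned into feature-difference vectors first. The sketch below shows only that pairwise transform, assuming per-query lists of (feature vector, relevance) tuples; the feature layout [bm25, ssr, spr] and all values are made up, and the SVM training step is left out.

```python
def pairwise_examples(docs):
    """Turn one query's judged documents into preference pairs.

    docs: list of (feature_vector, relevance) tuples.
    Returns (difference_vector, label) pairs; label is +1 when the first
    document of the pair should rank above the second, otherwise -1.
    """
    examples = []
    for i, (xi, ri) in enumerate(docs):
        for xj, rj in docs[i + 1:]:
            if ri == rj:
                continue  # ties carry no preference
            diff = [a - b for a, b in zip(xi, xj)]
            examples.append((diff, 1 if ri > rj else -1))
    return examples


# Hypothetical feature vectors [bm25, ssr, spr] with made-up values.
docs = [([2.0, 0.5, 3.0], 1), ([1.0, 0.25, 5.0], 0), ([0.5, 0.0, 1.0], 0)]
pairs = pairwise_examples(docs)
```

A linear classifier trained on these difference vectors yields weights that can score individual documents at query time.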
Experiment Data (1/2)
Delicious data
– 1,736,268 web pages and 269,566 annotations were crawled from Delicious during May 2006.
– Compound annotations were split into standard words with the help of WordNet
  ex: java.programming → java, programming
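The delimiter case from the example can be sketched as follows. This covers only tags joined by punctuation; it does not use WordNet, so concatenated tags without a delimiter are left untouched. The function name and the delimiter set are assumptions.

```python
import re


def split_annotation(tag):
    """Split a compound annotation on common delimiters and lowercase it."""
    return [part.lower() for part in re.split(r"[._\-/+\s]+", tag) if part]


words = split_annotation("java.programming")
```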
Experiment Data (2/2)
Test set for dynamic ranking with social annotation
– Manual query set (MQ)
  50 queries and their corresponding ground truths in Delicious data, manually created by CS students
  Pooling: judge the top 100 documents returned by Lucene
– Automatic query set (AQ) from Open Directory Project (ODP)
  Merging Delicious data with ODP and discarding ODP categories that contain no Delicious URLs
  Randomly sample 3000 ODP categories and extract the category paths as the query set and the corresponding web pages as the ground truth
  ex: extract path TOP/Computer/Software/Graphics as "Computer Software Graphics"
– 5-fold cross validation for each query set
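The path-to-query step for the automatic query set can be illustrated with a short helper. It assumes the root label is literally "TOP" and that path components are separated by "/", as in the example above; the function name is hypothetical.

```python
def odp_path_to_query(path, root="TOP"):
    """Turn an ODP category path into a space-separated query string."""
    parts = [p for p in path.split("/") if p]
    if parts and parts[0] == root:
        parts = parts[1:]  # drop the root label
    return " ".join(parts)


query = odp_path_to_query("TOP/Computer/Software/Graphics")
```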
Evaluation of Annotation Similarities
Table. Explored similar annotations based on SocialSimRank
PageRank vs. Average Count
SPR vs. PageRank
SPR is normalized into a scale of 0-10 so that SPR and PageRank have the same number of pages in each grade from 0 to 10
Pages with the same PageRank value vary widely in their number of annotations and users
SPR successfully characterizes the popularity of web pages among web annotators
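One plausible way to give SPR the same number of pages per grade as PageRank is rank-based matching: sort pages by SPR score and hand out the sorted PageRank grades in that order. The slides do not say how the normalization was done, so the approach, the tie handling, and the sample values below are assumptions.

```python
def match_grades(pagerank_grades, spr_scores):
    """Assign SPR grades so each grade has as many pages as in PageRank.

    Pages are ranked by SPR score; the sorted PageRank grades are then
    handed out in that order, so the grade histogram is preserved.
    """
    order = sorted(range(len(spr_scores)), key=lambda i: spr_scores[i])
    sorted_grades = sorted(pagerank_grades)
    spr_grades = [0] * len(spr_scores)
    for grade, page_index in zip(sorted_grades, order):
        spr_grades[page_index] = grade
    return spr_grades


# Hypothetical PageRank grades and raw SPR scores for four pages.
grades = match_grades([0, 0, 1, 2], [0.4, 0.1, 0.3, 0.2])
```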
Results of Dynamic Ranking (1/2)
Table. Comparison of MAP between similarity features
Method              MQ50              AQ3000
Baseline (BM25)     0.4115            0.1091
Baseline+TM         0.4341            0.1128
Baseline+SSR        0.4697            0.1147
Baseline+PR         0.4141            0.1166
Baseline+SPR        0.4278            0.1225
Baseline+SSR,SPR    0.4724 (+14.80%)  0.1364 (+25.02%)
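The percentages in the last row are relative gains over the BM25 baseline, and MAP is the mean of per-query average precision. The sketch below recomputes the two gains from the table and shows one common definition of average precision; it assumes binary relevance flags and that every relevant document appears in the ranked list.

```python
def average_precision(relevance):
    """Average precision for 0/1 relevance flags in ranked order.

    Assumes every relevant document appears somewhere in the list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    return sum(average_precision(r) for r in runs) / len(runs)


def relative_gain_percent(baseline, value):
    return (value - baseline) / baseline * 100


# MAP values taken from the table above.
mq_gain = relative_gain_percent(0.4115, 0.4724)
aq_gain = relative_gain_percent(0.1091, 0.1364)
```

Rounded to two decimals, the two gains reproduce the +14.80% and +25.02% reported in the table.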
Results of Dynamic Ranking (2/2)
Figure. NDCG at K for comparison of baseline, baseline+TM, baseline+SSR, baseline+PR, and baseline+SPR on query set AQ
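For readers unfamiliar with the metric, here is a small sketch of NDCG at K. Several variants exist; this one assumes the common gain 2^r - 1 with a log2(rank + 1) discount, and it computes the ideal ordering from the same judged list, which may differ from the exact formulation used in the experiments.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k graded results."""
    return sum(
        (2 ** g - 1) / math.log2(rank + 1)
        for rank, g in enumerate(gains[:k], start=1)
    )


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([3, 2, 1], 3)  # already in ideal order
swapped = ndcg_at_k([0, 1], 2)     # relevant document ranked second
```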
Discussions
There are still several problems to address
– Annotation Coverage
  User-submitted queries may not match any social annotation
  Many web pages may have no annotations: 1) newly emerging web pages; 2) pages associated with a key page, since users tend to annotate key pages only; 3) uninteresting web pages.
– Annotation Ambiguity
  SSR may find terms similar to the query terms while failing to disambiguate terms that have more than one meaning
– Annotation Spamming
  As social annotation becomes more and more popular, the amount of spam could drastically increase in the near future
Conclusion
The problem of integrating social annotations into web search is studied.
We observed that social annotations could benefit web search in both similarity ranking and static ranking.
The experimental results showed that SSR can successfully find the latent semantic relations among annotations and SPR can provide the static ranking from the web annotators' perspective.
In the future, we plan to optimize the proposed algorithms and explore more sophisticated social features.