Information Retrieval Effectiveness of Folksonomies on the World Wide Web
P. Jason Morrison
Information retrieval (IR) on the Web
Traditionally, there are 2 options:
1. Search Engines – documents added to collection automatically, full text searching using some algorithm;
2. Subject Directories – documents collected and organized into a hierarchy or taxonomy by experts.
Many sites now use a new system:
3. Folksonomies – documents collected and tagged with keywords by all users, brought together into a loose organizational system.
Folksonomies
• Very little empirical study has been done on folksonomies.
• Used by social bookmarking sites like Del.icio.us, photography sites like Flickr, and video sites like YouTube.
• Even large, established retailers like Amazon are starting to experiment with tagging.
Research Questions:
1. Do web sites that employ folksonomies return relevant results to users performing information retrieval tasks, specifically searching?
2. Do folksonomies perform as well as subject directories and search engines?
Hypotheses:
1. Despite different index sizes and categorization strategies, the top results from search engines, directories, and folksonomies will show some overlap. Items that show up in the results of more than one will be more likely to be judged relevant.
2. There will be significant differences between the IR effectiveness of search engines, expert-maintained directories, and folksonomies.
3. Folksonomies will perform as well as or better than search engines and directories for information needs that fall into entertainment or current event categories. They will perform less well for factual or specific-document searches.
Gordon and Pathak’s (1999) Seven Features:
1. Searches should use real information needs
2. Studies should try to capture the information need, not just the query used, if possible
3. A large enough number of searches must be done to do a meaningful evaluation.
4. Most major search engines should be included
5. The special features of each engine should be utilized.
6. Relevance should be judged by the person with the information need.
Gordon and Pathak’s Seven Features, cont:
7) Experiments need to be conducted so they provide meaningful measures:
• Good experimental design, such as returning results in a random order;
• Use of accepted IR measurements like Recall and Precision;
• Use of appropriate statistical tests.
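The "random order" design point can be sketched as pooling every system's results for a query, removing duplicate URLs, and shuffling the pool before it is shown to the judge. This is a minimal illustration, not the procedure used in any of the cited studies; the function name, engine labels, and URLs are hypothetical.

```python
import random


def pool_for_judging(results_by_engine, seed=None):
    """Merge ranked result lists into one deduplicated, shuffled list of URLs.

    results_by_engine maps an engine name to its ranked list of URLs.
    Returns the shuffled pool plus a map from URL to the engines that returned it,
    so judgments can be attributed back to each engine afterwards.
    """
    sources = {}
    for engine, urls in results_by_engine.items():
        for url in urls:
            sources.setdefault(url, set()).add(engine)
    pooled = list(sources)
    # A seeded generator keeps the shuffle reproducible for a given session.
    random.Random(seed).shuffle(pooled)
    return pooled, sources


# Hypothetical data for illustration only.
results = {
    "engine_a": ["http://example.com/1", "http://example.com/2"],
    "engine_b": ["http://example.com/2", "http://example.com/3"],
}
pooled, sources = pool_for_judging(results, seed=0)
```

Shuffling the pooled list hides both the source and the original rank of each result from the judge, and deduplication means a URL returned by several systems is judged only once.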
Hawking, et al.’s (2001) additional feature:
8) Search topics should include different types of information needs
Four different types based on the desired results:
1. A short factual statement that directly answers a question;
2. A specific document or web site that the user knows or suspects exists;
3. A selection of documents that pertain to an area of interest; or
4. An exhaustive list of every document that meets their need.
Study | Leighton and Srivastava (1997) | Gordon and Pathak (1999) | Hawking et al (2001) | Can et al (2003) | The Present Study
Information needs provided by | Library reference desk, other studies | Faculty members | Queries from web logs | Computer Science students and professors | Graduate students
Queries created by | The researchers | Skilled searchers | Queries from web logs | Same | Same
Relevance judged by | The researchers (by consensus) | Same faculty members | Research assistants | Same | Same
Participants | 2 | 33 faculty members | 6 | 19 | 34
Total queries | 15 | 33 | 54 | 25 | 103
Engines tested | 5 | 8 | 20 | 8 | 8
Results evaluated per engine | 20 | 20 | 20 | 20 | 20
Total results evaluated per evaluator | 1500 | 160 | 3600 | 160 or 320 | About 160
Relevancy scale | 4 categories | 4-point scale | Binary | Binary | Binary
Precision measures | P(20), weighted groups by rank | P(1-5), P(1-10), P(5-10), P(15-20) | P(1), P(1-5), P(5), P(20) | P(10), P(20) | P(20), P(1-5)
Recall measures | none | Relative recall: R(15-20), R(15-25), R(40-60), R(90-110), R(180-200) | none | Relative recall: R(10), R(20) | Relative recall: R(20), R(1-5)
IR systems studied
• Two directories: Open Directory and Yahoo.
• Three search engines: Alta Vista, Live (Microsoft), and Google.
• Three social bookmarking systems representing the folksonomies: Del.icio.us, Furl, and Reddit.
General results
• 34 users, 103 queries and 9266 total results returned.
• The queries generated by participants were generally similar to previous studies in terms of word count and use of operators.
• Previous studies of search engine logs have shown that users rarely try multiple searches and rarely look past the first set of results. The current study fits this pattern.
• For many queries, some IR systems did not return the full 20 results. In fact, there were many queries where some IR systems returned 0 results.
Hypothesis 1: Overlap in results
Number of engines returning the URL   Number of unique results   Relevancy rate   SD
1                                     7223                       .1631            .36947
2                                     617                        .2950            .45640
3                                     176                        .3580            .48077
4                                     43                         .4884            .50578
5                                     15                         .4667            .51640
6                                     2                          .0000            .00000
Total                                 8076                       .1797            .38393
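The SD column can be cross-checked against the other two columns, since with binary relevance judgments the sample standard deviation is fully determined by the count and the relevancy rate. The sketch below assumes binary (0/1) judgments and infers the number of relevant results by rounding rate × count; those inferred counts are an assumption, not figures from the slides.

```python
import statistics

# (engines returning the URL, unique results, reported relevancy rate, reported SD)
rows = [
    (1, 7223, 0.1631, 0.36947),
    (2, 617, 0.2950, 0.45640),
    (3, 176, 0.3580, 0.48077),
    (4, 43, 0.4884, 0.50578),
    (5, 15, 0.4667, 0.51640),
    (6, 2, 0.0000, 0.00000),
]


def binary_sd(n, rate):
    """Sample SD of n binary judgments whose mean is approximately `rate`."""
    relevant = round(n * rate)  # assumption: count inferred from the rounded rate
    judgments = [1] * relevant + [0] * (n - relevant)
    return statistics.stdev(judgments)


recomputed = {k: binary_sd(n, rate) for k, n, rate, _ in rows}

# Share of unique results returned by exactly one engine.
share_single = 7223 / 8076

for k, n, rate, reported in rows:
    print(k, n, f"{recomputed[k]:.5f}", f"{reported:.5f}")
print(f"{share_single:.3f}")
```

Under these assumptions the recomputed values agree with the reported SDs to about four decimal places, and the single-engine share works out to roughly 89% of unique results.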
IR system type combination
Engine types returning same URL
Directory   Folksonomy   Search Engine   N      Mean
no          no           yes             4801   .2350
no          yes          no              2484   .0676
yes         no           no              592    .1419
no          yes          yes             94     .3191
yes         no           yes             67     .4179
yes         yes          no              12     .1667
yes         yes          yes             26     .4231
Total                                    8076   .1797
Overlap of results findings
• Almost 90% of results were returned by just one engine – fits well with previous studies.
• Results found by both search engines and folksonomies were significantly more likely to be relevant.
• The directory/search engine group had a higher relevancy rate than the folksonomy/search engine group, but the difference was not significant.
• Allowing tagging or meta-searching a folksonomy could improve search engine performance.
• Hypothesis 1 is supported.
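An overlap analysis of this kind can be sketched by counting, for each unique URL, how many systems returned it, then computing the share of relevant URLs in each group. This is an illustrative reconstruction rather than the study's actual procedure; the function name and the toy data are hypothetical.

```python
from collections import defaultdict


def relevancy_by_overlap(results_by_engine, relevant_urls):
    """Group unique URLs by how many engines returned them.

    Returns {number_of_engines: (unique_url_count, relevancy_rate)}.
    """
    engines_per_url = defaultdict(set)
    for engine, urls in results_by_engine.items():
        for url in urls:
            engines_per_url[url].add(engine)

    groups = defaultdict(list)
    for url, engines in engines_per_url.items():
        groups[len(engines)].append(1 if url in relevant_urls else 0)

    return {k: (len(v), sum(v) / len(v)) for k, v in sorted(groups.items())}


# Toy data for one query (hypothetical URLs and judgments).
toy = {
    "search_engine": ["u1", "u2", "u3"],
    "folksonomy": ["u2", "u4"],
    "directory": ["u2", "u3"],
}
table = relevancy_by_overlap(toy, relevant_urls={"u2", "u3"})
print(table)
```

Keying the groups on the set of system types instead of the count of engines would give the type-combination breakdown in the same way.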
Hypothesis 2: Performance differences
Performance measures:
• Precision
• Relative Recall
• Retrieval Rate also calculated
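The three measures can be sketched for one system and one query as follows. The slides do not define them, so the definitions here are assumptions: precision is the share of retrieved results (up to the document cutoff value, dcv) judged relevant, relative recall is the system's relevant results divided by the relevant results found by all systems combined, and retrieval rate is the number of results returned divided by the cutoff.

```python
def precision_at(judgments, dcv):
    """judgments: booleans in rank order (True = judged relevant)."""
    top = judgments[:dcv]
    if not top:
        return None  # undefined when the system returned nothing
    return sum(top) / len(top)


def relative_recall(judgments, dcv, pooled_relevant):
    """pooled_relevant: unique relevant results found by all systems for the query."""
    if pooled_relevant == 0:
        return None  # undefined when no system found anything relevant
    return sum(judgments[:dcv]) / pooled_relevant


def retrieval_rate(judgments, dcv):
    """Share of the dcv result slots the system actually filled."""
    return len(judgments[:dcv]) / dcv


# Hypothetical judgments for two systems on one query.
system_a = [True, False, True]
system_b = []

print(precision_at(system_a, 20), relative_recall(system_a, 20, 4), retrieval_rate(system_a, 20))
print(precision_at(system_b, 20), retrieval_rate(system_b, 20))
```

Treating precision as undefined for empty result lists would explain why a system can have fewer precision observations than queries, while retrieval rate is defined for every query.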
Performance (dcv 20)
IR System         Statistic   Precision   Recall     Retrieval Rate
Open Directory    Mean        .172297     .023934    0.1806
                  N           37          98         103
Yahoo Directory   Mean        .270558     .063767    0.1709
                  N           36          98         103
Del.icio.us       Mean        .210853     .041239    0.1908
                  N           43          98         103
Furl              Mean        .093840     .044975    0.5311
                  N           75          98         103
Reddit            Mean        .041315     .042003    0.5617
                  N           62          98         103
Google            Mean        .286022     .351736    0.8942
                  N           93          98         103
Live              Mean        .235437     .341294    0.9845
                  N           103         98         103
Alta Vista        Mean        .262990     .431267    0.9845
                  N           102         98         103
Total             Mean        .204095     .167527    0.5623
                  N           551         784        824
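The Total row can be reproduced from the per-system rows: each total mean is the N-weighted average of the system means. Because precision N differs by system, the overall precision is not a simple average of the eight precision values. The short check below uses only the figures reported in the table.

```python
# system: (precision mean, precision N, recall mean, recall N, retrieval mean, retrieval N)
systems = {
    "Open Directory": (0.172297, 37, 0.023934, 98, 0.1806, 103),
    "Yahoo Directory": (0.270558, 36, 0.063767, 98, 0.1709, 103),
    "Del.icio.us": (0.210853, 43, 0.041239, 98, 0.1908, 103),
    "Furl": (0.093840, 75, 0.044975, 98, 0.5311, 103),
    "Reddit": (0.041315, 62, 0.042003, 98, 0.5617, 103),
    "Google": (0.286022, 93, 0.351736, 98, 0.8942, 103),
    "Live": (0.235437, 103, 0.341294, 98, 0.9845, 103),
    "Alta Vista": (0.262990, 102, 0.431267, 98, 0.9845, 103),
}


def weighted_mean(pairs):
    """pairs: iterable of (mean, n). Returns (pooled mean, total n)."""
    pairs = list(pairs)
    total_n = sum(n for _, n in pairs)
    return sum(mean * n for mean, n in pairs) / total_n, total_n


precision_total = weighted_mean((p, pn) for p, pn, *_ in systems.values())
recall_total = weighted_mean((r, rn) for _, _, r, rn, *_ in systems.values())
retrieval_total = weighted_mean((t, tn) for *_, t, tn in systems.values())

print(precision_total, recall_total, retrieval_total)
```

The pooled values come out at about .2041 (N 551), .1675 (N 784), and .5623 (N 824), matching the Total row.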
Precision at positions 1-20
[Figure: precision (0.00 to 0.55) plotted against cutoff (1 to 20) for Open Directory, Yahoo Directory, Del.icio.us, Furl, Reddit, Google, Live, and Alta Vista.]
Recall at positions 1-20
[Figure: recall (0.00 to 0.45) plotted against cutoff (1 to 20) for Open Directory, Yahoo Directory, Del.icio.us, Furl, Reddit, Google, Live, and Alta Vista.]
Average performance at dcv 1-5
IR System Type   Statistic   Avg Precision   Avg Recall   Avg Retrieval Rate
Directory        Mean        0.2647          0.0183       0.2899
                 N           73              196          206
Folksonomy       Mean        0.1214          0.0119       0.5290
                 N           180             294          309
Search Engine    Mean        0.4194          0.1294       0.9631
                 N           298             294          309
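One plausible reading of "average performance at dcv 1-5" is the mean of a measure taken at each cutoff from 1 through 5, which rewards systems that place relevant results near the top. The slides do not spell out the calculation, so the sketch below is an assumption, shown for precision only, with hypothetical judgments.

```python
def precision_at(judgments, dcv):
    """Share of the top-dcv retrieved results judged relevant (None if none retrieved)."""
    top = judgments[:dcv]
    if not top:
        return None
    return sum(top) / len(top)


def average_precision_over_cutoffs(judgments, cutoffs=range(1, 6)):
    """Mean of precision at each cutoff; assumed reading of 'avg precision at dcv 1-5'."""
    values = [precision_at(judgments, k) for k in cutoffs]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None


# Hypothetical ranked judgments: relevant results at ranks 1 and 3.
early_hits = [True, False, True, False, False]
# Same number of relevant results, but at ranks 4 and 5.
late_hits = [False, False, False, True, True]

print(average_precision_over_cutoffs(early_hits))
print(average_precision_over_cutoffs(late_hits))
```

Both lists have the same precision at cutoff 5 (0.4), but the averaged measure is higher for the list whose relevant results appear earlier.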
Performance differences findings
• There are statistically significant differences among individual IR systems and IR system types.
• Search engines had the best performance by all measures.
• In general, directories had better precision than folksonomies, but the difference was usually not statistically significant.
• Del.icio.us performed as well as or better than the directories.
• Hypothesis 2 is supported.
Hypothesis 3: Performance for different needs
• Do folksonomies perform better than the other IR systems for some information needs, and worse for others?
Comparing information need categories
Info Need Category            IR System Type   Statistic   Avg Precision   Avg Recall   Avg Retrieval
Short Factual Answer          Directory        Mean        .218610         .009491      .349404
                                               N           12              28           28
                              Folksonomy       Mean        .060118         .007089      .601270
                                               N           28              42           42
                              Search Engine    Mean        .440501         .095157      .952381
                                               N           40              42           42
Specific Item                 Directory        Mean        .193333         .033187      .332540
                                               N           17              38           42
                              Folksonomy       Mean        .027187         .008421      .447513
                                               N           32              57           63
                              Search Engine    Mean        .353550         .268214      .968254
                                               N           61              57           63
Selection of Relevant Items   Directory        Mean        .304849         .015789      .264510
                                               N           44              130          136
                              Folksonomy       Mean        .160805         .013932      .539314
                                               N           120             195          204
                              Search Engine    Mean        .435465         .096227      .963644
                                               N           197             195          204
News and entertainment searches
Information Need   IR System Type   Statistic   Avg Precision   Avg Recall   Retrieval Rate
News               Directory        Mean        .000000         .000000      .069365
                                    N           4               40           42
                   Folksonomy       Mean        .154666         .016822      .573439
                                    N           40              60           63
                   Search Engine    Mean        .372350         .096911      .961640
                                    N           61              60           63
Entertainment      Directory        Mean        .302223         .021324      .241111
                                    N           6               16           18
                   Folksonomy       Mean        .136221         .016272      .483457
                                    N           15              24           27
                   Search Engine    Mean        .299065         .127569      .925926
                                    N           25              24           27
Factual and exact site searches
Information Need   IR System Type   Statistic   Avg Precision   Avg Recall   Retrieval Rate
Factual            Directory        Mean        .218610         .009491      .349404
                                    N           12              28           28
                   Folksonomy       Mean        .060118         .007089      .601270
                                    N           28              42           42
                   Search Engine    Mean        .440501         .095157      .952381
                                    N           40              42           42
Exact Site         Directory        Mean        .193333         .033187      .332540
                                    N           17              38           42
                   Folksonomy       Mean        .027187         .008421      .447513
                                    N           32              57           63
                   Search Engine    Mean        .353550         .268214      .968254
                                    N           61              57           63
Performance for different info needs findings
• Significant differences were found among folksonomies, search engines, and directories for the three info need categories.
• When comparing within info need categories, the search engines had significantly better precision. Recall scores followed a similar pattern, but the differences were not significant.
• Folksonomies did not perform significantly better for news and entertainment searches; but
• They did perform significantly worse than search engines for factual and exact site searches.
• Hypothesis 3 is only partly supported.
What other factors impacted performance?
• For the study as a whole, the use of query operators correlated negatively with recall and retrieval rate. Non-boolean operators correlated negatively with precision scores.
• When looking at just folksonomy searches, query operator use led to even lower recall and retrieval scores.
• Some specific cases were not handled by the folksonomies. A search for movie show times at a certain zip code (“showtimes 45248 borat”) had zero results on all folksonomies.
• Queries that were limited by geography and queries with obscure topics can perform poorly in folksonomies because users might not have added/tagged items yet.
User factors
• For the most part, user experience did not correlate significantly with performance measures.
• Expert users were more likely to have lower precision scores.
• The same correlation was found when correcting for query factors.
• Experienced users are probably less likely to deem something relevant.
Recommendations
• Further research is needed.
• Additional folksonomies should be studied as well.
• It might be useful to collect additional types of data, such as whether or not participants clicked through to look at sites before judging.
• Additional analysis on ranking would be interesting.
• Any similar study must also deal with difficult technical issues like server and browser timeouts.
Conclusions
• The overlap between folksonomy results and search engine results could be used to improve Web IR performance.
• The search engines, with their much larger collections, performed better than directories and folksonomies in almost every case.
• Folksonomies may be better than directories for some needs, but more data is required. Folksonomies are particularly bad at finding a factual answer or one specific site.
Conclusions (cont.)
• Although search engines had better performance across the board, folksonomies are promising because:
1. They are relatively new and may improve with time and additional users;
2. Search results could be improved with relatively small changes to the way query operators and search terms are used.
3. There are many variations in organization to be tried.
Future research
• Look at the difference between systems that primarily use tagging (Del.icio.us, Furl) and those that use ranking (Reddit, Digg).
• Which variations are more successful? Tags, titles, categories, descriptions, comments, and even full text are collected by various folksonomies.
• Where should weight be placed? Should a document that matches the query closely rank higher than one with many votes, or vice versa?
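The weighting question can be made concrete with a toy scoring function that blends a query-match score with a vote count through a single weight. Everything here is hypothetical: the function names, the log scaling of votes, the `max_votes` cap, and the example documents are illustrative choices, not features of any of the systems studied.

```python
import math


def blended_score(match, votes, weight=0.5, max_votes=1000):
    """Blend a query-match score in [0, 1] with a log-scaled popularity score.

    weight=1.0 ranks purely on query match; weight=0.0 ranks purely on votes.
    """
    popularity = min(math.log1p(votes) / math.log1p(max_votes), 1.0)
    return weight * match + (1 - weight) * popularity


def rank(docs, weight):
    """Return document labels ordered from highest to lowest blended score."""
    ordered = sorted(
        docs,
        key=lambda d: blended_score(d["match"], d["votes"], weight),
        reverse=True,
    )
    return [d["url"] for d in ordered]


# Hypothetical documents: one matches the query closely, one has many votes.
docs = [
    {"url": "close-match", "match": 0.9, "votes": 3},
    {"url": "popular", "match": 0.4, "votes": 800},
]

print(rank(docs, weight=0.9))  # emphasis on query match
print(rank(docs, weight=0.1))  # emphasis on votes
```

With the weight near 1 the close match ranks first, and with the weight near 0 the heavily voted document does, so the open question is which setting produces more relevant top results.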
Future research (cont.)
• Artificial situations could be set up to study absolute recall and searches for an exhaustive list of items.
• Similar studies on IR systems covering smaller domains, like video, should be done. Blog search systems in particular would be interesting.
• What about other IR behaviors such as browsing?
• There are many other fascinating topics, such as the social networks in some folksonomies and what motivates users to tag items.