
    International Journal of Database Management Systems ( IJDMS ), Vol.3, No.2, May 2011

DOI: 10.5121/ijdms.2011.3202

USING CONTEXT TO IMPROVE THE EVALUATION OF INFORMATION RETRIEVAL SYSTEMS

    Abdelkrim Bouramoul1, Mohamed-Khireddine Kholladi1 and Bich-Lien Doan2

1 Computer Science Department, MISC Laboratory, University of Mentouri Constantine, B.P. 325, Constantine 25017, Algeria

    [email protected], [email protected].

    2Computer Science Department, SUPELEC. Rue Joliot-Curie, 91192 Gif Sur Yvette, [email protected]

    ABSTRACT

The crucial role of evaluation in the development of information retrieval tools provides useful evidence for improving the performance of these tools and the quality of the results they return. However, classic evaluation approaches have limitations and shortcomings, especially regarding the consideration of the user, the measure of the adequacy between the query and the returned documents, and the consideration of the characteristics, specifications and behavior of the search tool. We therefore believe that the exploitation of contextual elements could be a very good way to evaluate search tools. This paper presents a new approach that takes context into account during the evaluation process at three complementary levels. The experiments reported at the end of this article show the applicability of the proposed approach to real search tools. The tests were performed with the most popular search engines (i.e. Google, Bing and Yahoo), selected in particular for their high selectivity. The obtained results reveal that the ability of these engines to reject dead links, redundant results and parasite pages depends strongly on how queries are formulated, and on the policy of the sites offering this information regarding how they present their content. The evaluation of the relevance of the results provided by these engines, first using the user's judgments and then using an automatic method that takes the query context into account, also shows a general decline in perceived relevance as the number of considered results increases.

    KEYWORDS

Contextual Evaluation, Evaluation Campaigns, Relevance Judgments, Information Retrieval, Web Search Engine.

1. INTRODUCTION

Information Retrieval (IR) is now an activity of great importance, to the extent that it has become one of the most important actors in the rapid development of new information and communication technologies. It must be possible, among the large volume of documents available, to find those that best fit our needs in the shortest time. For this purpose, information retrieval tools have been developed to help locate information in a closed corpus of documents or among all the documents available on the web.

    document available on the web. Consequently several questions arise about these informationretrieval tools, particularly in terms of their performance and the relevance of the results that theyoffer.

Our work therefore falls within the field of the evaluation of information retrieval systems, and more specifically within contextual evaluation. After a deep investigation of research and synthesis activities, we realized that despite the abundant literature produced in this area, dealing both with experimental results and with methods that provide evaluation criteria and metrics of


current search session, or the sense of an ambiguous word using a thesaurus or ontology. In this category we quote the work of [7], which uses an ontology with equivalence and subsumption relationships to extract terms to be added to the initial query.

Another, simpler way to use the context in a pre-search phase is to use it to introduce boolean constraints into existing information retrieval algorithms. These algorithms can also consider the spatiotemporal context, in which continuous values can be described in a non-specific manner at different granularity levels [8]. For example, an event can take place at 9:57, at about 10 am, or in the morning. In this case, the context can be used to select the appropriate representation.

    2.2.2. During the search process

The context can also be considered in the interactions with the system. Indeed, in an information retrieval process, it is the interaction that makes the real exploitation of the displayed results possible. The user is particularly adept at extracting information from an environment that he controls directly and actively, compared to an environment that he can only observe passively [9]. The context at this level depends on the user's action in a given situation, on the feedback, on the relevance judgments that are related to the characteristics of different users' situations, on multidimensional search strategies, and on other informational practices in information retrieval.

    2.2.3. At the end of the research process

The context may finally be considered in a post-search phase, after obtaining the results, by using the relevance feedback principle. The idea of this technique is to perform a first search using only the query terms; the user can then indicate which of the best documents of this first search are relevant and which are not, and the system uses this information to refine the search by changing the weights of the query terms using an automatic learning method, as in the work of [10]. Another way to use the context with relevance feedback has been proposed more recently in our work [3], where we propose a contextual query reformulation based on user profiles, using the concept of static and dynamic context to minimize the user's intervention in the reformulation process.

3. RELATED WORK

The classic evaluation of information retrieval systems is based on the performance of the systems themselves; it is quantitative and is based on work done in the sixties at Cranfield (United Kingdom) on indexing systems [2]. This type of approach provides a basis for the comparative evaluation of the effectiveness of different algorithms, techniques and/or systems through common resources: test collections containing documents, previously prepared queries and associated relevance judgments, and finally evaluation metrics essentially based on recall and precision [11].

    3.1. Evaluation campaigns

The evaluation campaign represents the current dominant model. Indeed, it is on the experience of the Cranfield tests that the NIST (National Institute of Standards and Technology) based the creation of the TREC evaluation campaign (Text REtrieval Conference) in 1992. The TREC campaigns have become the reference in the evaluation of systems, but we can also quote the CLEF campaigns (Cross-Language Evaluation Forum), which specifically relate to multilingual systems, the NTCIR campaigns on Asian languages, and Amaryllis, specializing in French systems.


    3.1.1. The TREC evaluation campaign

This is a series of annual evaluations of information retrieval technologies. TREC is an international project initiated in the early 90s by the NIST (an institute in the United States), in order to propose homogeneous means for evaluating documentation systems on a consistent base of documents. The participants are usually researchers from large companies that offer systems and want to improve them, small vendors that specialize in information retrieval, or academic research groups.

TREC is now considered the most important development in experimental information retrieval. The TREC program has had a very important impact on the field, and it remains the most cited and used by the information retrieval community. The main explored tracks are filtering, search (or ad hoc task), interactive, web and question-answering. For 2010, TREC focused on the following tracks: the blog, chemical IR, entity, legal, relevance feedback, and session tracks.1

    3.1.2. The CLEF campaign

The European project for evaluating information retrieval systems, called CLEF (Cross-Language Evaluation Forum), was launched in 2000. The objective of the CLEF project is to promote research in the field of multilingual system development. This is done through the organization of annual evaluation campaigns in which a series of tracks designed to test different aspects of mono- and cross-language information retrieval are offered. The intention is to encourage experimentation with all kinds of multilingual information access, from the development of systems for monolingual retrieval operating on many languages to the implementation of complete multilingual multimedia search services. This has been achieved by offering an increasingly complex and varied set of evaluation tasks over the years. The aim is not only to meet but also to anticipate the emerging needs of the R&D community and to encourage the development of next-generation multilingual IR systems.

CLEF 2009 offered eight main tracks designed to evaluate the performance of systems; the most important of these tasks are multilingual textual document retrieval, interactive cross-language retrieval, cross-language retrieval in image collections, intellectual property, and log file analysis [12].

    3.2. Limits of classic approaches for evaluating IRS

Despite the popularity and recognition of these two evaluation campaigns, TREC and CLEF, these approaches for evaluating information retrieval systems have some limits, particularly with regard to the consideration of the user and the constitution of the query corpus, but also regarding the evaluation itself.

To better identify the limits of classic approaches for evaluating information retrieval systems, we relied on the work of [1], [2] and [13]. A synthesis of this work has allowed us to define three classes of problems, each related to an actor who is generally present around an evaluation process: the limits related to the absence of the user in the evaluation process, those related to the relevance judgments, and finally the limits related to the corpus of documents and queries.

    1 TREC web site : http://trec.nist.gov/


    3.2.1. Limits in relation to the user

We can reproach these evaluation approaches for being artificial and arbitrary. While TREC has effectively improved the efficiency of systems, the notion of the end user implies personal knowledge, experience and different research capabilities, about which the system evaluation does not care. Indeed, such evaluations ignore the context in which the search is conducted since they are not performed in real use situations. In this respect, [13] asserts that the absence of the user in the evaluation process is one of the first and probably most important critiques of classic approaches, since users apply criteria other than recall and precision when they initiate or end a search session.

    3.2.2. Limits in relation to judgments of relevance

Relevance is a subjective notion and it seems unthinkable to measure it without being arbitrary. We also note that the relevance judgments in TREC operate in a binary manner: a document is considered either relevant or irrelevant. Yet this is obviously not always the case; some documents are more relevant than others which are also relevant. These degrees of relevance are furthermore dependent on the mindset of the person who actually needs these documents. This finding is validated by the work of [1], which shows that the relevance considered in the classic evaluation of IRS is thematic, independent of the context, of the search situation and of the interests of users. Similarly, the work of [13] has shown that relevance judgments should be revisited, in the sense that they are assumed to be stable, not to vary over time, and to be assigned independently of each other.

    3.2.3. Limits in relation to the corpus of documents and of queries

In traditional corpora, a document is a text in itself, and the evaluation is made with respect to the number of documents found; but in general, a user is not looking for documents but for information, and documents never contain the same amount of information. The same applies to query corpora, where the query is an information need expressed in natural language. However, the representation of the user's information need is itself a problem. The IR task becomes a task of knowing how to ask questions to these systems, because the differences are significant between what we think and what is interpreted. [1] notes that in the batch mode of evaluation protocols, queries are assumed to represent the user on their own. Consequently, the actual users who issued these queries, their interests and their interactions with the IRS do not form part of the collection.

This critical finding prompted our reflections on an appropriate approach for the contextual evaluation of information retrieval systems. In the rest of this paper, we describe our approach for taking context into consideration during the evaluation process.

4. DETAILED PRESENTATION OF THE PROPOSED APPROACH

Our evaluation approach consists in evaluating the performance of the tool used for information retrieval and measuring the quality of the services that it offers on one side, and on the other side evaluating the relevance of the results that it returns. It takes the user into consideration during the evaluation, in the sense that he contributes to the evaluation process by giving his relevance judgment according to his information need. The proposed approach therefore consists of three parts: evaluation of the performance of the search tool, evaluation of the relevance of the results compared to the query, and finally evaluation of the relevance by the user's judgments. Figure 1 summarizes the three levels of evaluation and illustrates the link between the context type and the evaluation level.


    Figure 1. Link between the context type and the evaluation level

We chose to consider three types of context, modelled in our approach by three complementary evaluation levels:

- System context: at this level, it is a matter of diagnosing the performance, characteristics, specifications and behavior of the search tool for the considered query.

- Query context: this consists in measuring, in an incremental way, to what extent the returned results reflect the user's information need.

- User context: in addition to the score given by the system, it is a matter of answering the question of how the user appreciates the results. This information is subsequently capitalized as a history for reuse in future evaluation sessions.

    4.1. Evaluation of performance of the search tool

This is the first component of our approach. The evaluation at this level is based on a number of criteria summarizing the problems generally encountered by users during a search session. The criteria that we have defined depend on the nature of the manipulated information, on the source of this information, and finally on the mechanism used to retrieve this information. The values assigned to these criteria are automatically calculated by the system as soon as the results provided by the search tool are obtained. The estimation of these values subsequently gives an overview of the quality of the search tool, independently of the relevance of the results that it returns. These criteria are the following:

Redundant results: this involves measuring the ability of the search tool to discard redundant results. This means that the search tool should return only once results coming from the same site but with different pages.

Dead links: a dead link is a link that leads to a page that does not exist, because it has been moved or deleted. In general, the browser returns in this case the error code 404. Evaluating this criterion consists in measuring the ability of the search tool to detect such links.

Parasite pages: these include advertising pages and pages containing, for example, only promotional links. Such pages provide no useful information to the user and generally constitute false results. Their elimination depends on the performance of the search tool's crawler, and hence on the quality of the algorithms used by each search engine.

Response time: this is the time consumed by the search engine to return the query's results; it is one of the most important aspects. The shorter the response time, the better the performance of the search tool.

[Figure 1 diagram labels: external levels (User, Search tool, Information), context types (User context, Query context, System context) and the corresponding evaluations (evaluation of the relevance by the user's judgments, evaluation of the relevance of results compared to the query, evaluation of the performance of the search tool).]


    4.2. Evaluation of the relevance compared to the query

This is the second part of our contextual evaluation approach. It consists in weighting the query words against the words of the returned documents, by increasing the number of terms. This involves first choosing the weighted terms, and then applying the formula that we propose in an incremental way with respect to the number of words forming the query.

    4.2.1. Weighted terms choice, an incremental weighting

In an information retrieval process, queries are created by the user; they reflect an information need, and they are composed of one or more words depending on what is necessary to satisfy the noted deficiency in information, a lacuna or a defect. The groups of words in a query are often semantically richer than the words that compose it taken separately, and can therefore better respond to what users expect.

In our approach, we have chosen to define several hierarchical levels during weighting, according to the number of words forming the query. Each level is composed of one or more words (a group of words), starting from the query formulated by the user. Incremental weighting by increasing the number of query terms, instead of a classic weighting of each word separately, allows the query context to be better taken into consideration during the evaluation. For example, assuming that the query sent by the user is "contextual evaluation of information retrieval systems", documents containing the group of words "contextual evaluation of information" or "contextual evaluation" are certainly closer to what the user expects than those in which we find the words "contextual", "evaluation", "information", "retrieval" or "systems" taken separately.
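To make the hierarchical grouping concrete, the short Python sketch below generates the word groups that would be weighted for a given query, from the full query down to a single word. The function name and the prefix-based grouping are our own illustrative reading of the description above, not code from the paper.

```python
def incremental_groups(query):
    """Build the hierarchy of word groups described in section 4.2.1:
    the whole query first, then its first n-1 words, and so on down to
    a single word (one plausible reading of the incremental weighting)."""
    words = query.split()
    return [" ".join(words[:k]) for k in range(len(words), 0, -1)]

# Example from the text:
print(incremental_groups("contextual evaluation of information retrieval systems"))
# ['contextual evaluation of information retrieval systems',
#  'contextual evaluation of information retrieval',
#  'contextual evaluation of information',
#  'contextual evaluation of',
#  'contextual evaluation',
#  'contextual']
```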

4.2.2. Relevance calculation, a contextual formula

Once the groups of words to be weighted are defined, a weight must be assigned to each of them to determine its importance in the document. We have therefore developed a weighting formula that takes into account the context of the query in terms of the number of words composing it. This formula is inspired by the TF-IDF weighting [14], to which we added two dimensions: the document length and the hierarchy of word groups according to the length of the query. It is therefore incremental and is defined as follows:

[The formula itself is given as an image in the source document and is not reproduced in this transcript.]

With:
- R: the set of query terms;
- R': the terms of the word group to be weighted;
- W(R', D): the frequency of R' in the document D;
- Length(R): the length of the query;
- Length(R'): the length of the word group to be weighted;
- Length(D): the length of the document;
- TNRD: the total number of returned documents;
- NDWGR: the number of documents containing R'.
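Since the formula cannot be recovered from the text, the expression below is only an illustration of how the listed quantities could be combined in a TF-IDF-inspired score with the two added dimensions (document length and word-group hierarchy); it is a sketch consistent with the variable list, not the authors' actual formula:

$$\mathrm{Relevance}(R', D) \;=\; \frac{W(R', D)}{\mathrm{Length}(D)} \times \frac{\mathrm{Length}(R')}{\mathrm{Length}(R)} \times \log\frac{\mathrm{TNRD}}{\mathrm{NDWGR}}$$

In this hypothetical form, the first factor is a length-normalized term frequency, the second rewards longer word groups (the hierarchy dimension), and the third is an IDF-style factor over the returned documents.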

    4.3. Evaluation of the relevance by the user's judgments

When an IRS returns a document to the user, the latter recovers information. Whether this information is important for a given user, and whether the same information takes on a greater or lesser importance or generates more or less interest, depends on the individual and the context of use. Information therefore has importance for a given user in a given context, and it is the user who determines the actual adequacy of the results returned by the search tool with his information need. Based on this principle, and to allow the consideration of the user's judgments during the evaluation, we use an adaptation of our approach proposed in [3], which consists in modelling the user by a static and a dynamic context. The migration of our approach from taking context into account in information retrieval to its consideration in the evaluation process requires a redefinition of the concepts of static and dynamic context to make them usable for evaluation.

    4.3.1. Static context

These are the personal characteristics of the user that can influence the search context. This information is stored in the user context base during the first connection to the system. For this purpose, we have identified four categories of information relating to the static context, summarized as follows (see the sketch after this list):

- Connection parameters: e-mail and password.
- Personal characteristics: name, country, language, ...
- Interests and preferences: domains, specialty, ...
- Competence and expertise level: profession, level of study, ...
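Purely as an illustration, the four categories above could be stored as a simple record; the field names below are hypothetical and are not the authors' schema:

```python
from dataclasses import dataclass, field

@dataclass
class StaticContext:
    # Connection parameters
    email: str
    password_hash: str
    # Personal characteristics
    name: str = ""
    country: str = ""
    language: str = ""
    # Interests and preferences
    domains: list[str] = field(default_factory=list)
    specialty: str = ""
    # Competence and expertise level
    profession: str = ""
    study_level: str = ""
```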

After having recovered the static context, the user can formulate his query, and the search tool takes charge of returning suitable results.

    4.3.2. Dynamic context

In order to optimize the reuse of the user's judgments and facilitate their understanding, this second component of the context aims to associate the relevance judgments with the user's context. The principle is as follows: at the end of each search session, the dynamic context is recovered by allowing users to express their relevance judgments regarding the documents returned by the search tool. The user expresses this judgment by voting on a scale from 0 to 5, where 0 corresponds to a document that is completely useless or off-topic, and 5 corresponds to a document that responds perfectly to the asked query. The evaluation is activated automatically whenever the user expresses a judgment. Finally, based on the relevance judgments assigned by the user, the system recalculates the relevance value of a result, and the evaluation of the search tool is carried out by updating the base of user contexts.
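The paper does not specify the exact rule that combines the 0-5 votes with the automatic score; the sketch below shows one simple possibility (normalizing the votes and averaging them with the system score), purely to illustrate the mechanism:

```python
def updated_relevance(system_score, user_votes):
    """Recompute a result's relevance from 0-5 user judgments.
    Illustrative only: the combination rule is not given in the paper."""
    if not user_votes:
        return system_score
    avg_vote = sum(user_votes) / len(user_votes) / 5.0   # map 0-5 onto 0-1
    return (system_score + avg_vote) / 2.0

# Example: automatic score 0.6, two users voted 4 and 5
print(updated_relevance(0.6, [4, 5]))   # 0.75
```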

5. APPLICATION OF THE PROPOSED APPROACH TO THE EVALUATION OF SEARCH ENGINES

To prove the applicability of the proposed approach, we used it for the contextual evaluation of search engines. Our choice fell on three search engines (Google, Yahoo and Bing). This choice is motivated by their popularity in the web community on the one hand, and by the effectiveness of their search cores and the degree of coverage that they provide in response to a request on the other hand.

We therefore propose to set up a system conducting an open search on the web and then performing the evaluation of the results returned by each search engine. To this end, we use the three levels of the contextual evaluation approach that we have proposed. This system should allow:


1) Submitting the same set of queries to the three search engines Google, Yahoo and Bing;

2) Retrieving the results returned by each search engine;

3) Checking the informational content of all the resulting pages;

4) Capturing the user's static and dynamic context for the current search session, and using it for the evaluation of the results by the user's judgment;

5) Measuring the degree of relevance of the results returned by each engine, taking into account the context of the query through the incremental application of the proposed formula;

6) Diagnosing the performance, characteristics, specifications and behavior of each search engine, taking into account its context, in accordance with what has been proposed in the third level of our approach;

7) Coupling the relevance scores obtained at the three evaluation levels for each search engine, thus obtaining the final evaluation.

The system consists of two main modules: a first module for managing the interactions between the user and the search engine (identification and search), and a second module which covers the three levels of evaluation described in our proposal. These two modules are closely interrelated, in the sense that the outputs of one module are the inputs of the other. In what follows, we present the modules composing the system and illustrate the functionalities offered by each of them.

5.1. User / search engine interaction management module

We are interested in evaluating the quality of search engines and the relevance of the results that they return. A preliminary phase to this evaluation is absolutely necessary: it involves capturing the user's information need in the form of a query and then interrogating the selected search engine to retrieve the results to be evaluated. The user / search engine interaction management module supports all interactions between the user and the search engine, from the connection to the system until the delivery of the results.

It takes care of capturing the user's static context and managing his identification. It also manages the transmission of the user's query to the search engine and the retrieval of the results, and finally it communicates these results to the evaluation module. This module consists of two complementary processes:

    5.1.1. The static context capturing process

The static context, previously defined during the presentation of our approach, is represented by the user profile. The latter is the source of knowledge defining all the aspects of the user that can be useful for the system's behavior. The user profile data comprising the static context can be indicated by the user himself, learned by the system during use, or indicated by selecting an existing profile created by experts. In our case, we construct the static context of the user at the first connection to the system. This construction is done by asking the user to fill in the four categories of information defined previously.

The categorization of users has the advantage of providing typical information with the opportunity to refine it over time. Once the identification is made, the user can conduct an open search on the web.

    5.1.2. The search process

We opted for a system that offers an open search on the web using the following principle: after connecting to the system, the user expresses his information need as a query. The search process therefore takes the query as input and gives the user the ability to choose one of the three search engines that the system proposes (Google, Yahoo, and Bing). The search operation is initiated by running in parallel the core of each search engine, with the user query as the only parameter. The obtained result is finally communicated to the user and to the evaluation module. This process also calculates the response time of each search engine. Figure 2 shows the module for managing the interactions between the user and the search engine and illustrates the operating principle of its two processes.

Figure 2. User / search engine interaction management module.
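A minimal sketch of this search process is given below. It assumes a hypothetical search callable per engine (the real clients that query Google, Yahoo and Bing are not shown); it runs the engines in parallel and records each response time, as described above.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def timed_search(name, search_fn, query):
    """Run one engine's (hypothetical) search function and time it."""
    start = time.perf_counter()
    results = search_fn(query)
    return name, results, time.perf_counter() - start

def search_all(engines, query):
    """Query every engine in parallel; `engines` maps a name to a callable."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(timed_search, name, fn, query)
                   for name, fn in engines.items()]
        return [f.result() for f in futures]
```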

    5.2. Evaluation module

To proceed with the evaluation of the three search engines, the system retrieves the results of each of them and performs their analysis. The contextual evaluation module consists of three processes representing the three levels of evaluation of the approach that we propose. These processes are respectively: a first process for the performance evaluation of the search engine, a second process for the automatic evaluation of the relevance of the results returned by this engine, and finally a process for the evaluation of the relevance by the user's judgments. Figure 3 summarizes the evaluation approach applied to search engines.

    5.2.1. The performance evaluation of the search engine process

This process diagnoses the performance and characteristics of each search engine based on the criteria developed in our approach. It takes place according to the following steps:

Extraction of the link list: as soon as the search engine displays the results in response to a user query, the system automatically retrieves the list of URL links related to each result and performs the appropriate treatment according to the page content.

Detection and counting of redundant links: this concerns the analysis of the link list to detect those that are redundant and to count them. If there is no redundant link, the note is equal to the number of analyzed links; otherwise the note decreases by one for each redundant link.
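A sketch of this step is shown below, under the assumption (taken from section 4.1) that "redundant" means several results pointing to the same site, read here as the same host; this reading and the function name are ours, not the paper's code.

```python
from urllib.parse import urlparse

def redundancy_note(urls):
    """Note for the redundant-results criterion: number of analysed
    links minus the number of links whose host was already seen."""
    seen, redundant = set(), 0
    for url in urls:
        host = urlparse(url).netloc
        if host in seen:
            redundant += 1
        else:
            seen.add(host)
    return len(urls) - redundant
```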

Detection and counting of dead links: the detection of such links is done by opening a connection to each of the recovered links; if the open operation fails, the link is considered dead. For assigning the final note, the principle is the following: count the dead links and assign a note of 0 to a link that is dead and a note of 1 otherwise.
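A sketch of the dead-link test follows, assuming the retry behavior mentioned later in section 6.2.1 (up to three attempts); the delay between attempts is omitted here, and the function is illustrative rather than the authors' implementation.

```python
import urllib.request
import urllib.error

def is_dead(url, attempts=3, timeout=10):
    """Open a connection to the URL; a 404 answer, or failure on every
    attempt, marks the link as dead (per-link note: 0 if dead, 1 otherwise)."""
    for _ in range(attempts):
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return False                       # page opened normally
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return True                    # explicit "page not found"
        except (urllib.error.URLError, OSError):
            pass                               # connection failure: retry
    return True
```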

Detection of parasite pages: these are pages that do not contain any of the query terms in the returned results. The detection operation consists in counting the occurrences of each query word in the documents: if the frequency of every word is equal to 0, the result is considered a parasite page.
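The corresponding check is straightforward; the sketch below applies the rule literally (no query word occurs in the page text) and uses simple substring matching as a simplifying assumption.

```python
def is_parasite(page_text, query):
    """A result is a parasite page if none of the query words occurs
    in its textual content."""
    text = page_text.lower()
    return all(word.lower() not in text for word in query.split())
```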

    Figure 3. Summary of the evaluation approach applied to search engines

    5.2.2. The automatic evaluation of the returned results relevance process

This process is concerned with automatically measuring how well the results delivered by the search engine match the user's information need. Consistent with the approach that we propose, it unfolds according to the following steps:

Extraction of the textual content: this operation is carried out from the previously retrieved link list. The idea is to open the web page corresponding to each URL and to retrieve its textual content using a parser developed for this purpose. We have implemented a parser for each search engine because the HTML tags differ from one engine to another. The extracted content is sent for analysis.

Incremental weighting of terms: once the textual content is retrieved, the occurrence frequency of the query terms in the different returned documents is calculated. The occurrence calculation for each hierarchy level of the considered query respects the number of words forming the query. In other words, the n words composing the query are first regarded as a single term and its frequency in each result is calculated; then the first n-1 query words become the considered term and their frequency is also calculated, and the operation continues until only one word remains, whose frequency is calculated, at which point the incremental weighting comes to an end. In the case where the user wants to search with the exact expression, the frequency calculation is carried out once, with the entire query as a single term.
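The following sketch computes the per-level frequencies for one document, using the same prefix-based reading of the hierarchy as the grouping sketch in section 4.2.1; this reading, like the function itself, is an assumption rather than the authors' code.

```python
def level_frequencies(query, document_text):
    """Occurrence frequency of each hierarchy level of the query in one
    document: the whole query first, then its first n-1 words, and so on."""
    words = query.lower().split()
    text = document_text.lower()
    return {" ".join(words[:k]): text.count(" ".join(words[:k]))
            for k in range(len(words), 0, -1)}
```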

The formula application: this stage of the evaluation process consists in applying the formula developed in our proposal. For each hierarchy level, this formula takes as input the group of words forming that level of the query, its occurrence frequency, its length, the document length, the query length, the total number of analyzed documents and the number of documents containing that word group. It produces as output a weight representing the relevance of the result with respect to the query.

5.2.3. The evaluation of the relevance by the user's judgments process

This process consists in engaging users in real search situations. In this context, it is the user's relevance judgment that determines the performance of the search engine. According to our evaluation approach, each user is characterized by his static context, defined at the first login to the system. After login, the system retrieves the judgments made by the user in previous search sessions to update the dynamic context of judgments. The latter is constructed progressively, at the end of each search session, by allowing the user to give his opinion after having consulted the returned results. Finally, based on the user's relevance judgments, the process recalculates the results' relevance score, and the evaluation of the search engine is then updated.

6. RESULTS AND DISCUSSION

    6.1. The used protocol

To measure the contribution of our approach to the evaluation of search engines, we used an extension of the evaluation scenario proposed in [15]. The evaluation was conducted with the help of 24 students from the second year of the STIC licence (Science and Technology of Information and Communication) at the Mentouri Constantine University, playing the role of users. The goal was not to make an evaluation by experts but by a general public reasonably familiar with search engines. Six topics were chosen to reflect diverse fields of use: News, Animals, Movies, Health, Sports and Travel. Each topic was assigned to a group of 4 students, who freely chose 5 queries. For example, for the sports topic, the chosen queries were as follows:

- World Cup 2010
- France cycling tour
- Formula 1 racing cars
- Famous football players
- Roland-Garros tournament

The queries were submitted to the different engines, and the first two result pages, containing 20 results, were archived for each query and each search engine. In total, 1800 URLs were retrieved (6 topics x 5 queries x 20 results x 3 search engines) and organized in the form of triples (query, URL, page content). Finally, the set of triples was communicated to the system for analysis and evaluation.

    6.2. Performance of search engines (system context)

We present in Table 1 the scores obtained for the performance evaluation of the three search engines, and Figure 4 gives a graphical interpretation of these results.


    Table 1. Search engines performance evaluation

Search engines   Dead Links   Parasite Pages   Redundant Results   Average Response Time
Google           2.03%        5.30%            4.04%               0.17 sec
Yahoo            2.13%        10.19%           4.81%               0.21 sec
Bing             1.67%        8.64%            5.32%               0.22 sec

    Figure 4. Search engines performance evaluation.

    6.2.1. Results analysis for dead links

The rate of dead links is low. This is explained partly by the fact that the automatic procedure used makes up to three attempts, separated by a delay of a few minutes, on failure, and secondly by the fact that a number of servers do not return the error code 404 "Page not found" when the page no longer exists, but a normal HTML page with an ad hoc message, which can be interpreted as an error only by a human reader. We also note that 71% of the dead links returned by Yahoo and 79% of those returned by Google are caused by the Amazon web site which, for unknown reasons, returned an error code during the experiment. Finally, Bing obtained the best score with only 1.67% of dead links.

6.2.2. Results analysis for the parasite pages

Links referring to commercial sites offering online purchases or transactions were considered as parasites. The obtained scores varied depending on the search engine, and we notice that the engines have different strategies to exclude parasite pages. Among the commercial sites that appear several times, we notice two companies: Amazon and eBay. Their association with the different engines is interesting to study: Google and Yahoo are strongly associated with Amazon, while Bing prefers eBay. Overall, it is Google that returns the fewest links to commercial sites, with 5.30%.


    6.2.3. Results analysis for redundant results

We find that the ability of the three search engines to eliminate redundant results varies according to the type of query. The results also showed that the majority of the redundant links returned by Google and Yahoo come from the use of Wikipedia. Of the 20 analyzed results, Google returned 4.04% redundant links, of which 80% come from Wikipedia, and Yahoo 4.81% redundant links, of which 78% come from Wikipedia. The results also showed that some web sites offer a link type named alias to avoid redundant links. An alias link is a copy of a main link, with the same URL, but it is not considered by search engines as an attempt to index content abusively.

    6.2.4. Results analysis for the average response time

This criterion measures the time consumed by the search engine from the query transmission until the results are displayed; it depends heavily on the internet connection speed and the power of the machine. To ensure homogeneity when calculating the response time, all queries were tested on the same machine with the same internet connection speed. The obtained results show that the average response time is almost identical for the three search engines. However, we note that Google tops the list with an average of 0.17 seconds, which may be explained by the power of the PageRank algorithm used by this engine.

    6.3. Relevance by the user's judgments (user's context)

We are interested in the relevance judgments given by the user to the first result returned by each search engine (R@01). The latter is of particular importance, since it is the link most likely to be clicked by users. The 24 students also expressed their relevance judgments for the first 5, 10, 15 and 20 retrieved documents (R@05, R@10, R@15, R@20). At each relevance level, a note from 0 to 5 was assigned by each student, 0 corresponding to a document completely useless or off-topic, and 5 corresponding to a document responding perfectly to the question. Table 2 shows the obtained scores.

    Table 2. Evaluation of the relevance by the user's judgments

Relevance level   Google   Yahoo   Bing
R@01              3.15     2.92    2.70
R@05              2.79     2.14    2.58
R@10              2.34     2.51    2.16
R@15              2.00     1.83    1.72
R@20              1.91     1.77    1.69

The overall scores obtained by each search engine over the 20 results are extremely low, since no engine reaches the average note of 2.5 at R@20. The search engine that obtained the best note, 1.91, is Google. The situation improves remarkably if one considers only the first result (R@01): the three search engines then exceed the average.


    Figure 5. Evaluation of the relevance by the user's judgments

Figure 5 shows the average note according to the relevance level of the results for each search engine. We find a general decline in perceived relevance as the number of considered results increases, except for Yahoo, which rises back towards the average when the relevance of the first 10 results (R@10) is considered, suggesting either that the ranking algorithm of this engine is not optimal, or that the result is disturbed by the merging of commercial web sites.

6.4. Relevance of results according to the query (query context)

Using our formula, we calculated the relevance of the first 20 returned results according to each of the 30 queries, and this for the three search engines. An average note for each group of 5 queries on the same topic was then calculated, and the obtained score was rounded to a note out of 10. The overall results are summarized in Table 3, and Figure 6 gives a graphical interpretation of these results.

    Table 3. Evaluation of the results relevance according to the query

Queries category        Google   Yahoo   Bing
News    (R01-R05)       6.91     6.77    6.19
Animals (R06-R10)       5.25     6.13    5.87
Movies  (R11-R15)       5.72     5.13    5.67
Health  (R16-R20)       4.98     4.83    4.66
Sports  (R21-R25)       5.93     5.89    5.16
Travel  (R26-R30)       6.19     6.09    6.10


    Figure 6. Evaluation of the results relevance according to the query

The analysis of the obtained results shows that the Google search engine ranks first in terms of the relevance of results according to the query, and this for 5 of the 6 available query categories. This finding may be explained by a possible match, or an unintended complicity, between the formula that we proposed and the mechanism Google uses to rank results. We also note that the scores of the Health category are below average for the three search engines; this is due to the fact that the queries in this category contain few words, which decreases the number of words for which we calculate occurrences and thus weakens the final score.

7. CONCLUSION

In this paper we have proposed a new approach based on context for evaluating information retrieval systems. A deep investigation of the work done in the field of the classic evaluation of this type of system allowed us to identify the limits and shortcomings encountered during the evaluation process. We have therefore defined three classes of problems, each class being related to an actor that we generally find around the evaluation process. These limits are essentially those related to the absence of the user during the evaluation, those related to the relevance judgments, and finally the limits related to the corpus of documents and queries.

Our main contribution consists in the consideration of context during the evaluation at three complementary levels. First, the context of the system is considered by estimating the ability of the search tool to eliminate dead links, redundant results and parasite pages. At a second level, our approach takes into account the query context, based on an incremental formula for calculating the relevance of the returned results according to the submitted query. The last level of the approach takes into consideration the user's judgments via his static and dynamic context. Finally, a synthesis of the three levels of contextual evaluation was proposed.

The application of the proposed approach to the evaluation of search engines served to demonstrate its applicability to real search tools. This study, which is certainly far from exhaustive, nevertheless gives a snapshot of search engine performance and of the relevance of the results they return. We note also that nothing in this study helps to explain the massive user preference for the Google search engine because, overall, Google and Yahoo have roughly equivalent performance. We must therefore assume that the reasons lie in criteria other than pure relevance.


Finally, this study paves the way for diverse perspectives, the most important of which is to enlarge the application field of the realized research. It would be interesting to test the proposed approach for evaluating personalized search tools and to enrich the results obtained with search engines.

REFERENCES

[1] L. Tamine, M. Boughanem, and M. Daoud, "Evaluation of contextual information retrieval effectiveness: overview of issues and research", Journal of Knowledge and Information Systems, Vol. 24, Issue 1, pp. 1-34, Springer, London, United Kingdom, July 2010.

[2] D. Menegon, S. Mizzaro, E. Nazzi, and L. Vassena, "Benchmark evaluation of context-aware Web search", in Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (in conjunction with the European Conference on Information Retrieval, ECIR), Toulouse, France, Springer, April 2009.

[3] A. Bouramoul, M.K. Kholladi, and B.L. Doan, "PRESY: A context based query reformulation tool for information retrieval on the Web", Journal of Computer Science, Vol. 6, Issue 4, pp. 470-477, ISSN 1549-3636, New York, USA, April 2010.

[4] P. Brézillon, "Making context explicit in communicating objects", in C. Kintzig, G. Poulain, G. Privat, P.-N. Favennec (Eds.), Communicating with Smart Objects, London: Kogan Page Science (Book Chapter 21), pp. 273-284, 2003.

[5] T. Winograd, "Architectures for context", Human-Computer Interaction, Vol. 16, pp. 402-419, L. Erlbaum Associates Inc., Hillsdale, NJ, USA, December 2001.

[6] N. Belkin, G. Muresan, and X. Zhang, "Using user's context for IR personalization", in Proceedings of the ACM/SIGIR Workshop on Information Retrieval in Context, 2004.

[7] R. Navigli and P. Velardi, "An analysis of ontology-based query expansion strategies", in Proceedings of the Workshop on Adaptive Text Extraction and Mining, Dubrovnik, Croatia, 2003.

[8] Y. Tao, N. Mamoulis, and D. Papadias, "Validity information retrieval for spatio-temporal queries: theoretical performance bounds", in Proceedings of the 8th International Symposium on Spatial and Temporal Databases (SSTD), LNCS 2750, Santorini Island, Greece, July 24-27, 2003.

[9] C.T. Lopes, "Context features and their use in information retrieval", in Third BCS-IRSG Symposium on Future Directions in Information Access, Padua, Italy, September 2009.

[10] H.-C. Lin and L.-H. Wang, "Query expansion for document retrieval based on fuzzy rules and user relevance feedback techniques", Expert Systems with Applications, 31(2), pp. 397-405, 2006.

[11] M. Daoud, L. Tamine, and M. Boughanem, "A contextual evaluation protocol for a session-based personalized search", in Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (in conjunction with the European Conference on Information Retrieval, ECIR), Toulouse, France, Springer, April 2009.

[12] C. Peters, "What happened in CLEF 2009", in Multilingual Information Access Evaluation I: Text Retrieval Experiments, LNCS Vol. 6241, 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Corfu, Greece, September 30 - October 2, 2009.

[13] S. Chaudiron and M. Ihadjadene, "Quelle place pour l'usager dans l'évaluation des SRI ?", in V. Couzinet and G. Regimbeau (Eds.), Recherches récentes en sciences de l'information : convergences et dynamiques, Actes du colloque international MICS-LERASS, 21-22 mars 2002, Toulouse, Paris: ADBS Éditions, pp. 211-232, 2002.


[14] P. Soucy and G.W. Mineau, "Beyond TFIDF weighting for text categorization in the vector space model", in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI 2005), Edinburgh, Scotland, 2005.

[15] J. Véronis, "Étude comparative de six moteurs de recherche", Université de Provence, 2006. http://sites.univ-provence.fr/veronis/pdf/2006-etude-comparative.pdf