Performance Evaluation of Relational Implementations of Inverted Text Index


Qi Su, Stanford University, [email protected]
Yu-Shan Fung, Stanford University, [email protected]

ABSTRACT

Information retrieval (IR) systems are adept at processing keyword queries over unstructured text. In contrast, relational database management systems (RDBMS) are designed for queries over structured data. Recent work has demonstrated the benefits of implementing the inverted index of traditional IR systems in an RDBMS, such as portability, parallelism, and scalability. We perform an in-depth comparison of alternative relational implementations of the inverted text index against a traditional IR system.

1. INTRODUCTION

Database and information retrieval (IR) are two rich fields of research that have produced ubiquitous tools such as the relational database management system (RDBMS) and the web search engine. However, historically, these two fields have largely developed independently even though they share one overriding objective, the management of data.

We know that traditional IR systems do not take advantage of structure of data, or metadata, very well. Conversely, relational database systems tend to have limited support for handling unstructured text. Major database vendors do offer sophisticated IR tools that are closely integrated with their database engines, for example, Oracle Text, IBM DB2 Text Information Extender, and Microsoft SQL Server Full-Text Search. These tools offer a full range of options, from Boolean, to ranked, to fuzzy search. However, each text index is defined over a single relational column. Hence, significant storage overhead is incurred, first by storing the plain text in a relational column, and again by the inverted index built by the text search tool. These tools offer powerful extensions to the traditional relational database, but do not address the full range of IR requirements. Their vendor-specific nature also means they are not portable solutions.

There has been research in the past decade investigating the use of relational databases to build inverted index-based information retrieval systems. There are several key advantages to such an approach. A pure relational implementation using standard SQL offers portability across multiple hardware platforms, operating systems, and database vendors. Such a system does not require software modification in order to scale on a parallel machine, as the DBMS takes care of data partitioning and parallel query processing. Use of a relational system enables searching over structured metadata in conjunction with traditional IR queries. The DBMS also provides features such as transactions, concurrent queries, and failure recovery. Most of the previous works have picked one relational implementation and compared it with a special-purpose IR system. Some of them have focused on a particular advantage, such as scalability on a parallel cluster.

We propose a comprehensive evaluation of the alternative relational implementations of inverted text index that have been discussed in literature, with the special-purpose IR system Lucene being the baseline for comparison. We will evaluate the systems on Boolean queries, phrase queries, and relevance ranked queries, and benchmark their relative performance in terms of query response times.

In section 2, we discuss the related work in literature concerning implementing inverted index as relations and integrating IR and DBMS. In section 3, we present the baseline IR and alternative relational implementations of inverted index systems. In section 4, we review evaluation of these systems, our test dataset, and queries to be executed. Section 5 presents the relative performance results collected and our observations. Finally, in section 6, we present concluding remarks and future work.

2. RELATED WORK

Several works have picked a single relational implementation and compared its performance with a baseline special-purpose IR system. Kaufmann et al [KS95] compared an IR system, BASISPlus, an early version of Oracle’s text search tool, SQL*TR, and a relational implementation of the inverted list with two relations, <term, docid> and <term, docfreq>. The evaluation dataset is small, with 850,000 tuples in the <term, docid> inverted list. The queries are strictly conjunctive Boolean queries.

More recent works have shown that Boolean, proximity, and vector space ranked model searching can be effectively implemented as standard relations while offering satisfactory performance when compared to a baseline traditional IR system. Grossman et al [GFH97] demonstrate that relational implementations are effective for Boolean, proximity, and ranked queries. The relational model, implemented using Microsoft SQL Server, consists of a doc_term table <docid, term, term freq>, a doc_term_prox table <docid, term, position>, and an idf table <term, idf>. The baseline IR system is Lotus Notes, a heavyweight system that is not built specifically for IR tasks. The authors also studied parallel performance on an AT&T 4-processor database machine.

Some works have focused on a single advantage of relational implementations over the traditional IR inverted index. Grabs et al [GBS01] evaluated the performance and scalability of a database IR system on a parallel cluster. The system is implemented with BEA middleware over an Oracle database, with significant emphasis on the transaction semantics to ensure high levels of search and insert parallelism. The basic data model is <term, docid>. Only Boolean query performance is measured.

Brown et al [BCC94, BR95] demonstrated efficient inverted index implementation and fast incremental index update using a database system. However, their implementation used a persistent object store manager, which is beyond our scope of using the traditional relational model and off-the-shelf RDBMS.

A recent issue of IEEE Data Engineering Bulletin covered the work by major database vendors to integrate full text search functionality into the RDBMS. [MS01], [HN01], [DIX01] presented how IBM DB2, Microsoft SQL Server, and Oracle introduce text extensions that are tightly coupled with the database engine. However, as we discussed earlier, such an approach is limited in that each text index must be defined over a single column, and storing both the full text of the document in the database, as well as storing the inverted index on the side, incurs significant storage overhead.

3. SYSTEM IMPLEMENTATIONS

We evaluate four systems on information retrieval tasks. The baseline system is Lucene, a special-purpose IR search engine. The three relational designs are implemented using IBM DB2 Universal Database. The first relational approach uses the DB2 Text Information Extender to take care of all the indexing and query processing. The two remaining relational approaches implement the inverted index as relations, and transform keyword queries to standard SQL queries.

3.1. Lucene

Lucene is an open-source text search engine system under the Apache project. It is written entirely in Java. We chose this as our baseline system as it offers ease of deployment and a full feature set representative of a traditional IR system.

Lucene includes three key APIs, IndexWriter, IndexReader, and IndexSearcher. IndexWriter enables the user to construct an inverted text index over a corpus of documents. Indexing may be customized with parameters such as case folding, stemming, etc. The IndexReader allows the user to probe the contents of the inverted index, for example, enumerating all tokens in the index. This important aspect will be discussed in section 4.1. The IndexSearcher provides a rich set of search options, including Boolean, phrase, ranked, and fuzzy search queries.

With our corpus, we will pass one document at a time to our IndexWriter instance to be tokenized and indexed. The keys associated with document are the document ID and URL, which is just the document file name. At execution time, we use the appropriate method of the IndexSearcher instance to retrieve the relevant document URLs from the Lucene index.

By default, all Lucene search hits are returned with a ranked score. For our case, we only care about the ranking when we measure the ranked query performance. Lucene keyword queries are structured as a single search string. And queries prefix each keyword with a plus sign. Or queries are a space delimited list of keywords. Phrase queries are a space delimited list of keywords enclosed in quotes.
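The three query string shapes described above can be sketched as follows. This is a minimal illustration in Python, not the authors' Java code; the function names are hypothetical, and the sketch only builds the search strings without calling Lucene.

```python
def and_query(keywords):
    """And query: every keyword is prefixed with a plus sign."""
    return " ".join("+" + k for k in keywords)


def or_query(keywords):
    """Or query: a space delimited list of keywords."""
    return " ".join(keywords)


def phrase_query(keywords):
    """Phrase query: a space delimited list of keywords enclosed in quotes."""
    return '"' + " ".join(keywords) + '"'


if __name__ == "__main__":
    words = ["zygo", "systems"]
    print(and_query(words))     # +zygo +systems
    print(or_query(words))      # zygo systems
    print(phrase_query(words))  # "zygo systems"
```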

3.2. IBM DB2 Text Information Extender

IBM DB2 Text Information Extender (TIE) is a full-text search engine tightly coupled with the IBM DB2 Universal Database. It supports the creation of full-text indexes on textual DB2 table columns. TIE uses the table primary key to relate inverted index token entries to their original source tuple in the table. TIE is invoked through special function calls over columns, much like a user-defined function.

We create a three column relation named fulltext <docid, url, text>. The text column is of the type Binary Large Object (BLOB) and contains the full text of the document. Our simple parser creates a single large load file. DB2’s load utility batch loads the data into the relation. Then we invoke TIE to index the text column.

Boolean and phrase queries use the contains function. Ranked queries also use the score function.

And query:
    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1" & "keyword2" & "keyword3"') = 1

Or query:
    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1" | "keyword2" | "keyword3"') = 1

Phrase query:
    SELECT url FROM fulltext
    WHERE contains(text, '"keyword1 keyword2 keyword3"') = 1

Ranked query:
    WITH temptable (url, score) AS
        (SELECT url, score(text, '"keyword1" & "keyword2" & "keyword3"') FROM fulltext)
    SELECT url FROM temptable
    WHERE score > 0
    ORDER BY score DESC

3.3. Term - Document

In this representation, each tuple corresponds to a term/document pair. The relations are:
    tf <term, docid, termfreq, positionlist>
    idf <term, idf>
    url <url, docid>

The bulk load files are created by invoking the Lucene IndexReader to probe the Lucene text index over our corpus. We will discuss this aspect further in sec. 4.

The positionlist attribute of the tf relation is an offset-encoded list of all occurrence positions of the given term in the given document: the first entry is the absolute position, and each later entry is the gap from the previous occurrence. For example, the term "hello" appearing three times in document 2, at positions 10, 100, and 102, would be encoded as the tuple
    <"hello", 2, 3, "10,90,2">
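As a minimal sketch of this offset encoding, the Python functions below convert between absolute positions and the comma-separated gap string. The function names and the exact string format (comma-separated decimal gaps) are assumptions for illustration; the paper does not show the authors' encoder.

```python
from itertools import accumulate


def encode_positions(positions):
    """Offset-encode a sorted list of absolute positions.

    The first entry stays absolute; each later entry is the gap
    from the previous position.
    """
    gaps = []
    previous = 0
    for p in positions:
        gaps.append(p - previous)
        previous = p
    return ",".join(str(g) for g in gaps)


def decode_positions(encoded):
    """Recover absolute positions by taking a running sum of the gaps."""
    if not encoded:
        return []
    return list(accumulate(int(g) for g in encoded.split(",")))


if __name__ == "__main__":
    print(encode_positions([10, 100, 102]))  # 10,90,2
    print(decode_positions("10,90,2"))       # [10, 100, 102]
```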

This representation is more compact compared to the Term-Document-Position approach and is well suited for Boolean and ranked queries. In the case of phrase or positional queries, we implement application logic to merge position lists.

There are two alternatives to implement an AND query in SQL. The natural choice is to translate an N-word query into an N-way equi-join on the document ID, where each join relation has been filtered to select documents containing one of the words.
    SELECT u.url FROM url u, tf t1, tf t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid
      AND t1.term='keyword1' AND t2.term='keyword2'

Figure 1. Query Plan for equi-join And query
[Diagram: inputs URL, TF, TF; operators Tablescan, Indexscan on term (twice), HashJoin (twice).]

An alternative due to Grossman [GFH97] treats the query keywords as an artificial relation, and joins it with the term-document relation. The result is subject to group-by aggregation by document, and only documents having the correct number of keyword matches are preserved.
    WITH query(term) AS (values ('keyword1'), ('keyword2'))
    SELECT u.url FROM url u, tf d, query q
    WHERE u.docid=d.docid AND d.term=q.term
    GROUP BY u.url HAVING count(d.term)=2

However, upon further evaluation (results not shown), we discovered that the second implementation never outperforms the first, and in some cases performs over an order of magnitude worse. Hence, from now on, all references to And queries on both the Term-Doc and Term-Doc-Position approaches refer to the first implementation.
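To check that the two AND formulations return the same documents, the sketch below runs both against a tiny in-memory table. It uses Python's sqlite3 module as a stand-in for DB2, so it says nothing about the relative performance reported here; the sample rows and file names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url (url TEXT, docid INTEGER);
    CREATE TABLE tf (term TEXT, docid INTEGER, termfreq INTEGER, positionlist TEXT);
    -- Hypothetical sample data: only document 1 contains both keywords.
    INSERT INTO url VALUES ('a.xml', 1), ('b.xml', 2), ('c.xml', 3);
    INSERT INTO tf VALUES
        ('keyword1', 1, 1, '4'),
        ('keyword2', 1, 2, '5,3'),
        ('keyword1', 2, 1, '7'),
        ('keyword2', 3, 1, '2');
""")

# Formulation 1: N-way equi-join on docid, one tf alias per keyword.
EQUI_JOIN = """
    SELECT u.url FROM url u, tf t1, tf t2
    WHERE u.docid = t1.docid AND t1.docid = t2.docid
      AND t1.term = ? AND t2.term = ?
"""

# Formulation 2: join with an artificial query relation, then keep
# only documents that matched every keyword.
GROUP_BY = """
    WITH query(term) AS (VALUES (?), (?))
    SELECT u.url FROM url u, tf d, query q
    WHERE u.docid = d.docid AND d.term = q.term
    GROUP BY u.url HAVING count(d.term) = 2
"""

keywords = ("keyword1", "keyword2")
equi_join_hits = sorted(row[0] for row in conn.execute(EQUI_JOIN, keywords))
group_by_hits = sorted(row[0] for row in conn.execute(GROUP_BY, keywords))

if __name__ == "__main__":
    print(equi_join_hits)  # ['a.xml']
    print(group_by_hits)   # ['a.xml']
```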

Figure 2. Query Plan for alternative And query

The OR query is a simple selection with multiple Or filters.
    SELECT DISTINCT(u.url) FROM url u, tf t
    WHERE u.docid=t.docid
      AND ( t.term='keyword1' OR t.term='keyword2' )

For phrase queries, we retrieve all candidate documents and the position lists for all the keywords in the query. Candidate documents are the results of the AND query with the same keywords. The result relation looks like <docid, position list for first keyword, position list for second keyword, … >.
    SELECT u.url, t1.positionlist, t2.positionlist FROM url u, tf t1, tf t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid
      AND t1.term='keyword1' AND t2.term='keyword2'

The application logic then traverses the multiple position lists together to find an instance where the positions in each list are one apart, in order.
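A minimal sketch of this post-processing step is shown below in Python. It assumes the position lists are comma-separated gap strings, and for brevity it uses set lookups instead of the simultaneous list traversal described above; both find the same matches. The function names are hypothetical.

```python
from itertools import accumulate


def decode_positions(encoded):
    """Turn an offset-encoded string such as '10,90,2' into absolute positions."""
    return list(accumulate(int(g) for g in encoded.split(",")))


def contains_phrase(position_lists):
    """Return True if the keywords occur at consecutive positions, in order.

    position_lists[i] holds the absolute positions of the i-th phrase
    keyword within a single candidate document.
    """
    later = [set(positions) for positions in position_lists[1:]]
    for start in position_lists[0]:
        # Keyword i+1 must appear exactly i+1 positions after the first keyword.
        if all(start + i + 1 in s for i, s in enumerate(later)):
            return True
    return False


if __name__ == "__main__":
    first = decode_positions("10,90,2")   # positions 10, 100, 102
    second = decode_positions("101")      # position 101
    print(contains_phrase([first, second]))  # True (100 is followed by 101)
```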

Our ranked query implementation is also due to Grossman [GFH97]. First, we precompute term IDF values as log ( number of documents / document frequency of the term ). The relevance of a given document d is the summation over all terms t occurring in both the query and the document:
    ∑ (query.termfreq for t * t’s IDF * d.termfreq for t * t’s IDF)

    WITH query(term, tf) AS (values ('keyword1',1),('keyword2',1))
    SELECT u.url, SUM (q.tf * i.idf * d.termfreq * i.idf) as score
    FROM url u, query q, tf d, idf i
    WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
    GROUP BY u.url
    ORDER BY score DESC
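The scoring formula can be worked through in a few lines of Python. This is a sketch under assumptions: the logarithm base is not stated in the text, so the natural log is used here, and the function and parameter names are illustrative.

```python
import math


def idf(num_docs, doc_freq):
    """IDF = log(number of documents / document frequency of the term).

    Assumption: natural logarithm; the base is not specified in the text.
    """
    return math.log(num_docs / doc_freq)


def score(query_tf, doc_tf, idf_by_term):
    """Sum query tf * IDF * document tf * IDF over terms shared by query and document."""
    return sum(
        q_tf * idf_by_term[t] * doc_tf[t] * idf_by_term[t]
        for t, q_tf in query_tf.items()
        if t in doc_tf
    )


if __name__ == "__main__":
    idfs = {"keyword1": idf(1000, 10), "keyword2": idf(1000, 100)}
    query = {"keyword1": 1, "keyword2": 1}
    document = {"keyword1": 2}  # the document contains keyword1 twice
    print(score(query, document, idfs))
```

Because the IDF appears twice in each summand, rare terms are weighted by the square of their IDF.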

3.4. Term - Document - Position

The term-document-position approach stores a single tuple for every occurrence of a term in a document. Hence a term t that appears five times in document d corresponds to five distinct tuples. The relations are:
    posting <term, docid, position>
    idf <term, idf>
    url <url, docid>

For Boolean and ranked queries, this representation is redundant, compared to the term-document approach. At query time, we must insert the distinct operator in our query plan to eliminate duplicates in our join results. However, this representation leads to straightforward SQL translation of phrase and proximity queries. There is no need for application logic or custom user-defined functions to post-process position lists for positional matches. Positional matches are specified as SQL arithmetic predicates relating the position attributes.

The Boolean queries are very similar to the Term-Document approach, except for the addition of distinct operators.

Equi-join And:
    SELECT DISTINCT(u.url) FROM url u, posting t1, posting t2
    WHERE u.docid=t1.docid AND t1.docid=t2.docid
      AND t1.term='keyword1' AND t2.term='keyword2'

Or:
    SELECT DISTINCT(u.url) FROM url u, posting d
    WHERE u.docid=d.docid
      AND ( d.term='keyword1' OR d.term='keyword2' )

[Diagram for Figure 2 (query plan for the alternative And query): inputs URL, Query, TF; operators Tablescan, Indexscan on term, Hashjoin, NLJoin on term, Sort, Groupby, Filter.]

Phrase queries:
    SELECT DISTINCT(u.url) FROM url u, posting t1, posting t2
    WHERE t1.docid=u.docid AND t1.docid=t2.docid
      AND t1.term='keyword1' AND t2.term='keyword2'
      AND t2.position=t1.position+1

Ranked queries:
    WITH query(term, tf) AS (values ('keyword1',1),('keyword2',1))
    SELECT u.url, SUM (q.tf * i.idf * i.idf) as score
    FROM url u, query q, posting d, idf i
    WHERE u.docid=d.docid AND q.term = i.term AND d.term = i.term
    GROUP BY u.url
    ORDER BY score DESC
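The positional predicate in the phrase query can be tried on a toy posting relation. The sketch below uses Python's sqlite3 module in place of DB2, with made-up rows; it also shows why the distinct operator is needed, since a document containing the phrase twice would otherwise be returned twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url (url TEXT, docid INTEGER);
    CREATE TABLE posting (term TEXT, docid INTEGER, position INTEGER);
    -- Hypothetical sample data: document 1 contains the phrase twice,
    -- document 2 contains both keywords but not adjacent to each other.
    INSERT INTO url VALUES ('a.xml', 1), ('b.xml', 2);
    INSERT INTO posting VALUES
        ('keyword1', 1, 4), ('keyword2', 1, 5),
        ('keyword1', 1, 9), ('keyword2', 1, 10),
        ('keyword1', 2, 3), ('keyword2', 2, 8);
""")

PHRASE = """
    SELECT DISTINCT u.url FROM url u, posting t1, posting t2
    WHERE t1.docid = u.docid AND t1.docid = t2.docid
      AND t1.term = ? AND t2.term = ?
      AND t2.position = t1.position + 1
"""

phrase_hits = [row[0] for row in conn.execute(PHRASE, ("keyword1", "keyword2"))]

if __name__ == "__main__":
    print(phrase_hits)  # ['a.xml']
```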

4. SYSTEM EVALUATION

The systems are implemented on a Pentium III 800MHz workstation with 1GB of RAM running Windows 2000. The baseline IR system is Lucene version 1.2 running on JDK 1.3. The relational database is IBM DB2 UDB Enterprise Edition version 7.2. Our first relational approach uses the IBM DB2 Text Information Extender version 7.2.

4.1. Dataset

Our dataset consists of 199,932 Reuters newswire articles from the year 1997. The raw text is 322MB. The corpus has 895,308 distinct tokens after case folding. There are 28,507,457 distinct term-document pairs, which is the cardinality of the term-document relation tf. There are 51,108,145 tokens in the corpus, which is the cardinality of the term-document-position relation posting.

The Reuters documents are wrapped in XML. We strip the XML tags to produce the textual bodies, which are loaded into the Lucene index and the DB2 TIE relation. DB2 TIE’s tokenization process is a black box to us, the end user. However, by default, queries do use case folding. Fancier features such as stemming and thesaurus can be specified in the search function invocation. Using Lucene, we can specify options of case folding, stemming, etc. at index creation and search time. For a uniform comparison of our four alternatives, we want uniform tokenization. Since DB2 TIE has case folding enabled by default, we use it as our common denominator. We build the Lucene index with case folding turned on. To produce uniform tokenization in our two relational implementations, the term-document and term-document-position relations are populated from an inverted index probe of the Lucene index. We use the Lucene IndexReader class to enumerate all term-document-position information and produce the appropriate bulk load file for our two relational representations. Such an approach guarantees that we have uniform tokenization, using only the case-folding feature, across our four implementations.

Table 1 Space utilization

                Table    Index    Total
Raw text        -        -        100%
Lucene          -        133%     133%
DB2 TIE         -        99%      -
Term-Doc        337%     389%     726%
Term-Doc-Pos    429%     783%     1212%

DB2’s table size estimation utility could not estimate the size of the fulltext base relation because the text body is stored in a BLOB attribute, rather than inline as varchar.

4.2. Queries

We test Boolean (And/Or), phrase and ranked queries. Our queries are 1, 2 or 4 keywords long. We divide each query class into subclasses of 3 different selectivities of approximately 1 document hit (0.0005% of corpus), 10 hits (0.005% of corpus), and 100 hits (0.05% of corpus). For each subclass, we generate three distinct queries and measure the average query execution time.
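The selectivity percentages follow directly from the corpus size of 199,932 documents given in section 4.1. A short Python check of the arithmetic (the function name is illustrative):

```python
CORPUS_SIZE = 199_932  # number of Reuters articles, from section 4.1


def selectivity_percent(hits, corpus_size=CORPUS_SIZE):
    """Expected hits expressed as a percentage of the corpus."""
    return 100.0 * hits / corpus_size


if __name__ == "__main__":
    for hits in (1, 10, 100):
        print(f"{hits:>3} hits -> {selectivity_percent(hits):.4f}% of corpus")
```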

Table 2 Sample of Queries Executed and Selectivities (expected number of hits)

hits   AND                    OR                         Phrase                    Rank
1      Gwil Industries (1)    Photronics Yomazzo (6)     Exceeding Consensus (2)   Photronics Yomazzo (6)
       Gaulle Quebec (1)      Videoserver Tandberg (4)   West life (1)             Videoserver Tandberg (4)
       Zygo Systems (1)       Genhold Femco (2)          Scotland Bancorp (1)      Genhold Femco (2)

Queries are generated by sampling sets of 1, 2, or 4 keywords from the headlines of the first thousand documents in the corpus, then repeatedly probing the Lucene index until the desired selectivity for the query subclass is achieved. For the 4 keyword Or queries, we were unable to generate queries of selectivity 1 or 10, because a disjunctive query returns the union of the hits of its keywords.

We want to measure a uniform response time for keyword queries. Our standard is the execution time between when the keywords are submitted by the user and when all result document URLs have been returned to the user. For the Lucene implementation, we measure the time between when our Java Lucene IndexSearcher instance receives the command line keyword parameters and search option, and when it completes retrieval of hits from the index. For our three database implementations, we build an embedded SQL application that takes in command line keyword parameters and search options, translates the query into appropriate SQL, connects to the database, executes the query and retrieves document URLs. The execution time of the embedded SQL application is measured.
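The measurement convention (stop the clock only after every result URL has been retrieved, then average over the queries of a subclass) can be sketched in Python as below. This is not the authors' harness, which was a Java program for Lucene and an embedded SQL application for DB2; `run_query` is a hypothetical callable standing in for either system.

```python
import time


def timed(run_query, *args):
    """Run one query and return (results, elapsed seconds).

    The results are fully materialized before the clock stops, so the
    measurement covers retrieval of all hits, not just query submission.
    """
    start = time.perf_counter()
    results = list(run_query(*args))
    elapsed = time.perf_counter() - start
    return results, elapsed


def average_time(run_query, queries):
    """Average execution time over the queries of one subclass."""
    return sum(timed(run_query, q)[1] for q in queries) / len(queries)


if __name__ == "__main__":
    # Stand-in query function that "returns" one URL per keyword.
    fake_search = lambda keyword: [keyword + ".xml"]
    print(timed(fake_search, "zygo")[0])
    print(average_time(fake_search, ["a", "b", "c"]) >= 0)
```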

5. RESULTS & OBSERVATIONS

[The following abbreviations are used in the tables:
TIE: IBM DB2 Text Information Extender
TD: Term-Doc relational model
TDP: Term-Doc-Pos relational model]

Table 3 And Queries Class Average Execution Time in seconds

                     Lucene   TIE      TD       TDP
1 word - 1 hit       0.359    1.094    0.406    0.448
1 word - 10 hit      0.729    1.432    0.813    0.448
1 word - 100 hit     3.391    4.474    2.781    0.661
2 word - 1 hit       0.375    0.557    1.250    0.651
2 word - 10 hit      0.578    4.343    1.922    0.662
2 word - 100 hit     2.078    4.900    2.078    0.802
4 word - 1 hit       0.526    1.396    1.063    2.083
4 word - 10 hit      0.739    2.380    2.125    2.573
4 word - 100 hit     2.954    3.895    3.870    4.453

Table 4 Or Queries Class Average Execution Time in seconds

                     Lucene   TIE      TD       TDP
1 word - 1 hit       0.359    1.094    0.406    0.448
1 word - 10 hit      0.729    1.432    0.813    0.448
1 word - 100 hit     3.391    4.474    2.781    0.661
2 word - 1 hit       0.385    0.343    0.442    112.7
2 word - 10 hit      0.719    0.446    0.469    113.5
2 word - 100 hit     2.500    0.501    0.693    115.9
4 word - 1 hit       -        -        -        -
4 word - 10 hit      -        -        -        -
4 word - 100 hit     3.016    0.922    0.805    120.5

Table 5 Phrase Queries Class Average Execution Time in seconds

                     Lucene   TIE      TD       TDP
1 word - 1 hit       0.359    1.094    0.406    0.448
1 word - 10 hit      0.729    1.432    0.813    0.448
1 word - 100 hit     3.391    4.474    2.781    0.661
2 word - 1 hit       0.391    5.482    0.406    0.818
2 word - 10 hit      0.614    6.774    0.813    0.906
2 word - 100 hit     2.552    4.067    2.781    2.354
4 word - 1 hit       0.578    1.453    1.359    2.541
4 word - 10 hit      0.771    3.104    2.833    12.68
4 word - 100 hit     3.047    3.562    3.458    13.52

Table 6 Rank Queries Class Average Execution Time in seconds

                     Lucene   TIE      TD       TDP
1 word - 1 hit       0.359    0.635    0.492    ∞*
1 word - 10 hit      0.729    1.099    0.526    ∞*
1 word - 100 hit     3.391    2.934    0.708    ∞*
2 word - 1 hit       0.385    0.339    13.55    257.6
2 word - 10 hit      0.719    0.370    13.47    284.6
2 word - 100 hit     2.500    0.432    16.19    261.4
4 word - 1 hit       -        -        -        -
4 word - 10 hit      -        -        -        -
4 word - 100 hit     3.016    0.523    15.06    307.4
* did not finish in 10 minutes

5.1 Lucene

Lucene, being a specialized IR system, performs consistently well across the board on all four types of queries. It has fast response time, and scales well over the query size. However, performance deteriorates slightly as the expected result size increases.

5.2 IBM DB2 Text Information Extender (TIE)

TIE produced performance numbers comparable to those of Lucene on And queries, slightly better on Or queries, and slightly worse on Phrase queries. We are not able to make a direct comparison of Rank query performance, as TIE returns only documents containing all query terms whereas the three other systems require only one keyword match. However, judging from its performance on single-keyword queries, its response time is comparable to that of Lucene. Furthermore, TIE response time varies significantly across different queries from the same query class, a characteristic not seen in the other systems.

5.3 Term-Doc


The system using the Term-Doc relational model performed comparably with Lucene and TIE on And, Or and Phrase queries. It also appears to scale well (sub-linearly) on query and result size over those same types of queries. It also performs competitively on single-keyword ranked queries, but performance degrades significantly on ranked queries with 2 or more keywords. This is apparently due to the optimizer choosing a different query plan on these queries. However, performance still appears to scale gracefully as query and result size increase.

5.4 Term-Doc-Pos

The system based on the Term-Doc-Pos relational model produced reasonably fast response times on And and Phrase queries, though efficiency deteriorates notably as the expected number of hits and the query size increase. The system was much slower than the rest on Or and Rank queries. In fact, with single-keyword ranked queries, the system does not respond within 10 minutes. Upon further investigation, we found that the query optimizer was picking some unreasonably bad plans, involving multiple sorts of large relations. We plan to look into improvements in these areas in the future. Despite these shortcomings, this approach seems to scale reasonably well on Or and Rank queries, where its response time is quite insensitive to the query or result sizes.

5.5 Comparisons

[The following abbreviations are used in the charts:
DB2text: IBM DB2 Text Information Extender
DB2term_doc: Term-Doc relational model
DB2term_doc_pos: Term-Doc-Pos relational model]

All four systems perform well on And queries (Figure 3), with response times no more than 5 seconds. It should be noted that both the Term-Doc and Term-Doc-Pos models produced numbers comparable to those of Lucene. In fact, both appear to scale more gracefully as the result set grows.

Figure 3. And query with 2 keywords

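With regard to Phrase queries, all four systems finished two-keyword queries within 7 seconds, with all except TIE producing sub-3 second response times (Table 5). With four keywords (Figure 4), Term-Doc-Pos slows to over 12 seconds at 10 and 100 hits, while the other three systems stay under 4 seconds. Note that both TIE and Term-Doc exhibit good scaling characteristics compared to the Lucene and Term-Doc-Pos systems.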

Figure 4. Phrase query with 4 keywords

With Rank queries, we see where the two specialized systems have their advantages (Figure 5, note the logarithmic scale). Lucene and TIE significantly outperform the two systems built on relational models. One reason behind the disparity is that the relational systems perform extensive sorting operations on large relations. We also discovered that the DB2 query optimizer could have picked better plans if more indices/constraints were available, and we plan to investigate this in the future.


Figure 5. Rank query with 2 keywords

6. CONCLUSIONS AND FUTURE WORK

The Term-Doc representation offers competitive performance on Boolean and phrase queries compared with the special-purpose IR system Lucene. Hence one may choose to incur the storage overhead to gain the advantages of a relational implementation, such as portability, parallelism, and the ability to query over both unstructured text and structured metadata. In general, the Term-Doc representation offers better performance than the Term-Doc-Position representation. DB2 TIE provides performance comparable to Lucene, though it incurs significantly higher space overhead by storing the base text in a relation, as well as the inverted index. If a workload consists mostly of ranked queries, then Lucene or TIE should be used, as the DB2 optimizer seems to be using sub-optimal plans for the two relational implementations.

There are a number of avenues for future work. Additional query classes to be investigated include proximity queries and wild-card queries. We measured the query execution times in isolation. We may want to measure executions of sustained query workloads. Another important question is index update performance. It is conceivable that RDBMS page layout and B-tree index may be more efficient for inverted index insertion than the traditional IR approach of maintaining a small update list and reorganizing the entire index periodically. Traditional IR search engines are not well suited for high-insert environments. We would like to find out if RDBMS approaches are more attractive in such a setting. On the performance front, a number of our RDBMS queries were clearly being executed on sub-optimal plans, hence some amount of database tuning may result in a significant performance boost. Since most IR systems are used interactively, and users typically process result hits in batches, it is often useful to optimize for top-K results. Most database vendors provide language constructs to specify this constraint and to utilize it in query processing for better efficiency (e.g. reducing the size of sort results).
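The benefit of a top-K constraint can be illustrated on the application side: selecting the K best-scored hits does not require sorting the whole result. The Python sketch below uses the standard heapq module; it is an illustration of the idea only, not a database feature, and the scored hits are made up.

```python
import heapq


def top_k(scored_hits, k):
    """Return the k highest-scoring (url, score) pairs, best first.

    heapq.nlargest keeps only k candidates at a time instead of
    sorting every hit.
    """
    return heapq.nlargest(k, scored_hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    hits = [("a.xml", 0.2), ("b.xml", 0.9), ("c.xml", 0.5)]
    print(top_k(hits, 2))  # [('b.xml', 0.9), ('c.xml', 0.5)]
```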

7. REFERENCES

[BCC94] E. W. Brown, J. P. Callan, and W. B. Croft. Fast Incremental Indexing for Full-Text Information Retrieval. In Proceedings of the 20th International Conference on Very Large Databases, 1994.

[BR95] E. W. Brown. Execution Performance Issues in Full-Text Information Retrieval. Ph.D. Thesis, University of Massachusetts, Amherst, 1995.

[DDS95] S. DeFazio, A. Daoud, L. Smith, J. Srinivasan, B. Croft, and J. Callan. Integrating IR and RDBMS using cooperative indexing. In Proceedings of the 18th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1995.

[DIX01] P. Dixon. Basics of Oracle Text Retrieval. IEEE Data Engineering Bulletin, December 2001.

[GBS01] T. Grabs, K. Böhm, and H.-J. Schek. PowerDBIR: Information Retrieval on Top of a Database Cluster. In Proceedings of the 10th ACM International Conference on Information and Knowledge Management, 2001.

[GFH97] D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts. Integrating structured data and text: A relational approach. In Journal of the American Society for Information Science, 1997.

[HN01] J. Hamilton and T. Nayak. Microsoft SQL Server Full-Text Search. IEEE Data Engineering Bulletin, December 2001.

[KS95] H. Kaufmann, and H.-J. Schek. Text Search Using Database Systems Revisited - Some Experiments. In Proceedings of the 13th British National Conference on Databases, 1995.

[LFH99] C. Lundquist, O. Frieder, D. O. Holmes, and D. A. Grossman. A Parallel Relational Database Management System Approach to Relevance Feedback in Information Retrieval. In Journal of the American Society for Information Science, 1999.

[LS88] C. A. Lynch, and M. Stonebraker. Extended User-Defined Indexing with Application to Textual Databases. In Proceedings of the 14th International Conference on Very Large Databases, 1988.

[MS01] A. Maier, and D. Simmen. DB2 Optimization in Support of Full Text Search. IEEE Data Engineering Bulletin, December 2001.

[RAG01] P. Raghavan. Structured and Unstructured Search in Enterprises. IEEE Data Engineering Bulletin, December 2001.