Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel Review by Newton Alex 993940942

Upload: regan-conway

Post on 31-Dec-2015


TRANSCRIPT

Page 1: Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel

Efficient Search in Large Textual Collections with Redundancy

Jiangong Zhang and Torsten Suel

Review by

Newton Alex

993940942


Problem

Searching over collections of data that include many different crawls and versions of each page

– E.g. searching the Internet Archive, email archives, etc.

Full-text search is not feasible because of the high cost of processing a query

– E.g. current indexing and query processing techniques, when applied to, say, 10 successive crawls of the same URL, result in index sizes and query processing costs roughly 10 times those of a single crawl
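
The tenfold figure above can be checked with a toy inverted index. This is only a sketch: the page text, the `page@crawlN` identifiers, and the helper names `build_index` and `posting_count` are made up for illustration, and the ten crawls are assumed to be identical, which is the worst case for redundancy.

```python
from collections import defaultdict


def build_index(docs):
    """Plain inverted index: term -> list of doc ids, one posting per (term, doc)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return index


def posting_count(index):
    """Total number of postings, used here as a stand-in for index size."""
    return sum(len(postings) for postings in index.values())


# Hypothetical page content; every crawl stores the same text.
page = "archive search needs an index over every stored version"
one_crawl = {"page@crawl0": page}
ten_crawls = {f"page@crawl{i}": page for i in range(10)}

single = posting_count(build_index(one_crawl))
repeated = posting_count(build_index(ten_crawls))
print(single, repeated, repeated / single)  # 9 90 10.0
```

Each crawl is indexed as an independent document, so every term gets one posting per crawl and the index grows linearly with the number of versions even though no new content was added.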


Proposed Solution

A new and general framework that significantly reduces the size of the inverted index and the cost of query processing for web page collections with redundancy.

Features

– Content-dependent partitioning techniques, in particular winnowing.

– Non-redundant indexing. Two policies with respect to indexing: local sharing and global sharing.

– Modification of the document-at-a-time query processing algorithm to take advantage of the fragment-based indexes
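
The features above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: tokens are whitespace-separated words, the parameters `k` and `w` and all names (`kgram_hash`, `winnow_cuts`, `split_fragments`, `FragmentIndex`) are hypothetical, "local sharing" is read as sharing fragments only among versions of the same URL and "global sharing" as sharing across the whole collection, and `search` is a naive scan rather than the modified document-at-a-time algorithm.

```python
import hashlib
from collections import defaultdict


def kgram_hash(tokens):
    """Stable 64-bit hash of a token k-gram (built-in hash() varies per run)."""
    data = " ".join(tokens).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def winnow_cuts(tokens, k=2, w=4):
    """Cut positions: the rightmost minimum k-gram hash in each window of w hashes."""
    hashes = [kgram_hash(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    cuts = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        offset = w - 1 - window[::-1].index(smallest)  # rightmost occurrence
        cuts.add(start + offset)
    cuts.discard(0)  # position 0 already starts the first fragment
    return sorted(cuts)


def split_fragments(tokens, k=2, w=4):
    """Split a token list into fragments at the content-dependent cut positions."""
    bounds = [0] + winnow_cuts(tokens, k, w) + [len(tokens)]
    return [tuple(tokens[a:b]) for a, b in zip(bounds, bounds[1:]) if a < b]


class FragmentIndex:
    """Indexes each distinct fragment once; page versions only list fragment ids."""

    def __init__(self, sharing="global"):
        self.sharing = sharing            # "global" or "local" (assumed meanings)
        self.fragment_ids = {}            # sharing key -> fragment id
        self.postings = defaultdict(set)  # term -> ids of fragments containing it
        self.versions = {}                # (url, version) -> list of fragment ids

    def add(self, url, version, text):
        ids = []
        for frag in split_fragments(text.lower().split()):
            key = frag if self.sharing == "global" else (url, frag)
            if key not in self.fragment_ids:  # index a fragment only when it is new
                self.fragment_ids[key] = len(self.fragment_ids)
                for term in frag:
                    self.postings[term].add(self.fragment_ids[key])
            ids.append(self.fragment_ids[key])
        self.versions[(url, version)] = ids

    def search(self, *terms):
        """Versions in which every term occurs in at least one fragment (naive scan)."""
        hits = []
        for doc, ids in self.versions.items():
            present = set(ids)
            if all(self.postings.get(t, set()) & present for t in terms):
                hits.append(doc)
        return hits
```

Because the cut positions depend only on nearby content, an edit in one part of a page leaves the fragments far from the edit unchanged, so a new version mostly reuses fragment ids that are already indexed. Under the global policy the same text stored under two URLs adds no new postings; under the local policy it is indexed once per URL.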


Critique

The paper does not describe the data structures used or the hardware setup in detail.

The framework supports deleting old unused fragments. Why is a delete required when we are interested in versioned systems?

Since duplicate fragments are not stored, deleting a fragment might remove content that other pages in the archive still share.
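
To make the concern concrete, here is a small sketch of one common safeguard, reference counting, in which a shared fragment is dropped only when no remaining page uses it. Nothing in this review says whether the paper's framework works this way; the class `FragmentStore`, its methods, and the fragment labels are hypothetical.

```python
class FragmentStore:
    """Tracks how many stored pages reference each shared fragment."""

    def __init__(self):
        self.refcount = {}        # fragment -> number of references from pages
        self.page_fragments = {}  # page id -> fragments that make up the page

    def add_page(self, page_id, fragments):
        frags = list(fragments)
        self.page_fragments[page_id] = frags
        for frag in frags:
            self.refcount[frag] = self.refcount.get(frag, 0) + 1

    def delete_page(self, page_id):
        """Remove a page and return only the fragments that no other page uses."""
        removed = []
        for frag in self.page_fragments.pop(page_id):
            self.refcount[frag] -= 1
            if self.refcount[frag] == 0:
                del self.refcount[frag]
                removed.append(frag)
        return removed


store = FragmentStore()
store.add_page("v1", ["header", "body-a", "footer"])
store.add_page("v2", ["header", "body-b", "footer"])
print(store.delete_page("v1"))  # ['body-a']
```

Deleting `v1` releases only `body-a`; `header` and `footer` stay because `v2` still references them. Without such bookkeeping, deleting the fragments of `v1` directly would also damage `v2`, which is the risk the critique points at.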


Relation to Course

This paper is similar to the Google News paper. However, this paper doesn’t describe the data structures or the environment setup in detail.

Related to concepts used in the search engine project, such as inverted indexes and query matching.

Proposes methods for creating efficient indexes for redundant data.
