
Efficient Search in Large Textual Collections with Redundancy Jiangong Zhang and Torsten Suel Review by Newton Alex 993940942


Post on 31-Dec-2015




Problem

Searching over collections of data that include many different crawls and versions of each page

– E.g., searching the Internet Archive, email archives, etc.

Full-text search over such collections is not feasible because of the high cost of processing a query

– E.g., current indexing and query processing techniques, applied to, say, 10 successive crawls of the same URL, produce index sizes and query processing costs roughly 10 times those of a single crawl


Proposed Solution

A new and general framework that significantly reduces the size of the inverted index and the cost of query processing for web page collections with redundancy.

Features

– Content-dependent partitioning techniques, in particular winnowing.

– Non-redundant indexing, with two policies for sharing fragments: local sharing and global sharing.

– Modification of the document-at-a-time query processing algorithm to take advantage of the fragment-based indexes.
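To make the first two features concrete, here is a minimal sketch, not the paper's implementation. It assumes whitespace tokenization, CRC32 as the hash, and illustrative parameters `b` (tokens per hash) and `w` (window size). The names `winnow_cuts`, `partition`, and `FragmentIndex` are hypothetical. The index stores each distinct fragment once across all versions, which is roughly the idea behind global sharing.

```python
import zlib
from collections import defaultdict

def winnow_cuts(tokens, b=2, w=4):
    """Pick cut positions with a simple winnowing pass (assumed variant)."""
    # Hash every run of b consecutive tokens; crc32 is stable across runs.
    hashes = [zlib.crc32(" ".join(tokens[i:i + b]).encode("utf-8"))
              for i in range(len(tokens) - b + 1)]
    cuts = set()
    # Slide a window of w hashes and cut at the minimum hash in each window.
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Break ties by taking the rightmost minimum.
        pos = start + max(i for i, h in enumerate(window) if h == smallest)
        if pos > 0:
            cuts.add(pos)
    return sorted(cuts)

def partition(tokens, b=2, w=4):
    """Split tokens into fragments at the winnowing cut positions."""
    bounds = [0] + winnow_cuts(tokens, b, w) + [len(tokens)]
    return [tuple(tokens[s:e]) for s, e in zip(bounds, bounds[1:]) if s < e]

class FragmentIndex:
    """Toy non-redundant index: each distinct fragment is indexed once."""

    def __init__(self):
        self.fragment_ids = {}            # fragment content -> fragment id
        self.postings = defaultdict(set)  # term -> ids of fragments containing it
        self.doc_fragments = {}           # version id -> list of fragment ids

    def add(self, doc_id, text):
        ids = []
        for frag in partition(text.split()):
            fid = self.fragment_ids.get(frag)
            if fid is None:
                # New content: index its terms exactly once.
                fid = len(self.fragment_ids)
                self.fragment_ids[frag] = fid
                for term in frag:
                    self.postings[term].add(fid)
            ids.append(fid)
        self.doc_fragments[doc_id] = ids

    def search(self, term):
        """Return the versions that contain at least one matching fragment."""
        hits = self.postings.get(term, set())
        return sorted(d for d, ids in self.doc_fragments.items()
                      if hits.intersection(ids))

index = FragmentIndex()
index.add("v1", "the quick brown fox jumps over the lazy dog near the quiet river bank")
index.add("v2", "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn today")
print(index.search("fox"), len(index.fragment_ids))
```

Because cut positions depend only on nearby content, text appended to a later version leaves the earlier fragments unchanged, so the second version adds postings only for its new fragments.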


Critique

The paper does not describe the data structures used or the hardware setup in detail.

The framework supports deleting old, unused fragments. Why is deletion required when the goal is to search versioned collections?

Since no duplicate fragments are maintained, deleting a fragment might also remove content that other pages in the archive still share.


Relation to Course

This paper is similar to the Google News paper. However, this paper doesn’t describe the data structures or the environment setup in detail.

It relates to concepts used in the search engine project, such as inverted indexes and query matching.

Proposes methods for creating efficient indexes for redundant data.
