Department of Information Technology, University of Turku
Turku Centre for Computer Science
Search Interfaces on the Web: Querying and Characterizing
Lectio Praecursoria, 12.06.2008
Denis [email protected]
Background
• Search engines (e.g., Google) do not crawl and index a significant portion of the Web
• Information in the non-indexable part of the Web cannot be found or accessed via search engines
• An important type of web content that is badly indexed:
  • web pages generated from parameters that users provide via search interfaces
• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
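
To make the difficulty concrete, here is a minimal sketch of what an automatic agent has to do even in the simplest case: find the form, collect its fields and defaults, choose values, and build the query URL. The sample page, the field names and `example.com` are hypothetical, and the sketch assumes a plain HTML form submitted with GET; real forms often need meaningful values for many fields, use POST, or depend on JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Hypothetical search form, loosely modelled on a car-search page.
SAMPLE_PAGE = """
<form action="/search" method="get">
  <input type="text" name="make">
  <select name="max_price">
    <option value="5000">5000</option>
    <option value="10000" selected>10000</option>
  </select>
  <input type="hidden" name="lang" value="en">
  <input type="submit" value="Search">
</form>
"""


class FormParser(HTMLParser):
    """Collects the action URL and the named fields of the first form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}      # field name -> default value
        self._select = None   # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = a.get("action") or ""
            self.method = (a.get("method") or "get").lower()
        elif tag == "input" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.fields[self._select] = ""
        elif tag == "option" and self._select:
            # Keep the pre-selected option, or the first one as a fallback.
            if "selected" in a or not self.fields[self._select]:
                self.fields[self._select] = a.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def build_query_url(base_url, page_html, user_values):
    """Fill the form defaults with user_values and return the GET URL."""
    parser = FormParser()
    parser.feed(page_html)
    if parser.action is None:
        raise ValueError("no form found")
    if parser.method != "get":
        raise ValueError("only GET forms are handled in this sketch")
    params = dict(parser.fields)
    unknown = set(user_values) - set(params)
    if unknown:
        raise KeyError(f"fields not in form: {sorted(unknown)}")
    params.update(user_values)
    return urljoin(base_url, parser.action) + "?" + urlencode(params)


url = build_query_url("http://example.com/cars/", SAMPLE_PAGE, {"make": "Volvo"})
print(url)
```

The hard part for a crawler is not this mechanical step but choosing sensible values for `user_values` without knowing what the database contains.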
• The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)
• Search interfaces are entry points to myriads of databases on the Web
• The central problem:
  • High-quality, publicly available data stored in a huge number of databases is accessible only via search interfaces (to access a database of interest, a user has to know the location of its search interface)
• Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)
Example of a search interface & search results
[Figure: AutoTrader search form (http://autotrader.com/)]
Deep Web: numbers & misconceptions
• Number of web databases:
  • Survey in April 2004: 450 000 web databases (and this is an underestimate)
• Size of the deep Web:
  • Survey of 2001: 400 to 550 times larger than the indexable Web; but the deep Web is not actually that big
  • No other reliable estimates of the entire size exist
  • According to my own indirect assessments: comparable with the size of the indexable Web
• Content of some web databases is, in fact, indexable:
  • No reliable estimates, but one can expect that about one fourth is indexed
  • Correlation with database subjects: content of books/movies/music databases (relatively ‘static’ data) is indexed well
  • But, even if known to search engines, the data is often outdated
Thesis contributions: querying search interfaces
• An approach to automating the querying and retrieval of information behind search interfaces
  • Essential in the case of complex queries
• A form query language for formulating queries and extracting useful information from result pages
• A prototype system for querying web databases
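
As an illustration of the extraction side only, the sketch below turns the structured block of a result page into records that can then be filtered like query results. It is not the form query language or the prototype from the thesis; the result page, its column names and the price filter are hypothetical, and the sketch assumes the results come as a single HTML table with a header row.

```python
from html.parser import HTMLParser

# Hypothetical result page returned by a car-search form.
RESULT_PAGE = """
<html><body>
<h1>3 cars found</h1>
<table id="results">
  <tr><th>Make</th><th>Model</th><th>Price</th></tr>
  <tr><td>Volvo</td><td>V70</td><td>8500</td></tr>
  <tr><td>Volvo</td><td>S40</td><td>6200</td></tr>
  <tr><td>Volvo</td><td>XC90</td><td>9900</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being parsed
        self._cell = None   # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(page_html):
    """Return one dict per data row, keyed by the lower-cased header cells."""
    parser = TableExtractor()
    parser.feed(page_html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    keys = [h.lower() for h in header]
    return [dict(zip(keys, row)) for row in body if len(row) == len(keys)]


records = extract_records(RESULT_PAGE)
# A condition applied to the extracted data rather than to the raw HTML.
cheap = [r["model"] for r in records if int(r["price"]) < 9000]
print(cheap)
```

Real data-rich pages rarely use such a clean table, which is why extraction rules usually have to be written or learned per site.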
Thesis contributions: characterization of the deep Web
• Previous surveys are based mainly on studies of deep web resources in English
• Two new methods for characterizing the deep Web
• Two surveys of one national (Russian) segment of the Web
• A dataset describing more than 200 web databases (statistically reliable)
Thesis contributions: finding web databases
• For any given topic there are too many web databases with relevant content: discovery must be automated
• A system for finding and classifying search interfaces
• Intended for:
  • Deep Web characterization studies
  • Building directories of web databases
• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
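
To give a feel for the classification step, the sketch below separates likely search forms from login and feedback forms using a few hand-written rules. It is an illustrative assumption, not the system described in the thesis: the rules, the sample page and the function names are hypothetical, and the sketch covers only plain HTML forms, not the JavaScript-rich and non-HTML forms mentioned above.

```python
from html.parser import HTMLParser

# Hypothetical page containing three different kinds of forms.
SAMPLE_PAGE = """
<form action="/find"><input type="text" name="q">
  <select name="category"><option>Books</option></select></form>
<form action="/login"><input type="text" name="user">
  <input type="password" name="pw"></form>
<form action="/feedback"><input type="text" name="email">
  <textarea name="message"></textarea></form>
"""


class FormCollector(HTMLParser):
    """Counts the field types that appear inside each form on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = {"text": 0, "password": 0, "select": 0, "textarea": 0}
            self.forms.append(self._current)
        elif self._current is not None:
            if tag == "input":
                kind = (a.get("type") or "text").lower()
                if kind in ("text", "search"):
                    self._current["text"] += 1
                elif kind == "password":
                    self._current["password"] += 1
            elif tag in ("select", "textarea"):
                self._current[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def looks_like_search_form(form):
    """Hand-written rules (assumptions): no password, no free-text area."""
    if form["password"] > 0:   # login or registration form
        return False
    if form["textarea"] > 0:   # feedback or comment form
        return False
    return form["text"] + form["select"] >= 1


collector = FormCollector()
collector.feed(SAMPLE_PAGE)
labels = [looks_like_search_form(f) for f in collector.forms]
print(labels)
```

Rules of this kind are cheap but brittle; a classifier trained on labelled forms would be the natural next step for a large-scale study.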
Applications
• Web search engines:
  • Eager to improve their coverage of the Web
  • In April 2008 Google announced that it was experimenting with a form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)
• Information owners and providers:
  • Typically want to disseminate their (publicly available) information
  • Interested in discovery methods, as they want their resources to be discovered and searched
• Vertical/topical search engines:
  • Find information on a specialized topic
  • Need methods to extract data from relevant resources and aggregate it
Future work
• The most promising direction: discovery of web databases
• The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web
  • Specialized directories already exist
  • Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources
• Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)