lectio praecursoria: search interfaces on the web: querying and characterizing, 12.06.2008

Department of Information Technology, University of Turku

Turku Centre for Computer Science

Search Interfaces on the Web: Querying and Characterizing

Lectio Praecursoria, 12.06.2008

Denis [email protected]


Background

• Search engines (e.g., Google) do not crawl and index a significant portion of the Web

• Information in the non-indexable part of the Web cannot be found or accessed via search engines

• An important type of web content that is poorly indexed:

• web pages generated from parameters that users provide via search interfaces

• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
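To make the notion of a page "generated based on parameters" concrete, here is a minimal sketch of what submitting a GET search form amounts to: the filled-in field values are encoded into a query string and appended to the form's action URL. The URL `http://example.com/search` and the field names `make`, `max_price` and `zip` are hypothetical and not taken from any real site. The hard part for an automatic agent is not this encoding step but choosing meaningful values for each field.

```python
from urllib.parse import urlencode


def build_query_url(action_url, fields):
    """Return the URL a browser would request when a GET form is submitted.

    `fields` maps form field names to the values a user typed or selected.
    Empty values are kept, since browsers also send empty text fields.
    """
    return f"{action_url}?{urlencode(fields)}"


# Hypothetical search form with three fields (names are assumptions).
url = build_query_url(
    "http://example.com/search",
    {"make": "Volvo", "max_price": 15000, "zip": ""},
)
print(url)  # http://example.com/search?make=Volvo&max_price=15000&zip=
```

A crawler that only follows hyperlinks never constructs such a URL, which is why the resulting page usually stays out of the index.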


Background

• The part of the Web ’behind’ search interfaces is known as the deep Web (or hidden Web)

• Search interfaces are entry points to myriads of databases on the Web

• The central problem:

• High-quality, publicly available data stored in a huge number of databases is accessible only via search interfaces (to access a database of interest, a user has to know the location of its search interface)

• Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)


Example of a search interface & search results

AutoTrader search form (http://autotrader.com/):


Deep Web: numbers & misconceptions

• Number of web databases:

• Survey in April 2004: 450 000 web databases (and this is an underestimate)

• Size of the deep Web:

• Survey of 2001: 400 to 550 times larger than the indexable Web; but in reality it is not that big

• No other reliable estimates of the entire size exist

• According to my own indirect assessments: comparable with the size of the indexable Web

• Content of some web databases is, in fact, indexable:

• No reliable estimates exist, but one can expect that about one fourth is indexed

• Correlation with database subjects: content of books/movies/music databases (relatively ’static’ data) is indexed well

• But, even if known to search engines, the indexed data is often outdated


Thesis contributions: querying search interfaces

• An approach to automating the querying and retrieval of information behind search interfaces

• Essential in the case of complex queries

• A form query language for formulating queries and extracting useful information from result pages

• A prototype system for querying web databases
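The extraction step named above can be illustrated with a small sketch. It is not the form query language from the thesis, only an assumption-laden example of the underlying task: pulling the structured rows out of a result page. It assumes the results are laid out as a plain HTML table; the class name `TableRowExtractor` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical result page returned by a web database.
result_page = (
    "<table>"
    "<tr><th>Make</th><th>Price</th></tr>"
    "<tr><td>Volvo</td><td>12 500</td></tr>"
    "</table>"
)
extractor = TableRowExtractor()
extractor.feed(result_page)
print(extractor.rows)
```

Real result pages are rarely this regular, which is one reason a dedicated language for describing what to extract is useful.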


Thesis contributions: characterization of the deep Web

• Previous surveys are based on studies of deep web resources that are mainly in English

• Two new methods for characterizing the deep Web

• Two surveys of one national (Russian) segment of the Web

• A dataset describing more than 200 web databases (statistically reliable)


Thesis contributions: finding web databases

• For any given topic there are too many web databases with relevant content: discovery must be automated

• A system for finding and classifying search interfaces

• Intended for:

• Deep Web characterization studies

• Building directories of web databases

• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
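As a rough illustration of what "finding and classifying search interfaces" involves, the sketch below separates forms that look searchable from forms such as login boxes. The rule used (a password field means "not a search form"; otherwise at least one text, search or select field is required) is an assumption made for this example and is not the classifier described in the thesis. It also handles only static HTML, so JavaScript-rich and non-HTML forms are out of its reach.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect, for each <form> on a page, the types of its input controls."""

    def __init__(self):
        super().__init__()
        self.forms = []     # one list of control types per form
        self._types = None  # controls of the form being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._types = []
        elif self._types is not None:
            if tag == "input":
                # An <input> without a type attribute is a text field.
                self._types.append((attrs.get("type") or "text").lower())
            elif tag in ("select", "textarea"):
                self._types.append(tag)

    def handle_endtag(self, tag):
        if tag == "form" and self._types is not None:
            self.forms.append(self._types)
            self._types = None


def looks_like_search_form(types):
    """Heuristic (an assumption): no password field, some queryable field."""
    if "password" in types:
        return False
    return any(t in ("text", "search", "select") for t in types)


def find_search_forms(html):
    collector = FormCollector()
    collector.feed(html)
    return [t for t in collector.forms if looks_like_search_form(t)]


# Hypothetical page with a login form and a search form.
page = (
    '<form action="/login"><input type="text" name="u">'
    '<input type="password" name="p"></form>'
    '<form action="/find"><select name="make"></select>'
    '<input name="q"><input type="submit"></form>'
)
print(find_search_forms(page))
```

A heuristic this simple will misclassify many forms (for example, newsletter sign-ups), so it should be read only as a starting point.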


Applications

• Web search engines:

• Eager to improve their coverage of the Web

• In April 2008 Google announced that it was experimenting with a form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)

• Information owners and providers:

• Typically want to disseminate their (publicly available) information

• Interested in discovery methods, as they want their resources to be discovered and searched

• Vertical/topical search engines:

• Find information on a specialized topic

• Need methods to extract data from relevant resources and aggregate it


Future work

• The most promising direction: discovery of web databases

• The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web

• Specialized directories already exist

• Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources

• Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)