Lectio Praecursoria: Search Interfaces on the Web: Querying and Characterizing, 12.06.2008

Department of Information Technology, University of Turku Turku Centre for Computer Science Search Interfaces on the Web: Querying and Characterizing Lectio Praecursoria 12.06.2008 Denis Shestakov [email protected]

Upload: denis-shestakov

Post on 06-May-2015


DESCRIPTION

Lectio Praecursoria for my PhD dissertation, "Search Interfaces on the Web: Querying and Characterizing", given in the ICT building, Turku, Finland, on June 12, 2008.

Thesis contributions:
* Querying search interfaces
* Deep Web characterization
* Finding web databases

The text of the thesis is available at http://www.slideshare.net/denshe/shestakov2008-search-interfacesonthewebqueryingandcharacterizing

TRANSCRIPT

Department of Information Technology, University of Turku
Turku Centre for Computer Science

Search Interfaces on the Web: Querying and Characterizing

Lectio Praecursoria
12.06.2008

Denis Shestakov
[email protected]

Lectio Praecursoria 12.06.2008 2

Background

• Search engines (e.g., Google) do not crawl and index a significant portion of the Web

• Information in the non-indexable part of the Web cannot be found or accessed via search engines

• An important type of web content that is badly indexed:

  • web pages generated from parameters that users provide via search interfaces

• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
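
To make the point about form filling concrete, here is a minimal sketch of the purely mechanical part: turning a set of field values into the URL that a GET-based search form would request. The endpoint `http://example.com/search` and the field names `make`, `model` and `zip` are hypothetical and not taken from any real site. The hard part for a robot, choosing meaningful values for each field, is not addressed here.

```python
from urllib.parse import urlencode

def build_query_url(action_url, fields):
    """Return the URL requested when a GET form is submitted.

    action_url: the form's action attribute (hypothetical in this sketch)
    fields: mapping of field names to the values an agent decided to fill in
    """
    # Append to an existing query string if the action already has one.
    separator = "&" if "?" in action_url else "?"
    return action_url + separator + urlencode(fields)

# Hypothetical car-search form with three fields.
url = build_query_url(
    "http://example.com/search",
    {"make": "Ford", "model": "Focus", "zip": "10001"},
)
print(url)
```

Each distinct combination of field values yields a different result page, which is why a crawler that only follows links never sees this content.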

Background

• The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)

• Search interfaces are entry points to myriads of databases on the Web

• The central problem:

  • High-quality, publicly available data stored in a huge number of databases is accessible only via search interfaces (to access a database of interest, a user has to know the location of its search interface)

• Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)


Example of a search interface & search results

AutoTrader search form (http://autotrader.com/):

Deep Web: numbers & misconceptions

• Number of web databases:

  • Survey in April 2004: 450 000 web databases (and this is an underestimate)

• Size of the deep Web:

  • Survey of 2001: 400 to 550 times larger than the indexable Web; but it is not that big

  • No other reliable estimates of the entire size exist

  • According to my own indirect assessments: comparable with the size of the indexable Web

• Content of some web databases is, in fact, indexable:

  • No reliable estimates, but one can expect that one fourth is indexed

  • Correlation with database subjects: content of books/movies/music databases (relatively ‘static’ data) is indexed well

  • But even if known to search engines, the data is often outdated

Thesis contributions: querying search interfaces

• An approach to automating the querying and retrieval of information behind search interfaces

  • Essential in the case of complex queries

• A form query language that allows one to formulate queries and extract useful information from the result pages

• A prototype system for querying web databases

Thesis contributions: characterization of the deep Web

• Previous surveys are based on studies of deep Web resources that are mainly in English

• Two new methods for characterizing the deep Web

• Two surveys of one national (Russian) segment of the Web

• A dataset describing more than 200 web databases (statistically reliable)

Thesis contributions: finding web databases

• For any given topic there are too many web databases with relevant content: discovery must be automated

• A system for finding and classifying search interfaces

• Intended for:

  • Deep Web characterization studies

  • Building directories of web databases

• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
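
As a rough illustration of what finding search interfaces involves, the sketch below scans plain HTML for forms and applies a simple heuristic: a form with a text, search or select field and no password field is treated as a candidate search interface. This heuristic, and the names `FormCollector`, `looks_like_search_form` and `find_search_forms`, are assumptions made for illustration only; they are not the method of the thesis, and the sketch does not handle the JavaScript-rich or non-HTML forms mentioned above.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Collect the action and field types of every <form> on a page."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "fields": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self._current["fields"].append(kind)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

def looks_like_search_form(form):
    """Illustrative heuristic: has a query-like field and no password field."""
    fields = form["fields"]
    if "password" in fields:
        return False  # most likely a login or registration form
    return any(kind in ("text", "search", "select") for kind in fields)

def find_search_forms(html):
    collector = FormCollector()
    collector.feed(html)
    return [f for f in collector.forms if looks_like_search_form(f)]

# Hypothetical page with a login form and a car-search form.
PAGE = """
<form action="/login"><input type="text" name="user">
<input type="password" name="pw"></form>
<form action="/cars"><select name="make"><option>Ford</option></select>
<input type="text" name="zip"><input type="submit"></form>
"""
found = find_search_forms(PAGE)
print([f["action"] for f in found])
```

A real system would need far richer evidence (field labels, result-page analysis, script execution) before classifying a form.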

Applications

• Web search engines:

  • Eager to improve their coverage of the Web

  • In April 2008 Google announced they were experimenting with their form crawler (hence, most likely, other search engines would also test or implement one in their robots within 2008-2009)

• Information owners and providers

  • Typically want to disseminate their (publicly available) information

  • Interested in discovery methods, as they want their resources to be discovered and searched

• Vertical/topical search engines

  • Find information on a specialized topic

  • Need methods to extract data from relevant resources and aggregate it


Future work

• The most promising direction: discovery of web databases

• The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web

• Specialized directories already exist

• Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources

• Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)