Deep Web: Databases on the Web

Denis Shestakov

Turku Centre for Computer Science, Finland

INTRODUCTION

Finding information on the Web using a web search engine is one of the primary activities of today's web users. For a majority of users, the results returned by conventional search engines appear to be an essentially complete set of links to all pages on the Web relevant to their queries. However, present-day search engines do not crawl and index a significant portion of the Web and, hence, web users relying on search engines alone are unable to discover and access a large amount of information in the non-indexable part of the Web. Specifically, dynamic pages generated from parameters provided by a user via web search forms are not indexed by search engines and cannot be found in their results. Such search interfaces provide web users with online access to myriads of databases on the Web. To obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results: a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale.

Content provided by many web databases is often of very high quality and can be extremely valuable to many users. For example, the PubMed database (http://www.pubmed.gov) allows a user to search through millions of high-quality peer-reviewed papers on biomedical research, while the AutoTrader car classifieds database at http://autotrader.com is highly useful for anyone wishing to buy or sell a car. In general, since each searchable database is a collection of data in a specific domain, it can often provide more specific and detailed information that is unavailable or hard to find in the indexable Web. The following section provides background information on the non-indexable Web and web databases.

BACKGROUND

Conventional web search engines index only a portion of the Web, called the publicly indexable Web, which consists of publicly available web pages reachable by following hyperlinks. The rest of the Web, known as the non-indexable Web, can be roughly divided into two large parts. The first consists of protected web pages, i.e., pages that are password-protected, marked as non-indexable by webmasters using the Robots Exclusion Protocol, or only available when visited from certain networks (e.g., corporate intranets). Web pages accessible via web search forms (or search interfaces) comprise the second part of the non-indexable Web, known as the deep Web (Bergman, 2001) or the hidden Web (Florescu, Levy & Mendelzon, 1998). The deep Web is not completely unknown to search engines: some web databases can be accessed not only through web forms but also through link-based navigational interfaces (e.g., a typical shopping site usually allows browsing products in some subject hierarchy). Thus, a certain part of the deep Web is, in fact, indexable, as web search engines can technically reach the content of those databases through their browse interfaces.


The separation of the Web into indexable and non-indexable portions is summarized in Figure 1. The deep Web, shown in gray, is formed by those public pages that are accessible via web search forms.

Figure 1. Indexable and non-indexable portions of the Web and deep Web

Figure 2. User interaction with a web database

A user's interaction with a web database is depicted schematically in Figure 2. A web user queries the database via a search interface located on a web page. First, search conditions are specified in the form and submitted to a web server. Middleware (often called a server-side script) running on the web server processes the user's query by transforming it into a proper format and passing it to the underlying database. The server-side script then generates a resulting web page by embedding the results returned by the database into a page template and, finally, the web server sends the generated page back to the user. Frequently, the results do not fit on one page and, therefore, the page returned to the user contains some form of navigation to the other pages with results. The resulting pages are often termed data-rich or data-intensive (Crescenzi, Mecca & Merialdo, 2001).
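To make this interaction concrete, the following minimal sketch (Python, standard library only) emulates such a server-side script: it receives search conditions from a form submission, passes them to an underlying database and embeds the results into a page template. The field name, table and template are invented for the example and do not come from any system discussed in this chapter.

```python
# A minimal server-side script ("middleware") sketch: it accepts a query
# submitted via a search form, runs it against an underlying database and
# embeds the results into a page template. Field and table names are invented.
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# Underlying database: a toy car-classifieds table.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE cars (make TEXT, model TEXT, price INTEGER)")
db.executemany("INSERT INTO cars VALUES (?, ?, ?)",
               [("Toyota", "Corolla", 5000), ("Toyota", "Camry", 7500)])

PAGE_TEMPLATE = "<html><body><h1>Results</h1><ul>{rows}</ul></body></html>"

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 1. Extract the user's search conditions from the submitted form.
        params = parse_qs(urlparse(self.path).query)
        make = params.get("make", [""])[0]
        # 2. Transform the query into a proper format and pass it to the database.
        rows = db.execute("SELECT make, model, price FROM cars WHERE make = ?",
                          (make,)).fetchall()
        # 3. Embed the results into the page template and send the page back.
        items = "".join(f"<li>{m} {mo}, ${p}</li>" for m, mo, p in rows)
        body = PAGE_TEMPLATE.format(rows=items).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), SearchHandler).serve_forever()
```

A request such as http://localhost:8080/?make=Toyota would then return a dynamically generated, data-intensive results page of the kind described above.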


While unstructured content dominates the publicly indexable Web, the deep Web is a huge source of structured data. Web databases can be broadly classified into two types: 1) structured web databases, which provide data objects as structured, relational-like records with attribute-value pairs, and 2) unstructured web databases, which provide data objects as unstructured media (e.g., text, images, audio, video) (Chang et al., 2004). For example, being a collection of research paper abstracts, the PubMed database mentioned earlier is considered an unstructured database, while autotrader.com has a structured database for car classifieds, which returns structured car descriptions (e.g., make = “Toyota”, model = “Corolla”, price = $5000). According to the survey by Chang et al. (2004), structured databases dominate the deep Web – approximately four in five web databases are structured. The distinction between unstructured and structured databases matters because the unstructured content provided by web databases of the first type can already be handled reasonably well by web search engines. Structured content is different in this respect: to be fully utilized, it must first be extracted from data-intensive pages and then processed much like the records of a traditional database.
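As a small illustration of the last point, once a structured record has been extracted from a resulting page it maps almost directly onto a relational schema; the attribute names below simply mirror the car-classifieds example and are illustrative only.

```python
# Sketch: a structured record extracted from a resulting page is essentially
# a set of attribute-value pairs and can be loaded into a traditional database.
import sqlite3

extracted_record = {"make": "Toyota", "model": "Corolla", "price": 5000}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (make TEXT, model TEXT, price INTEGER)")
conn.execute("INSERT INTO cars (make, model, price) VALUES (:make, :model, :price)",
             extracted_record)
print(conn.execute("SELECT * FROM cars WHERE price < 6000").fetchall())
```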

CHALLENGES OF THE DEEP WEB

Deep Web Characterization

Though the term deep Web was coined back in 2000 (Bergman, 2001), a long time ago by the standards of web-related concepts and technologies, many important characteristics of the deep Web are still unknown. For instance, the total size of the deep Web (i.e., how much information is stored in web databases) can only be guessed at. At the same time, the existing estimates for the total number of deep web sites are considered reliable and consistent. It has been estimated that 43,000 - 96,000 deep web sites existed in March 2000 (Bergman, 2001). A three- to seven-fold increase over the following four years was reported in the survey by Chang et al. (2004), which estimated the total numbers of deep web sites and web databases in April 2004 at around 310,000 and 450,000, respectively. Moreover, even national or domain-specific subsets of deep web resources are very large collections. For example, in summer 2005 the Russian part of the Web alone contained around 7,300 deep web sites (Shestakov & Salakoski, 2007), and the bioinformatics community alone maintains almost one thousand molecular biology databases freely available on the Web (Galperin, 2007).

As mentioned earlier, the deep Web is partially indexed by web search engines because the content of some web databases is accessible both via hyperlinks and via search interfaces. According to rough estimates based on an analysis of 80 databases in eight domains (Chang et al., 2004), around 25% of deep web data is covered by Google (http://google.com). At the same time, most deep web data indexed by Google is out-of-date: only one in five indexed pages is fresh.

Querying Web Databases

The Hidden Web Exposer (HiWE) (Raghavan & Garcia-Molina, 2001) is an example of a deep web crawler, i.e., a system that gives a web crawler the ability to automatically fill out search interfaces. Each time HiWE finds a web page with a search interface, it performs the following three steps (a simplified sketch of this loop is given after the list):

1. Constructs a logical representation of the search form by associating a field label (descriptive text that helps a user understand the semantics of a form field) with each form field. For example, as shown in Figure 2, “Car Make” is the field label for the upper form field.

2. Takes values from a pre-existing value set (initially provided by a user) and submits the form by assigning the most relevant values to form fields.


3. Analyzes the response and, if it is found valid, retrieves the resulting pages and stores them in a repository (the stored pages can then be indexed to support user queries, just as in a conventional search engine).
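The following is a simplified, self-contained sketch of this loop, not HiWE's implementation: the form URL, field names and value set are invented, the form is submitted with a plain GET request, and a crude string-similarity measure stands in for HiWE's label-matching and ranking heuristics.

```python
# Schematic outline of the three HiWE steps for a single search form.
import itertools
from difflib import SequenceMatcher
from urllib.parse import urlencode
from urllib.request import urlopen

# Step 1 (assumed already done): a logical form representation mapping each
# form field to the label extracted for it, e.g. by a LITE-like technique.
logical_form = {"make_field": "Car Make", "model_field": "Car Model"}

# Pre-existing value set initially provided by a user: category -> values.
value_set = {"Car Make": ["Toyota", "Honda"], "Car Model": ["Corolla", "Civic"]}

def best_category(label):
    # Step 2a: pick the value category whose name best matches the field label.
    return max(value_set,
               key=lambda c: SequenceMatcher(None, c.lower(), label.lower()).ratio())

def submit(assignment, action="http://example.org/search"):
    # Step 2b: submit the filled-out form (a GET submission for simplicity).
    return urlopen(action + "?" + urlencode(assignment)).read().decode()

fields = list(logical_form)
categories = [best_category(logical_form[f]) for f in fields]

repository = []
for values in itertools.product(*(value_set[c] for c in categories)):
    page = submit(dict(zip(fields, values)))
    # Step 3: keep the resulting page only if the response looks valid
    # (here, a crude check that it is non-empty and reports some results).
    if page and "no results" not in page.lower():
        repository.append(page)
```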

One of the major challenges for deep web crawlers such as HiWE, as well as for any other automatic agents that query web databases, is understanding the semantics of a search interface. Web search forms are created and designed for human beings, not for computer applications. Consequently, a task as simple (from a user's point of view) as filling out a search form turns out to be a computationally hard problem. Recognizing the field labels that describe the fields of a search interface (see Figure 2) is challenging because there are myriads of web coding practices and, thus, no common rules regarding the relative position, in the HTML code of a page, of the two elements that declare a field label and its corresponding field. HiWE introduces the Layout-based Information Extraction Technique (LITE) (Raghavan & Garcia-Molina, 2001), in which the physical layout of a page is used, in addition to its code, to aid extraction. Using the layout engine, LITE identifies the pieces of text, if any, that are physically closest to a form field in the horizontal and vertical directions. These pieces of text (potential field labels) are then ranked based on several heuristics, and the highest-ranked candidate is taken as the field label associated with the form field.

A different approach, to automatically issuing queries to textual (unstructured) databases, was presented by Callan and Connell (2001). The method, called query-based sampling, exploits the results of an initial query to extract new query terms, which are then submitted, and so on iteratively. The approach has been applied with some success to siphon data hidden behind keyword-based search interfaces (Barbosa & Freire, 2004). Ntoulas et al. (2005) proposed an adaptive algorithm to select the best keyword queries, namely keyword queries that return a large fraction of the database content. In particular, it has been shown that with only 83 queries it is possible to download almost 80% of all documents stored in the PubMed database. Query selection for crawling structured web databases was studied by Wu et al. (2006), who modeled a structured database as an attribute-value graph. Query-based crawling of a web database is then transformed into a graph traversal activity, which parallels the traditional problem of crawling the publicly indexable Web by following hyperlinks.

Although HiWE provides a very effective approach to crawling the deep Web, it does not extract data from the resulting pages. The DeLa (Data Extraction and Label Assignment) system (Wang & Lochovsky, 2003) sends queries via search forms, automatically extracts data objects from the resulting pages, fits the extracted data into a table and assigns labels to the attributes of the data objects, i.e., the columns of the table. DeLa adopts HiWE to detect the field labels of search interfaces and to fill out search interfaces. Its data extraction and label assignment approaches are based on two main observations. First, data objects embedded in the resulting pages share a common tag structure and are listed contiguously in the web pages; regular-expression wrappers can therefore be generated automatically to extract such data objects and fit them into a table. Second, a search interface, through which web users issue their queries, gives a sketch of (part of) the underlying database. Based on this, the extracted field labels are matched to the columns of the table, thereby annotating the attributes of the extracted data.
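To illustrate the first of these observations, the sketch below applies a hand-written regular-expression wrapper to an invented fragment of a resulting page in which records share a common tag structure, fits the matches into a table and labels the columns with field labels taken from the search interface. DeLa infers such wrappers automatically; everything here is written out by hand purely for illustration.

```python
# Sketch: extracting contiguously listed, identically structured data objects
# from a resulting page with a regular-expression wrapper and fitting them
# into a labeled table. The page fragment and field labels are invented.
import re

resulting_page = """
<tr class="item"><td>Toyota</td><td>Corolla</td><td>$5,000</td></tr>
<tr class="item"><td>Honda</td><td>Civic</td><td>$6,200</td></tr>
<tr class="item"><td>Ford</td><td>Focus</td><td>$4,800</td></tr>
"""

# Wrapper: one regular expression matching the shared tag structure of records.
wrapper = re.compile(
    r'<tr class="item"><td>(.*?)</td><td>(.*?)</td><td>(.*?)</td></tr>')

# Field labels taken from the search interface give the column names.
columns = ["Car Make", "Car Model", "Price"]
table = [dict(zip(columns, match)) for match in wrapper.findall(resulting_page)]
# table[0] == {"Car Make": "Toyota", "Car Model": "Corolla", "Price": "$5,000"}
```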


The HiWE system (and hence the DeLa system) works with relatively simple search interfaces. Besides, it does not consider the navigational complexity of many deep web sites. In particular, resulting pages may contain links to other web pages with relevant information, and these links consequently need to be navigated and their relevance evaluated. Additionally, obtaining pages with results sometimes requires two or more successful form submissions (as a rule, the second form is returned to refine or modify the search conditions provided in the previous form). DEQUE (Deep Web Query System) (Shestakov, Bhowmick & Lim, 2005) addressed these challenges and proposed a web form query language called DEQUEL that allows a user to assign more values at a time to a search interface's fields than is possible when the interface is filled out manually. For example, data from relational tables or data obtained by querying other web databases can be used as input values. Data extraction from resulting pages in DEQUE is based on a modified version of the RoadRunner technique (Crescenzi, Mecca & Merialdo, 2001).

Among research efforts devoted solely to information extraction from resulting pages, Caverlee and Liu (2005) presented the Thor framework for sampling, locating, and partitioning the query-related content regions (called Query-Answer Pagelets, or QA-Pagelets) of the deep Web. Thor is designed as a suite of algorithms supporting the entire scope of a deep web information platform. Its data extraction and preparation subsystem supports a robust approach to sampling, locating, and partitioning the QA-Pagelets in four phases: 1) the first phase probes web databases to sample a set of resulting pages that covers all structurally diverse kinds of resulting pages, such as pages with multiple records, a single record, and no records; 2) the second phase uses a page clustering algorithm to segment resulting pages according to their control-flow dependent similarities, aiming to separate pages that contain query matches from pages that contain no matches; 3) in the third phase, pages from top-ranked clusters are examined at the subtree level to locate and rank the query-related regions of the pages; and 4) the final phase uses a set of partitioning algorithms to separate and locate itemized objects in each QA-Pagelet.
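In the spirit of Thor's second phase, the sketch below clusters sampled resulting pages by a crude structural signature (tag frequencies) so that data-rich pages separate from no-match pages. The sample pages, the feature choice and the use of k-means are illustrative simplifications, not Thor's actual algorithms.

```python
# Sketch: group resulting pages by structural similarity so that pages with
# query matches fall into a different cluster than "no results" pages.
import re
from collections import Counter
from sklearn.cluster import KMeans

pages = [
    "<table><tr><td>Toyota Corolla</td></tr><tr><td>Toyota Camry</td></tr></table>",
    "<table><tr><td>Honda Civic</td></tr></table>",
    "<p>Sorry, no results were found.</p>",
    "<p>Your search returned no matches.</p>",
]

tags = ["table", "tr", "td", "p"]

def structure_vector(page):
    # Count opening tags as a very rough structural signature of the page.
    counts = Counter(re.findall(r"<([a-z]+)", page))
    return [counts.get(t, 0) for t in tags]

labels = KMeans(n_clusters=2, n_init=10).fit_predict(
    [structure_vector(p) for p in pages])
# Pages in the same cluster share a similar structure: one cluster gathers the
# data-rich pages, the other the no-match pages.
print(list(labels))
```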

Searching for Web Databases

The previous subsection discussed how to query web databases, assuming that search interfaces to the web databases of interest had already been discovered. Surprisingly, finding search interfaces to web databases is a challenging problem in itself. Indeed, with several hundred thousand databases available on the Web (Chang et al., 2004), one cannot be sure of knowing the most relevant databases even in a very specific domain. Realizing this, people have started to manually create web database collections such as the Molecular Biology Database Collection (Galperin, 2007). However, manual approaches are not practical when there are hundreds of thousands of databases; besides, since new databases are constantly being added, the freshness of a manually maintained collection is highly compromised. Barbosa and Freire (2005) proposed a new crawling strategy to automatically locate web databases. They built a form-focused crawler that uses three classifiers: a page classifier (which classifies pages as belonging to topics in a taxonomy), a link classifier (which identifies links that are likely to lead to pages with search interfaces in one or more steps), and a form classifier. The form classifier is a domain-independent binary classifier that uses a decision tree to determine whether a web form is searchable or non-searchable (e.g., forms for login, registration, leaving comments, discussion group interfaces, mailing list subscriptions, purchase forms, etc.).
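The sketch below illustrates the idea of such a form classifier under simplifying assumptions: the features (counts of field types and the text of the submit button) and the tiny training set are invented, and a scikit-learn decision tree stands in for the classifier actually trained by Barbosa and Freire.

```python
# Sketch of a domain-independent binary classifier that decides whether a web
# form is searchable or non-searchable. Features and training data are invented.
from sklearn.tree import DecisionTreeClassifier

def form_features(form):
    # form: dict with simple counts/flags computed from the parsed HTML form.
    return [
        form["num_text_fields"],
        form["num_password_fields"],          # typical of login forms
        form["num_selects_and_checkboxes"],
        int("search" in form["submit_text"].lower()),
    ]

training_forms = [
    ({"num_text_fields": 1, "num_password_fields": 0,
      "num_selects_and_checkboxes": 3, "submit_text": "Search"}, 1),      # searchable
    ({"num_text_fields": 1, "num_password_fields": 1,
      "num_selects_and_checkboxes": 0, "submit_text": "Log in"}, 0),      # login
    ({"num_text_fields": 4, "num_password_fields": 1,
      "num_selects_and_checkboxes": 1, "submit_text": "Register"}, 0),    # registration
    ({"num_text_fields": 2, "num_password_fields": 0,
      "num_selects_and_checkboxes": 5, "submit_text": "Find cars"}, 1),   # searchable
]

X = [form_features(f) for f, _ in training_forms]
y = [label for _, label in training_forms]
classifier = DecisionTreeClassifier(max_depth=3).fit(X, y)

new_form = {"num_text_fields": 1, "num_password_fields": 0,
            "num_selects_and_checkboxes": 2, "submit_text": "Search listings"}
print("searchable" if classifier.predict([form_features(new_form)])[0]
      else "non-searchable")
```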

FUTURE TRENDS

Typically, the search interface and the results of a search are located on two different web pages (as depicted in Figure 2). However, due to advances in web technologies, a web server may send back only the query results (in a specific format) rather than a whole generated page with results (Garrett, 2005). In this case, a page is populated with results on the client side: specifically, a client-side script embedded within the page containing the search form receives the data from the web server and modifies the page content by inserting the received data. The ever-growing use of client-side scripts will pose a technical challenge (Alvarez et al., 2004), since such scripts can modify or constrain the behavior of a search form and, moreover, can be responsible for the generation of resulting pages. Another technical challenge is dealing with non-HTML search interfaces, such as interfaces implemented as Java applets or in Flash (Adobe Flash). It is worth mentioning here that all approaches described in the previous section work with HTML forms only and do not take client-side scripts into account.

Web search forms are user-friendly rather than program-friendly interfaces to web databases. However, more and more web databases are beginning to provide APIs (i.e., program-friendly interfaces) to their content. The prevalence of APIs will solve most of the problems related to querying databases and, at the same time, will make the problem of integrating deep web data a very active area of research. For instance, one of the challenges we will face is how to select, among a large pool of databases, the few databases whose content is the most appropriate for a given user's query. Nevertheless, one can expect only a gradual migration towards accessing web databases via APIs and, thus, at least in the coming years, many web databases will still be accessible only through user-friendly web forms.
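As a small illustration of API-style access, the snippet below fetches query results as JSON from a hypothetical endpoint; the URL, parameters and response layout are assumptions, not any specific provider's API.

```python
# Sketch of program-friendly access: instead of scraping a generated HTML
# page, a client fetches only the query results (here JSON) from an API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def query_api(terms, endpoint="http://example.org/api/search"):
    with urlopen(endpoint + "?" + urlencode({"q": terms, "format": "json"})) as resp:
        return json.load(resp)

# Each result is already a structured record; no extraction from a
# data-intensive page is needed.
for record in query_api("toyota corolla").get("results", []):
    print(record)
```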

CONCLUSION

The existence and continued growth of the deep Web creates new challenges for web search engines, which are attempting to make all content on the Web easily accessible to all users (Madhavan et al., 2006). Though much research on querying web databases has emerged in recent years, there is still a great deal of work to be done. The deep Web will require more effective access techniques, since the traditional crawl-and-index techniques, which have been quite successful for unstructured web pages in the publicly indexable Web, may not be appropriate for the mostly structured data in the deep Web. Thus, a new field of research combining the methods and efforts of the database and information retrieval communities may emerge.

BIBLIOGRAPHY

Adobe Flash. Retrieved October 8, 2007, from http://www.adobe.com/products/flash/

Alvarez, M., Pan, A., Raposo, J., & Vina, J. (2004). Client-side deep web data extraction. IEEE Conference on e-Commerce Technology for Dynamic E-Business (CEC-East'04), 158-161.

Barbosa, L., & Freire, J. (2004). Siphoning hidden-web data through keyword-based interfaces. 19th Brazilian Symposium on Databases (SBBD'04), 309-321.

Barbosa, L., & Freire, J. (2005). Searching for hidden-web databases. 8th International Workshop on the Web and Databases (WebDB'05), 1-6.

Bergman, M. (2001). The deep Web: surfacing hidden value. Journal of Electronic Publishing, 7(1).

Callan, J., & Connell, M. (2001). Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2), 97-130.

Caverlee, J., & Liu, L. (2005). QA-Pagelet: data preparation techniques for large-scale data analysis of the deep Web. IEEE Transactions on Knowledge and Data Engineering, 17(9), 1247-1262.

Chang, K., He, B., Li, C., Patel, M., & Zhang, Z. (2004). Structured databases on the Web: observations and implications. SIGMOD Record, 33(3), 61-70.

Crescenzi, V., Mecca, G., & Merialdo, P. (2001). RoadRunner: towards automatic data extraction from large web sites. 27th International Conference on Very Large Data Bases (VLDB'01), 109-118.

Florescu, D., Levy, A., & Mendelzon, A. (1998). Database techniques for the World-Wide Web: a survey. SIGMOD Record, 27(3), 59-74.

Galperin, M. (2007). The molecular biology database collection: 2007 update. Nucleic Acids Research, 35(Database issue), 3-4.

Garrett, J. (2005). Ajax: a new approach to web applications. Adaptive Path. Retrieved October 8, 2007, from http://www.adaptivepath.com/publications/essays/archives/000385.php

Madhavan, J., Halevy, A., Cohen, S., Dong, X., Jeffery, S., Ko, D., & Yu, C. (2006). Structured data meets the Web: a few observations. IEEE Data Engineering Bulletin, 29(4), 19-26.

Ntoulas, A., Zerfos, P., & Cho, J. (2005). Downloading textual hidden web content through keyword queries. 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL'05), 100-109.

Raghavan, S., & Garcia-Molina, H. (2001). Crawling the hidden Web. 27th International Conference on Very Large Data Bases (VLDB'01), 129-138.

Shestakov, D., Bhowmick, S., & Lim, E.-P. (2005). DEQUE: querying the deep Web. Data & Knowledge Engineering, 52(3), 273-311.

Shestakov, D., & Salakoski, T. (2007). On estimating the scale of national deep Web. 18th International Conference on Database and Expert Systems Applications (DEXA'07), 780-789.

Wang, J., & Lochovsky, F. (2003). Data extraction and label assignment for web databases. 12th International Conference on World Wide Web (WWW'03), 187-196.

Wu, P., Wen, J.-R., Liu, H., & Ma, W.-Y. (2006). Query selection techniques for efficient crawling of structured web sources. 22nd International Conference on Data Engineering (ICDE'06), 47.

KEY TERMS

Deep Web (also hidden Web) – the part of the Web that consists of all web pages accessible via search interfaces.

Deep web site – a web site that provides access via a search interface to one or more back-end web databases.

Non-indexable Web (also invisible Web) – the part of the Web that is not indexed by the general-purpose web search engines.

Publicly indexable Web – the part of the Web that is indexed by the major web search engines.

Search interface (also search form) – a user-friendly interface located on a web site that allows a user to issue queries to a database.

Web database – a database accessible online via at least one search interface.

Web crawler – an automated program used by search engines to collect web pages to be indexed.