

Probe, Count, and Classify: Categorizing Hidden-Web Databases

Panagiotis G. Ipeirotis ([email protected])
Computer Science Dept., Columbia University

Luis Gravano ([email protected])
Computer Science Dept., Columbia University

Mehran Sahami ([email protected])
E.piphany, Inc.

    ABSTRACT

The contents of many valuable web-accessible databases are only accessible through search interfaces and are hence invisible to traditional web crawlers. Recent studies have estimated the size of this hidden web to be 500 billion pages, while the size of the crawlable web is only an estimated two billion pages. Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. In this paper, we introduce a method for automating this classification process by using a small number of query probes. To classify a database, our algorithm does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of our technique over collections of real documents, including over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.

    1. INTRODUCTION

As the World-Wide Web continues to grow at an exponential rate, the problem of accurate information retrieval in such an environment also continues to escalate. One especially important facet of this problem is the ability to not only retrieve static documents that exist on the web, but also effectively determine which searchable databases are most likely to contain the relevant information that a user is looking for. Indeed, a significant amount of information on the web cannot be accessed directly through links, but is available only as a response to a dynamically issued query to the search interface of a database. The results page for a query typically contains dynamically generated links to these documents. Traditional search engines cannot handle such interfaces and ignore the contents of these resources, since they only take advantage of the static link structure of the web to crawl and index web pages.

The magnitude of the importance of this problem is highlighted in recent studies [6] that estimated that the amount of information hidden behind such query interfaces outnumbers the documents of the ordinary web by two orders of magnitude. In particular, the study claims that the size of the hidden web is 500 billion pages, compared to only two billion pages of the ordinary web. Also, the contents of these databases are many times topically cohesive and of higher quality than those of ordinary web pages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM SIGMOD 2001, May 21-24, Santa Barbara, California, USA
Copyright 2001 ACM 1-58113-332-4/01/05 ...$5.00.

Even sites that have some static links that are crawlable by a search engine may have much more information available only through a query interface, as the following real example illustrates.

Example 1.: Consider the PubMed medical database from the National Library of Medicine, which stores medical bibliographic information and links to full-text journals accessible through the web. The query interface is available at http://www.ncbi.nlm.nih.gov/PubMed/. If we query PubMed for documents with the keyword cancer, PubMed returns 1,301,269 matches, corresponding to high-quality citations to medical articles. The abstracts and citations are stored locally at the PubMed site and are not distributed over the web. Unfortunately, the high-quality contents of PubMed are not crawlable by traditional search engines. A query1 on AltaVista2 that finds the pages in the PubMed site with the keyword cancer returns only 19,893 matches. This number not only is much lower than the number of PubMed matches reported above, but, additionally, the pages returned by AltaVista are links to other web pages on the PubMed site, not to articles in the PubMed database.

For dynamic environments (e.g., sports), querying a text database may be the only way to retrieve fresh articles that are relevant, since such articles are often not indexed by a search engine because they are too new, or because they change too often.

In this paper we concentrate on searchable web databases of text documents, regardless of whether their contents are crawlable or not. More specifically, for our purposes a searchable web database is a collection of text documents that is searchable through a web-accessible search interface. The documents in a searchable web database do not necessarily reside on a single centralized site, but can be scattered over several sites. Our focus is on text: 84% of all searchable databases on the web are estimated to provide access to text documents [6].

1 The query is cancer host:www.ncbi.nlm.nih.gov.

2 http://www.altavista.com


Other searchable sites offer access to other kinds of information (e.g., image databases and shopping/auction sites). A discussion on these sites is out of the scope of this paper.

In order to effectively guide users to the appropriate searchable web database, some web sites (described in more detail below) have undertaken the arduous task of manually classifying searchable web databases into a Yahoo!-like hierarchical categorization scheme. While we believe this type of categorization can be immensely helpful to web users trying to find information relevant to a given topic, it is hampered by the lack of scalability inherent in manual classification.

Consequently, in this paper we propose a method for the automatic categorization of searchable web databases into topic hierarchies using a combination of machine learning and database querying techniques. Through the use of query probing, we present a novel and efficient way to classify a searchable database without having to retrieve any of the actual documents in the database. We use machine learning techniques to initially build a rule-based classifier that has been trained to classify documents that may be hidden behind searchable interfaces. Rather than actually using this classifier to categorize individual documents, we transform the rules of the classifier into a set of query probes that can be sent to the search interface for various text databases. Our algorithm can then simply use the counts for the number of documents matching each query to make a classification decision for the topic(s) of the entire database, without having to analyze any of the actual documents in the database. This makes our approach very efficient and scalable.

By providing an efficient automatic means for the accurate classification of searchable text databases into topic hierarchies, we hope to alleviate the scalability problems of manual database classification, and make it easier for end-users to find the relevant information they are seeking on the web.

The contributions presented in this paper are organized as follows: In Section 2 we more formally define and provide various strategies for database classification. Section 3 presents the details of our query probing algorithm for database classification. In Sections 4 and 5 we provide the experimental setting and results, respectively, of our classification method and compare it with two existing methods for automatic database classification. The method we present is shown to be both more accurate as well as more efficient on the database classification task. Finally, Section 6 describes related work, and Section 7 provides conclusions and outlines future research directions.

2. TEXT-DATABASE CLASSIFICATION

As shown previously, the web hosts many collections of documents whose contents are only accessible through a search interface. In this section we discuss how we can organize the space of such searchable databases in a hierarchical categorization scheme. We first define appropriate classification schemes for such databases in Section 2.1, and then present alternative methods for text database categorization in Section 2.2.

2.1 Classification Schemes for Databases

Several commercial web directories have recently started to manually classify searchable web databases, so that users can browse these categories to find the databases of interest. Examples of such directories include InvisibleWeb3 and SearchEngineGuide4. Figure 1 shows a small fraction of InvisibleWeb's classification scheme.

[Figure 1: Portion of the InvisibleWeb classification scheme. The tree is rooted at Root, with top-level categories such as Arts, Computers, Health, Science, and Sports; Sports has subcategories including Baseball, Volleyball, and Basketball, with databases such as WNBA, CBS Sportsline, and ESPN assigned under them.]

Formally, we can define a hierarchical classification scheme like the one used by InvisibleWeb as follows:

Definition 1.: A hierarchical classification scheme is a rooted directed tree whose nodes correspond to (topic) categories and whose edges denote specialization. An edge from category v to another category v' indicates that v' is a subcategory of v.

Given a classification scheme, our goal is to automatically populate it with searchable databases, where we assign each database to the best category or categories in the scheme. For example, InvisibleWeb has manually assigned WNBA to the Basketball category in its classification scheme. In general we can define what category or categories are best for a given database in several different ways, according to the needs the classification will serve. We describe different such approaches next.

2.2 Alternative Classification Strategies

We now turn to the central issue of how to automatically assign databases to categories in a classification scheme, assuming complete knowledge of the contents of these databases. Of course, in practice we will not have such complete knowledge, so we will have to use the probing techniques of Section 3 to approximate the ideal classification definitions that we give next.

To assign a searchable web database to a category or set of categories in a classification scheme, one possibility is to manually inspect the contents of the database and make a decision based on the results of this inspection. Incidentally, this is the way in which commercial web directories like InvisibleWeb operate. This approach might produce good quality category assignments, but, of course, is expensive (it includes human participation) and does not scale well to the large number of searchable web databases.

Alternatively, we could follow a less manual approach and determine the category of a searchable web database based on the category of the documents it contains. We can formalize this approach as follows: Consider a web database D and a number of categories C1, . . . , Cn. If we knew the category Ci of each of the documents inside D, then we could use this information to classify database D in at least two different ways. A coverage-based classification will assign D to all categories for which D has sufficiently many documents. In contrast, a specificity-based classification will assign D to the categories that cover a significant fraction of D's holdings.

3 http://www.invisibleweb.com

4 http://www.searchengineguide.com


Example 2.: Consider topic category Basketball. CBS SportsLine has a large number of articles about basketball and covers not only women's basketball but other basketball leagues as well. It also covers other sports like football, baseball, and hockey. On the other hand, WNBA only has articles about women's basketball. The way that we will classify these sites depends on the use of our classification. Users who prefer to see only articles relevant to basketball might prefer a specificity-based classification and would like to have the site WNBA classified into node Basketball. However, these users would not want to have CBS SportsLine in this node, since this site has a large number of articles irrelevant to basketball. In contrast, other users might prefer to have only databases with a broad and comprehensive coverage of basketball in the Basketball node. Such users might prefer a coverage-based classification and would like to find CBS SportsLine in the Basketball node, which has a large number of articles about basketball, but not WNBA with only a small fraction of the total number of basketball documents.

More formally, we can use the number of documents fi in category Ci that we find in database D to define the following two metrics, which we will use to specify the ideal classification of D.

Definition 2.: Consider a web database D, a hierarchical classification scheme C, and a category Ci in C. Then the coverage of D for Ci, Coverage(D, Ci), is the number of documents in D in category Ci, fi:

Coverage(D, Ci) = fi

If Ck is the parent of Ci in C, then the specificity of D for Ci, Specificity(D, Ci), is the fraction of Ck documents in D that are in category Ci. More formally, we have:

Specificity(D, Ci) = fi / |Documents in D about Ck|

As a special case, Specificity(D, root) = 1.

Specificity(D, Ci) gives a measure of how focused the database D is on a subcategory Ci of Ck. The value of Specificity ranges between 0 and 1. Coverage(D, Ci) defines the absolute amount of information that database D contains about category Ci.5 For notational convenience we define:

Coverage(D) = (Coverage(D, Ci1), . . . , Coverage(D, Cim))

Specificity(D) = (Specificity(D, Ci1), . . . , Specificity(D, Cim))

when the set of categories {Ci1, . . . , Cim} is clear from the context.
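To make Definition 2 concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the function and variable names are hypothetical) that computes Coverage and Specificity for the subcategories of a single parent category Ck, assuming we could inspect the label of every document in D that is about Ck:

    # Sketch of Definition 2 for the subcategories of one parent category Ck.
    from collections import Counter

    def coverage_and_specificity(doc_labels, subcategories):
        # doc_labels: one subcategory label per document of D that is about Ck.
        counts = Counter(doc_labels)
        total = len(doc_labels)                      # |documents in D about Ck|
        coverage = {c: counts.get(c, 0) for c in subcategories}    # f_i
        specificity = {c: (coverage[c] / total if total else 0.0) for c in subcategories}
        return coverage, specificity

    # Example: a database whose documents about Sports are mostly basketball articles.
    cov, spec = coverage_and_specificity(
        ["Basketball"] * 800 + ["Baseball"] * 150 + ["Volleyball"] * 50,
        ["Basketball", "Baseball", "Volleyball"])
    # cov["Basketball"] == 800 and spec["Basketball"] == 0.8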

Now, we can use the Specificity and Coverage values to decide how to classify D into one or more categories in the classification scheme. As described above, a specificity-based classification would classify a database into a category when a significant fraction of the documents it contains are of this specific category. Alternatively, a coverage-based classification would classify a database into a category when the database has a substantial number of documents in the given category.

5 It would be possible to normalize Coverage values to be between 0 and 1 by dividing fi by the total number of documents in category Ci across all databases. Although intuitively appealing (Coverage would then measure the fraction of the universally available information about Ci that is stored in D), this definition is unstable, since each insertion, deletion, or modification of a web database changes the Coverage of the other available databases.

In general, however, we are interested in balancing both Specificity and Coverage through the introduction of two associated thresholds, s and c, respectively, as captured in the following definition.

Definition 3.: Consider a classification scheme C with categories C1, . . . , Cn, and a searchable web database D. The ideal classification of D in C is the set Ideal(D) of categories Ci that satisfy the following conditions:

Specificity(D, Ci) ≥ s.

Specificity(D, Cj) ≥ s for all ancestors Cj of Ci.

Coverage(D, Ci) ≥ c.

Coverage(D, Cj) ≥ c for all ancestors Cj of Ci.

Coverage(D, Ck) < c or Specificity(D, Ck) < s for all children Ck of Ci.

with 0 ≤ s ≤ 1 and c ≥ 1 given thresholds.
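For illustration only, the conditions of Definition 3 can be transcribed almost verbatim into code. The sketch below is ours (the parent/children maps and function names are assumptions, not part of the paper) and checks every category against the conditions, given exact Coverage and Specificity values and the two thresholds:

    # Sketch of Definition 3: a category is in Ideal(D) if it and all its ancestors
    # pass both thresholds while none of its children passes both.
    def ideal_classification(categories, parent, children, coverage, specificity, ts, tc):
        def passes(c):
            return specificity[c] >= ts and coverage[c] >= tc
        def ancestors_pass(c):
            p = parent.get(c)
            return True if p is None else (passes(p) and ancestors_pass(p))
        return {c for c in categories
                if passes(c)
                and ancestors_pass(c)
                and not any(passes(k) for k in children.get(c, []))}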

The ideal classification definition given above provides alternative approaches for populating a hierarchical classification scheme with searchable web databases, depending on the values of the thresholds s and c. A low value for the specificity threshold s will result in a coverage-based classification of the databases. Similarly, a low value for the coverage threshold c will result in a specificity-based classification of the databases. The values of choice for s and c are ultimately determined by the intended use and audience of the classification scheme. Next, we introduce a technique for automatically populating a classification scheme according to the ideal classification of choice.

3. CLASSIFYING DATABASES THROUGH PROBING

In the previous section we defined how to classify a database based on the number of documents that it contains in each category. Unfortunately, databases typically do not export such category-frequency information. In this section we describe how we can approximate this information for a given database without accessing its contents. The whole procedure is divided into two parts: First we train our system for a given classification scheme and then we probe each database with queries to decide the categories to which it should be assigned. More specifically, we follow the algorithm below:

1. Train a rule-based document classifier with a set of preclassified documents (Section 3.1).

2. Transform classifier rules into queries (Section 3.2).

3. Adaptively issue queries to databases, extracting and adjusting the number of matches for each query using the classifier's confusion matrix (Section 3.3).

4. Classify databases using the adjusted number of query matches (Section 3.4).

3.1 Training a Document Classifier

Our database classification technique relies on a rule-based document classifier to create the probing queries, so our first step is to train such a classifier. We use supervised learning to construct a rule-based classifier from a set of preclassified documents. The resulting classifier is a set of logical rules whose antecedents are conjunctions of words, and whose consequents are the category assignments for each document. For example, the following rules are part of a classifier for the three categories Sports, Health, and Computers:


IF ibm AND computer THEN Computers
IF jordan AND bulls THEN Sports
IF diabetes THEN Health
IF cancer AND lung THEN Health
IF intel THEN Computers

Such rules are used to classify previously unseen documents (i.e., documents not in the training set). For example, the first rule would classify all documents containing the words ibm and computer into the category Computers.

Definition 4.: A rule-based document classifier for a flat set of categories C = {C1, . . . , Cn} consists of a set of rules pk -> Clk, k = 1, . . . , m, where pk is a conjunction of words and Clk is a category in C. A document d matches a rule pk -> Clk if all the words in that rule's antecedent, pk, appear in d. If a document d matches multiple rules with different classification decisions, the final classification decision depends on the specific implementation of the rule-based classifier.
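As a toy illustration of Definition 4 (our own code, not RIPPER output), each rule can be stored as a set of antecedent words plus a category, and a document matches a rule when it contains every antecedent word:

    # Toy rule-based matcher in the spirit of Definition 4.
    rules = [
        ({"ibm", "computer"}, "Computers"),
        ({"jordan", "bulls"}, "Sports"),
        ({"diabetes"}, "Health"),
        ({"cancer", "lung"}, "Health"),
        ({"intel"}, "Computers"),
    ]

    def matching_categories(document_text, rules):
        words = set(document_text.lower().split())
        return [category for antecedent, category in rules if antecedent <= words]

    # matching_categories("ibm ships a new computer", rules) -> ["Computers"]
    # Ties between rules with different categories are left to the specific
    # classifier implementation, as the definition notes.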

To define a document classifier over an entire hierarchical classification scheme (Definition 1), we train one flat rule-based document classifier for each internal node of the hierarchy. To produce a rule-based document classifier with a concise set of rules, we follow a sequence of steps, described below.

The first step, which helps both efficiency and effectiveness, is to eliminate from the training set all words that appear very frequently in the training documents, as well as very infrequently appearing words. This initial feature selection step is based on Zipf's law [32]. Very frequent words are usually auxiliary words that bear no information content (e.g., "am", "and", "so" in English). Infrequently occurring words are not very helpful for classification either, because they appear in so few documents that there are no significant accuracy gains from including such terms in a classifier.

The elimination of words dictated by Zipf's law is a form of feature selection. However, frequency information alone is not, after some point, a good indicator to drive the feature selection process further. Thus, we use an information theoretic feature selection algorithm that eliminates the terms that have the least impact on the class distribution of documents [16, 15]. This step eliminates the features that either do not have enough discriminating power (i.e., words that are not strongly associated with one specific category) or features that are redundant given the presence of another feature. Using this algorithm we decrease the number of features in a principled way and we can use a much smaller subset of words to create the classifier, with minimal loss in accuracy. Additionally, the remaining features are generally more useful for classification purposes, so rules constructed from these features will tend to be more meaningful for general use.

After selecting the features (i.e., words) that we will use for building the document classifier, we use RIPPER, a tool built at AT&T Research Laboratories [4], to actually learn the classifier rules. Once we have trained a document classifier, we could use it to classify all the documents in a database of interest. We could then classify the database itself according to the number of documents that it contains in each category, as described in Section 2. Of course, this requires having access to the whole contents of the database, which is not a realistic requirement for web databases. We relax this requirement next.

3.2 Defining Query Probes from a Document Classifier

In this section we show how we can map the classification rules of a document classifier into query probes that will help us estimate the number of documents for each category of interest in a searchable web database.

To simulate the behavior of a rule-based classifier over all documents of a database, we map each rule pk -> Clk of the classifier into a boolean query qk that is the conjunction of all words in pk. Thus, if we send the query probe qk to the search interface of a database D, the query will match exactly the f(qk) documents in the database D that would have been classified by the associated rule into category Clk. For example, we map the rule IF jordan AND bulls THEN Sports into the boolean query jordan AND bulls. We expect this query to retrieve mostly documents in the Sports category. Now, instead of retrieving the documents themselves, we just keep the number of matches reported for this query (it is quite common for a database to start the results page with a line like "X documents found"), and use this number as a measure of how many documents in the database match the condition of this rule.

From the number of matches for each query probe, we can construct a good approximation of the Coverage and Specificity vectors for a database (Section 2). We can approximate the number of documents fj in Cj in D as the total number of matches gj for the Cj query probes. The result approximates the distribution of categories of the D documents. Using this information we can approximate the Coverage and Specificity vectors as follows:

Definition 5.: Consider a searchable web database D and a rule-based classifier for a set of categories C. For each query probe q derived from the classifier, database D returns the number of matches f(q). Then the estimated coverage of D for a category Ci in C, ECoverage(D, Ci), is the total number of matches for the Ci query probes:

ECoverage(D, Ci) = sum of f(q) over every query probe q for Ci

The estimated specificity of D for Ci, ESpecificity(D, Ci), is:

ESpecificity(D, Ci) = ECoverage(D, Ci) / (sum of f(q) over every query probe q for any Cj)

For notational convenience we define:

ECoverage(D) = (ECoverage(D, Ci1), . . . , ECoverage(D, Cim))

ESpecificity(D) = (ESpecificity(D, Ci1), . . . , ESpecificity(D, Cim))

when the set of categories {Ci1, . . . , Cim} is clear from the context.

Example 3.: Consider a small rule-based document classifier for categories C1 = Sports, C2 = Computers, and C3 = Health consisting of the five rules listed in Section 3.1. Suppose that we want to classify the ACM Digital Library database. We send the query ibm AND computer, which results in 6646 matching documents (Figure 2). The other four queries return the matches described in the figure. Using these numbers we estimate that the ACM Digital Library has 0 documents about Sports, 6646+2380 = 9026 documents about Computers, and 18+34 = 52 documents about Health. Thus, its ECoverage(ACM) vector for this set of categories is (0, 9026, 52) and the ESpecificity(ACM) vector is (0/(0+9026+52), 9026/(0+9026+52), 52/(0+9026+52)).


[Figure 2: Sending probes to the ACM Digital Library database with queries derived from a document classifier. The probes and their reported match counts are: ibm AND computer (6646 matches), intel (2380), jordan AND bulls (0), and the two Health probes diabetes and cancer AND lung (34 and 18 matches), as used in Example 3.]
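The arithmetic of Example 3 and Definition 5 is easy to reproduce. The following sketch is illustrative only (the probe-to-count pairing of the two Health probes is taken from the figure and does not affect the totals); it estimates ECoverage and ESpecificity from nothing but the reported match counts:

    # Sketch of Definition 5: estimate category coverage/specificity from the
    # number of matches reported for each query probe, without retrieving documents.
    probe_results = {                     # query probe -> (category, reported matches)
        "ibm AND computer": ("Computers", 6646),
        "intel":            ("Computers", 2380),
        "jordan AND bulls": ("Sports", 0),
        "diabetes":         ("Health", 34),
        "cancer AND lung":  ("Health", 18),
    }

    def estimate_vectors(probe_results, categories):
        ecoverage = {c: 0 for c in categories}
        for category, matches in probe_results.values():
            ecoverage[category] += matches
        total = sum(ecoverage.values())
        especificity = {c: (ecoverage[c] / total if total else 0.0) for c in categories}
        return ecoverage, especificity

    ecov, espec = estimate_vectors(probe_results, ["Sports", "Computers", "Health"])
    # ecov == {"Sports": 0, "Computers": 9026, "Health": 52}, as in Example 3.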

3.3 Adjusting Probing Results

Our goal is to get the exact number of documents in each category for a given database. Unfortunately, if we use classifiers to automate this process, then the final result may not be perfect. Classifiers can misclassify documents into incorrect categories, and may not classify some documents at all if those documents do not match any rules. Thus, we need to adjust our initial probing results to account for such potential errors.

It is common practice in the machine learning community to report the document classification results as a confusion matrix [21]. We adapt this notion of a confusion matrix to match our probing scenario:

Definition 6.: The normalized confusion matrix M = (mij) of a set of query probes for categories C1, . . . , Cn is an n x n matrix, where mij is the probability of a document in category Cj being counted as a match by a query probe for category Ci. Usually, the sum of mij over i = 1, . . . , n is at most 1, because there is a non-zero probability that a document from Cj will not match any query probe.

The algorithm to create the normalized confusion matrix M is:

    The algorithm to create the normalized confusion matrixM is:

    1. Generate the query probes from the classifier rules andprobe a database of unseen, preclassified documents(i.e., the test set).

    2. Create an auxiliary confusion matrix X = (xij) andset xij equal to the number of documents from Cjthat were retrieved from probes ofCi.

    3. Normalize the columns ofXby dividing columnj withthe number of documents in the test set in categoryCj. The result is the normalized confusion matrix M.
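These three steps translate almost directly into code. The sketch below is our own (it assumes the probe hit counts against the preclassified test set are already tallied) and reproduces the numbers of Example 4 below:

    import numpy as np

    # probe_hits[i][j]: number of test documents of true category Cj that matched
    # some query probe for category Ci (the auxiliary matrix X).
    def normalized_confusion_matrix(probe_hits, docs_per_category):
        X = np.array(probe_hits, dtype=float)
        M = X / np.array(docs_per_category, dtype=float)   # divide column j by |Cj|
        return M

    X = [[600, 100, 200],      # test set of Example 4: 1000 Sports,
         [100, 2000, 150],     # 2500 Computers, and 1600 Health documents
         [50, 200, 1000]]
    M = normalized_confusion_matrix(X, [1000, 2500, 1600])
    # M[0] == [0.6, 0.04, 0.125], M[1] == [0.1, 0.8, 0.09375], M[2] == [0.05, 0.08, 0.625]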

Example 4.: Suppose that we have a document classifier for categories C1 = Sports, C2 = Computers, and C3 = Health. Consider 5100 unseen, pre-classified documents with 1000 documents about Sports, 2500 documents about Computers, and 1600 documents about Health. After probing this set with the query probes generated from the classifier, we construct the following confusion matrix:

    M = | 600/1000   100/2500   200/1600  |   | 0.60  0.04  0.125   |
        | 100/1000   2000/2500  150/1600  | = | 0.10  0.80  0.09375 |
        | 50/1000    200/2500   1000/1600 |   | 0.05  0.08  0.625   |

Element m23 = 150/1600 indicates that 150 C3 documents mistakenly matched probes of C2 and that there are a total of 1600 documents in category C3. The diagonal of the matrix gives the probability that documents that matched query probes were assigned to the correct category. For example, m11 = 600/1000 indicates that the probability that a C1 document is correctly counted as a match for a query probe for C1 is 0.6.

Interestingly, multiplying the confusion matrix with the Coverage vector representing the correct number of documents for each category in the test set yields, by definition, the ECoverage vector with the number of documents in each category in the test set as matched by the query probes.

Example 4.: (cont.) The Coverage vector with the actual number of documents in the test set T for each category is Coverage(T) = (1000, 2500, 1600). By multiplying M by this vector we get the distribution of T documents in the categories as estimated by the query probing results:

    | 0.60  0.04  0.125   |   | 1000 |   | 900  |
    | 0.10  0.80  0.09375 | x | 2500 | = | 2250 |
    | 0.05  0.08  0.625   |   | 1600 |   | 1250 |
             M             Coverage(T)  ECoverage(T)

Proposition 1.: The normalized confusion matrix M is invertible when the document classifier used to generate M classifies each document correctly with probability > 0.5.

Proof: From the assumption on the document classifier, it follows that mii is larger than the sum of the off-diagonal entries mji in column i. Hence, M is a diagonally dominant matrix with respect to columns. Then the Gershgorin disk theorem [14] indicates that M is invertible. We note that the condition that a classifier have better than 0.5 probability of correctly classifying each document is in most cases true, but a full discussion of this point is beyond the scope of this paper.

Proposition 1, together with the observation in Example 4, suggests a way to adjust probing results to compensate for classification errors. More specifically, for an unseen database D that follows the data distribution in our training collections it follows that:

M x Coverage(D) = ECoverage(D)

Then, multiplying by the inverse of M we have:

Coverage(D) = M^-1 x ECoverage(D)

Hence, during the classification of a database D, we will multiply M^-1 by the probing results summarized in vector ECoverage(D) to obtain a better approximation of the actual Coverage(D) vector. We will refer to this adjustment technique as Confusion Matrix Adjustment, or CMA for short.
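Concretely, CMA amounts to solving one small linear system. The sketch below is illustrative (it reuses the matrix and vectors of Example 4); solving with M rather than explicitly forming its inverse is our own numerical nicety and is not prescribed by the paper:

    import numpy as np

    # Confusion Matrix Adjustment: Coverage(D) = M^-1 x ECoverage(D).
    M = np.array([[0.60, 0.04, 0.125],
                  [0.10, 0.80, 0.09375],
                  [0.05, 0.08, 0.625]])
    ecoverage = np.array([900.0, 2250.0, 1250.0])    # matches counted by the probes

    adjusted_coverage = np.linalg.solve(M, ecoverage)
    # adjusted_coverage is approximately [1000, 2500, 1600], the true counts.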

3.4 Using Probing Results for Classification

So far we have seen how to accurately approximate the document category distribution in a database. We now describe a probing strategy to classify a database using these results.

We classify databases in a top-to-bottom way. Each database is first classified by the root-level classifier and is then recursively pushed down to the lower level classifiers.


A database D is pushed down to the category Cj when both ESpecificity(D, Cj) and ECoverage(D, Cj) are no less than thresholds es (for specificity) and ec (for coverage), respectively. These thresholds will typically be equal to the s and c thresholds used for the Ideal classification. The final set of categories in which we classify D is the approximate classification of D in C.

Definition 7.: Consider a classification scheme C with categories C1, . . . , Cn and a searchable web database D. If ESpecificity(D) and ECoverage(D) are the approximations of the ideal Specificity(D) and Coverage(D) vectors, respectively, the approximate classification of D in C, Approximate(D), consists of each category Ci such that:

ESpecificity(D, Ci) >= es.

ESpecificity(D, Cj) >= es for all ancestors Cj of Ci.

ECoverage(D, Ci) >= ec.

ECoverage(D, Cj) >= ec for all ancestors Cj of Ci.

ECoverage(D, Ck) < ec or ESpecificity(D, Ck) < es for all children Ck of Ci.

with 0 <= es <= 1 and ec >= 1 given thresholds.

The algorithm that computes this set is in Figure 3. To classify a database D in a hierarchical classification scheme, we call Classify(root, D).

Classify(Category C, Database D) {
    Result = empty set
    if (C is a leaf node) then return {C}
    Probe database D with the query probes derived
        from the classifier for the subcategories of C
    Calculate ECoverage from the number of matches for the probes
    ECoverage(D) = M^-1 x ECoverage(D)    // CMA
    Calculate the ESpecificity vector for C
    for all subcategories Ci of C
        if (ESpecificity(D, Ci) >= es AND ECoverage(D, Ci) >= ec)
            then Result = Result union Classify(Ci, D)
    if (Result is empty)
        then return {C}    // D was not pushed down
        else return Result
}

Figure 3: Algorithm for classifying a database D into the category subtree rooted at category C.
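For readers who prefer runnable code to pseudocode, here is a hedged Python rendering of the Figure 3 algorithm. The node and database objects (children, probes per subcategory, inverse confusion matrix, match_count) are placeholder structures of ours, standing in for whatever trained classifiers and wrappers are actually used:

    import numpy as np

    def classify(node, db, es, ec):
        # Sketch of Figure 3; node.children, node.probes, node.M_inv and
        # db.match_count() are assumed helper structures, not from the paper.
        if not node.children:                      # C is a leaf node
            return {node.name}
        # Probe D with the query probes for the subcategories of C and count matches.
        raw = np.array([sum(db.match_count(q) for q in node.probes[child.name])
                        for child in node.children], dtype=float)
        ecoverage = node.M_inv @ raw               # confusion matrix adjustment (CMA)
        total = ecoverage.sum()
        especificity = ecoverage / total if total else np.zeros_like(ecoverage)
        result = set()
        for child, cov, spec in zip(node.children, ecoverage, especificity):
            if spec >= es and cov >= ec:
                result |= classify(child, db, es, ec)   # push D down to this child
        return result if result else {node.name}  # D was not pushed down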

Example 5.: Figure 4 shows how we categorized the ACM Digital Library database. Each node is annotated with the ECoverage and ESpecificity estimates determined from query probes. The subset of the hierarchy that we explored with these probes depends on the es and ec thresholds of choice, which for this case were es = 0.5 and ec = 100. For example, the subtree rooted at node Science was not explored, because the ESpecificity of this node, 0.042, is less than es. Intuitively, although we estimated that around 430 documents in the collection are generally about Science, this was not the focus of the database and hence the low ESpecificity value. In contrast, the Computers subtree was further explored because of its high ECoverage (9919) and ESpecificity (0.95), but not beyond its children, since their ESpecificity values are less than es. Hence the database is classified in Approximate = {Computers}.

[Figure 4: Classifying the ACM Digital Library database. Each node is annotated with its (ECoverage, ESpecificity) estimates: under Root, Arts (0, 0), Computers (9919, 0.95), Health (0, 0), Science (430, 0.042), and Sports (22, 0.008); under Computers, Hardware (2709, 0.465), Programming (1042, 0.18), and Software (2060, 0.355); under Programming, C/C++, Java, Perl, and Visual Basic.]

4. EXPERIMENTAL SETTING

We now describe the data (Section 4.1), techniques we compare (Section 4.2), and metrics (Section 4.3) of our experimental evaluation.

4.1 Data Collections

To evaluate our classification techniques, we first define a comprehensive classification scheme (Section 2.1) and then build text classifiers using a set of preclassified documents. We also specify the databases over which we tuned and tested our probing techniques.

Rather than defining our own classification scheme arbitrarily from scratch, we instead rely on that of existing directories. More specifically, for our experiments we picked the five largest top-level categories from Yahoo!, which were also present in InvisibleWeb. These categories are Arts, Computers, Health, Science, and Sports. We then expanded these categories up to two more levels by selecting the four largest Yahoo! subcategories also listed in InvisibleWeb. (InvisibleWeb largely agrees with Yahoo! on the top-level categories in their classification scheme.) The resulting three-level classification scheme consists of 72 categories, 54 of which are leaf nodes in the hierarchy. A small fraction of the classification scheme was shown in Figure 4.

To train a document classifier over our hierarchical classification scheme we used postings from newsgroups that we judged relevant to our various leaf-level categories. For example, the newsgroups comp.lang.c and comp.lang.c++ were considered relevant to category C/C++. We collected 500,000 articles from April through May 2000. 54,000 of these articles, 1,000 per leaf category, were used to train RIPPER to produce a rule-based document classifier, and 27,000 articles were set aside as a test collection for the classifier (500 articles per leaf category). We used the remaining 419,000 articles to build Controlled databases as we report below.

To evaluate database classification strategies we use two kinds of databases: Controlled databases that we assembled locally and that allowed us to perform a variety of sophisticated studies, and real Web databases:

Controlled Database Set: We assembled 500 databases using 419,000 newsgroup articles not used in the classifier training. As before, we assume that each article is labeled with one category from our classification scheme, according to the newsgroup where it originated. Thus, an article from newsgroups comp.lang.c or comp.lang.c++ will be regarded as relevant to category C/C++, since these newsgroups were assigned to category C/C++. The size of the 500 Controlled databases that we created ranged from 25 to 25,000 documents. Out of the 500 databases, 350 are homogeneous, with documents from a single category, while the remaining 150 are heterogeneous, with a variety of category mixes. We define a database as homogeneous when it has articles from only one node, regardless of whether this node is a leaf node or not. If it is not a leaf node, then it has an equal number of articles from each leaf node in its subtree. The heterogeneous databases, on the other hand, have documents from different categories that reside in the same level in the hierarchy (not necessarily siblings), with different mixture percentages. We believe that these databases model real-world searchable web databases, with a variety of sizes and foci. These databases were indexed and queried by a SMART-based program [25] supporting both boolean and vector-space retrieval models.

Web Database Set: We also evaluate our techniques on real web-accessible databases over which we do not have any control. We picked the first five databases listed in the InvisibleWeb directory under each node in our classification scheme (recall that our classification scheme is a portion of InvisibleWeb). This resulted in 130 real web databases. (Some of the lower level nodes in the classification scheme have fewer than five databases assigned to them.) 12 databases out of the 130 have articles that were newsgroup-style discussions similar to the databases in the Controlled set, while the other 118 databases have articles of various styles, ranging from research papers to film reviews. For each database in the Web set, we constructed a simple wrapper to send a query and get back the number of matches for each query, which is the only information that our database classification procedure requires. Table 1 shows a sample of five databases from the Web set.

4.2 Techniques for Comparison

We tested variations of our probing technique, which we refer to as Probe and Count, against two alternative strategies. The first one is an adaptation of the technique described in [2], which we refer to as Document Sampling. The second one is a method described in [29] that was specifically designed for database classification. We will refer to this method as Title-based Querying. The methods are described in detail below.

Probe and Count (PnC): This is our technique, described in Section 3, which uses a document classifier for each internal node of our hierarchical classification scheme. Several parameters and options are involved in the training of the document classifiers. For feature selection, we start by eliminating from consideration any word in a list of 400 very frequent words (e.g., "a", "the") from the SMART [25] information retrieval system. We then further eliminate all infrequent words that appeared in fewer than three documents. We treated the root node of the classification scheme as a special case, since it covers a much broader spectrum of documents. For this node, we only eliminated words that appeared in fewer than five documents. Also, we considered applying the information theoretic feature selection algorithm from [16, 15]. We studied the performance of our system without this feature selection step (FS=off) or with this step, in which we kept only the top 10% most discriminating words (FS=on). The main parameters that can be varied in our database classification technique are thresholds ec (for coverage) and es (for specificity). Different values for these thresholds result in different approximations Approximate(D) of the ideal classification Ideal(D).

Document Sampling (DS): Callan et al. [2] use query probing to automatically construct a language model of a text database (i.e., to extract the vocabulary and associated word-frequency statistics). Queries are sent to the database to retrieve a representative random document sample. The documents retrieved are analyzed to extract the words that appear in them. Although this technique was not designed for database classification, we decided to adapt it to our task as follows:

1. Pick a random word from a dictionary and send a one-word query to the database in question.

2. Retrieve the top-N documents returned by the database for the query.

3. Extract the words from each document and update the list and frequency of the words accordingly.

4. If a termination condition is met, go to Step 5; else go to Step 1.

5. Use a modification of the algorithm in Figure 3 that probes the sample document collection rather than the database itself.

For Step 1, we use a random word from the approximately 100,000 words in our newsgroup collection. For Step 2, we use N = 4, which is the value that Callan et al. recommend in [2]. Finally, we use the termination condition in Step 4 also as described in [2]: the algorithm terminates when the vocabulary and frequency statistics associated with the sample document collection converge to a reasonably stable state (see [2] for details). At this point, the adapted technique can proceed almost identically as in Section 3.4 by probing the locally stored document sample rather than the original database. A crucial difference between the Document Sampling technique and our Probe and Count technique is that we only use the number of matches reported by each database, while the Document Sampling technique requires retrieving and analyzing the actual documents from the database for the key Step 4 termination condition test.

Title-based Querying (TQ): Wang et al. [29] present three different techniques for the classification of searchable web databases. For our experimental evaluation we picked the method they deemed best. Their technique creates one long query for each category using the title of the category itself (e.g., "Baseball") augmented by the titles of all of its subcategories. For example, the query for category Baseball is "baseball mlb teams minor leagues stadiums statistics college university...". The query for each category is sent to the database in question, the top ranked results are retrieved, and the average similarity [25] of these documents and the query defines the similarity of the database with the category. The database is then classified into the categories that are most similar with it. The details of the algorithm are described below.

1. For each category Ci:

(a) Create an associated concept query, which is simply the title of the category augmented with the titles of its subcategories.

(b) Send the concept query to the database in question.

(c) Retrieve the top-N documents returned by the database for this query.

(d) Calculate the similarity of these N documents with the query. The average similarity will be the similarity of the database with category Ci.


URL                                              Brief Description             InvisibleWeb Category
http://www.cancerbacup.org.uk/search.shtml       CancerBACUP                   Cancer
http://search.java.sun.com                       Java@Sun                      Java
http://hopkins-aids.edu/index search.html        Johns Hopkins AIDS service    AIDS
http://www.agiweb.org/htdig/search.html          American Geological Inst.     Earth Science
http://mathCentral.uregina.ca/QQ/QQsearch.html   MathCentral                   Mathematics

Table 1: Some of the real web databases in the Web set.

2. Rank the categories in order of decreasing similarity with the database.

3. Assign the database to the top-K categories of the hierarchy.

To create the concept queries of Step 1, we augmented our hierarchy with an extra level of titles, as described in [29]. For Step 1(c) we used the value N = 10, as recommended by the authors. We used the cosine similarity function with tf.idf weighting [24]. Unfortunately, the value of K for Step 3 is left as an open parameter in [29]. We decided to favor this technique in our experiments by revealing to it the correct number of categories into which each database should be classified. Of course this information would not be available in a real setting, and was not provided to our Probe and Count or the Document Sampling techniques.

4.3 Evaluation Metrics

To quantify the accuracy of category frequency estimates for database D we measure the absolute error (i.e., Manhattan distance) between the approximation vectors ECoverage(D) and ESpecificity(D), and the correct vectors Coverage(D) and Specificity(D). These metrics are especially revealing during tuning (Section 5.1). The error metric alone, however, cannot give an accurate picture of the system's performance, since the main objective of our system is to classify databases correctly and not to estimate the Specificity and Coverage vectors, which are only auxiliary for this task.

We evaluate classification algorithms by comparing the approximate classification Approximate(D) that they produce against the ideal classification Ideal(D). We could just report the fraction of the categories in Approximate(D) that are correct (i.e., that also appear in Ideal(D)). However, this would not capture the nuances of hierarchical classification. For example, we may have classified a database in category Sports, while it is a database about Basketball. The metric above would consider this classification as absolutely wrong, which is not appropriate since, after all, Basketball is a subcategory of Sports. With this in mind, we adapt the precision and recall metrics from information retrieval [3]. We first introduce an auxiliary definition. Given a set of categories N, we expand it by including all the subcategories of the categories in N. Thus Expanded(N) = {c in C | c is in N or c is in a subtree of some n in N}. Now, we can define precision and recall as follows.

Definition 8.: Consider a database D classified into the set of categories Ideal(D), and an approximation of Ideal(D) given in Approximate(D). Let Correct = Expanded(Ideal(D)) and Classified = Expanded(Approximate(D)). Then the precision and recall of the approximate classification of D are:

precision = |Correct intersected with Classified| / |Classified|

recall = |Correct intersected with Classified| / |Correct|

To condense precision and recall into one number, we use the F1-measure metric [28],

F1 = (2 x precision x recall) / (precision + recall)

which is only high when both precision and recall are high, and is low for design options that trivially obtain high precision by sacrificing recall or vice versa.

Example 6.: Consider the classification scheme in Figure 4. Suppose that the ideal classification for a database D is Ideal(D) = {Programming}. Then the Correct set of categories includes Programming and all its subcategories, namely C/C++, Perl, Java, and Visual Basic. If we approximate Ideal(D) as Approximate(D) = {Java} using the algorithm in Figure 3, then we do not manage to capture all categories in Correct. In fact we miss four out of five such categories and hence recall = 0.2 for this database and approximation. However, the only category in our approximation, Java, is a correct one, and hence precision = 1. The F1-measure summarizes recall and precision in one number: F1 = (2 x 1 x 0.2) / (1 + 0.2) = 0.33.
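The bookkeeping of Definition 8 and Example 6 can be checked with a few lines of code (our own helper names; the descendants map encodes the subtree relation of the classification scheme):

    # Sketch of Definition 8: expand both category sets, then compute
    # precision, recall, and the F1-measure.
    def expanded(categories, descendants):
        out = set(categories)
        for c in categories:
            out |= descendants.get(c, set())       # add every subcategory of c
        return out

    def precision_recall_f1(ideal, approximate, descendants):
        correct = expanded(ideal, descendants)
        classified = expanded(approximate, descendants)
        overlap = len(correct & classified)
        precision = overlap / len(classified)
        recall = overlap / len(correct)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    descendants = {"Programming": {"C/C++", "Perl", "Java", "Visual Basic"}}
    precision_recall_f1({"Programming"}, {"Java"}, descendants)
    # -> (1.0, 0.2, 0.333...), matching Example 6.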

An important property of classification strategies over the web is scalability. We measure the efficiency of the various techniques that we compare by modelling their cost. More specifically, the main cost we quantify is the number of interactions required with the database to be classified, where each interaction is either a query submission (needed for all three techniques) or the retrieval of a database document (needed only for Document Sampling and Title-based Querying). Of course, we could include other costs in the comparison (namely, the cost of parsing the results and processing them), but we believe that they would not affect our conclusions, since these costs are CPU-based and small compared to the cost of interacting with the databases over the Internet.

5. EXPERIMENTAL RESULTS

We now report experimental results that we used to tune our system (Section 5.1) and to compare the different classification alternatives, both for the Controlled database set (Section 5.2) and for the Web database set (Section 5.3).

5.1 Tuning the Probe and Count Technique

Our Probe and Count technique has some open parameters that we tuned experimentally by using a set of 100 Controlled databases (Section 4.1). These databases will not participate in any of the subsequent experiments.

Effect of Confusion Matrix Adjustment (CMA): The first parameter we examined was whether the confusion matrix adjustment of the probing results was helpful or not. The absolute error associated with the ECoverage and ESpecificity approximations was consistently smaller when we used CMA, confirming our motivation behind the introduction of CMA. (Due to lack of space we do not report this plot.) Consequently, we fix this parameter and use confusion matrix adjustment for all the remaining experiments.

[Figure 5: The average F1-measure of the three techniques (PnC, DS, TQ) (a) for varying specificity threshold s (c = 8), and (b) for varying coverage threshold c (s = 0.3).]

Effect of Feature Selection: As we described in Section 4.2, we can apply an information theoretic feature selection step before training a document classifier. We ran our Probe and Count techniques with (FS=on) and without (FS=off) this feature selection step. The estimates of both the Coverage and Specificity vectors that we computed with FS=on were consistently more accurate than those for FS=off. More specifically, the ECoverage estimates with FS=on were between 15% and 20% better, while the ESpecificity estimates with FS=on were around 10% better. (Again, due to space restrictions we do not include more detailed results.) Intuitively, our information theoretic feature selection step results in robust classifiers that use fewer noisy terms for classification. Consequently, we fix this parameter and use FS=on for all the remaining experiments.

We now turn to reporting the results of the experimental comparison of Probe and Count, Document Sampling, and Title-based Querying over the 400 unseen databases in the Controlled set and the 130 databases in the Web set.

5.2 Results over the Controlled Databases

Accuracy for Different s and c Thresholds: As explained in Section 2.2, Definition 3, the ideal classification of a database depends on two parameters: s (for specificity) and c (for coverage). The values of these parameters are an editorial decision and depend on whether we decide that our classification scheme is specificity- or coverage-oriented, as discussed previously. To classify a database, both the Probe and Count and Document Sampling techniques need analogous thresholds es and ec. We ran experiments over the Controlled databases for different combinations of the s and c thresholds, which result in different ideal classifications for the databases. Intuitively, for low specificity thresholds the Ideal classification will have the databases assigned mostly to leaf nodes, while a high specificity threshold might lead to databases being classified at more general nodes. Similarly, low coverage thresholds c produce Ideal classifications where the databases are mostly assigned to the leaves, while higher values of c tend to produce classifications with the databases assigned to higher level nodes.

For Probe and Count and Document Sampling we set es = s and ec = c. Title-based Querying does not use any such threshold, but instead needs to decide how many categories K to assign a given database (Section 4.2). Although of course the value of K would be unknown to a classification technique (unlike the values for thresholds s and c), we reveal K to this technique, as discussed in Section 4.2.

Figure 5(a) shows the average value of the F1-measure for varying s and for c = 8, over the 400 unseen databases in the Controlled set. The results were similar for other values of c as well. Probe and Count performs the best for a wide range of s values. The only case in which it is outperformed by Title-based Querying is when s = 1. For this setting even very small estimation errors for Probe and Count and Document Sampling result in errors in the database classification (e.g., even if Probe and Count estimates 0.9997 specificity for one category it will not classify the database into that category, due to its low specificity). Unlike Document Sampling, Probe and Count (except for s = 1 and s = 0) and Title-based Querying have almost constant performance for different values of s. Document Sampling is consistently worse than Probe and Count, showing that sampling using random queries is inferior to using a focused, carefully chosen set of queries learned from training examples.

Figure 5(b) shows the average value of the F1-measure for varying c and for s = 0.3. The results were similar for other values of s as well. Again, Probe and Count outperforms the other alternatives and, except for low coverage thresholds (c <= 16), Title-based Querying outperforms Document Sampling. It is interesting to see that the performance of Title-based Querying improves as the coverage threshold c increases, which might indicate that this scheme is better suited for coverage-based classification schemes.

Effect of Depth of Hierarchy on Accuracy: An interesting question is whether classification performance is affected by the depth of the classification hierarchy. We tested the different methods against adjusted versions of our hierarchy of Section 4.1. Specifically, we first used our original classification scheme with three levels (level=3). Then we eliminated all the categories of the third level to create a shallower classification scheme (level=2). We repeated this process again, until our classification schemes consisted of one single node (level=0). Of course, the performance of all the methods at this point was perfect. In Figure 6 we compare the performance of the three methods for s = 0.3 and c = 8 (the trends were the same for other threshold combinations as well). Probe and Count performs better than the other techniques for different depths, with only a smooth degradation in performance for increasing depth, which suggests that our approach can scale to a large number of categories. On the other hand, Document Sampling outperforms Title-based Querying, but for both techniques the difference in performance with Probe and Count increases for hierarchies of larger depth.

Efficiency of the Classification Methods: As we discussed in Section 4.3, we compare the number of queries sent to a database during classification and the number of documents retrieved, since the other costs involved are comparable for the three methods. The Title-based Querying technique has a constant cost for each classification: it sends one query for each category in the classification scheme and retrieves 10 documents from the database. Thus, this technique sends 72 queries and retrieves 720 documents for our 72-node classification scheme.


[Figure 6: The average F1-measure for hierarchies of different depths (s = 0.3, c = 8).]

[Figure 7: The average number of interactions with the databases (a) as a function of threshold es (ec = 8), and (b) as a function of threshold ec (es = 0.3).]

    being classified. The exact number depends on how many

    times the database will be pushed down a subcategory(Figure 3). Our technique does not retrieve any documentsfrom the database. Finally, theDocument Samplingmethodsends queries to the database and retrieves four documentsfor each query until the termination condition is met. Welist in Figure 7(a) the average number of interactionsfor varying values of specificity threshold es and ec =8. Figure 7(b) shows the average number of interactionsfor varying coverage threshold ec and es = 0.3. Docu-ment Samplingis consistently the most expensive method,while Title-based Querying performs fewer interactionsthan Probe and Countfor low values for specificity thresh-old es andec, when Probe and Counttends to push downdatabases more easily, which in turn translates into more

In summary, Probe and Count is by far the most accurate method. Furthermore, its cost is lowest for most combinations of the es and ec thresholds, and only slightly higher than the cost of Title-based Querying, a significantly less accurate technique, for the remaining combinations. Finally, the Probe and Count query probes are short, consisting on average of only 1.5 words, with a maximum of four words. In contrast, the average Title-based Querying query probe consisted of 18 words, with a maximum of 348 words.

5.3 Results over the Web Databases

The experiments over the Web databases involved only


Figure 8: Average F1-measure values for different combinations of es and ec.

the Probe and Count technique. The main reason for this was the prohibitive cost of running such experiments for the Document Sampling and the Title-based Querying techniques, which would have required constructing wrappers for each of the 130 databases in the Web set. Such wrappers would have to extract all necessary document pointers from the result pages returned by the database for each query probe, so defining them involves non-trivial human effort. In contrast, the wrappers needed by the Probe and Count technique are significantly simpler, which is a major advantage of our approach. As we will discuss in Section 7, the Probe and Count wrappers only need to extract the number of matches from each results page, a task that could be automated, since the patterns used by search engines to report the number of matches for queries are quite uniform.

For the experiments over the Controlled set, the classification thresholds s and c of choice were known. In contrast, for the databases in the Web set we assume that their Ideal classification is whatever categories were chosen (manually) by the InvisibleWeb directory (Section 4.1). This classification of course does not use the s and c thresholds in Definition 3, so we cannot use these parameters as in the Controlled case. However, we assume that InvisibleWeb (and any consistent categorization effort) implicitly uses the notion of specificity and coverage thresholds in its classification decisions. Hence we try to learn such thresholds from a fraction of the databases in the Web set, use these values as the es and ec thresholds for Probe and Count, and validate the performance of our technique over the remaining databases in the Web set.

Accuracy for Different s and c Thresholds: For the Web set, the Ideal classification for each database is taken from InvisibleWeb. To find the s and c values that are implicitly used by the human experts at InvisibleWeb, we split the Web set into three disjoint sets W1, W2, and W3. We first use the union of W1 and W2 to learn these values of s and c by exhaustively exploring a number of combinations and picking the es and ec value pair that yielded the best F1-measure (Figure 8). As we can see, the best values corresponded to es = 0.3 and ec = 16, with F1 = 0.77. To validate the robustness of this conclusion, we tested the performance of Probe and Count over the third subset of the Web set, W3: for these values of es and ec the F1-measure


Training Subset   Learned s, c   F1-measure over     Test Subset   F1-measure over
                                 Training Subset                   Test Subset
W1 ∪ W2           0.3, 16        0.77                W3            0.79
W1 ∪ W3           0.3, 8         0.78                W2            0.75
W2 ∪ W3           0.3, 8         0.77                W1            0.77

Table 2: Results of three-fold cross-validation over the Web databases.


Figure 9: Average number of query probes for the Web databases as a function of es and ec.

over the unseen W3 set was 0.79, very close to the one over the training sets W1 and W2. Hence, the training to find the s and c values was successful, since the pair of thresholds that we found performs equally well for the InvisibleWeb categorization of unseen web databases. We completed a three-fold cross-validation [21] for this threshold learning by also training on W2 and W3 and testing on W1, and training on W1 and W3 and testing on W2. Table 2 summarizes the results. The results were consistent, confirming that the values of es = 0.3 and ec ≈ 8 are not overfitting the databases in our Web set.
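The exhaustive exploration mentioned above is a plain grid search over candidate threshold pairs. The sketch below shows one way to organize it; f1_over is a hypothetical helper that classifies every training database with the given thresholds and averages the F1-measure against the InvisibleWeb categories, and the candidate grids shown are illustrative rather than the exact values we swept.

    from typing import Callable, Iterable, Sequence, Tuple

    def learn_thresholds(train_dbs: Sequence[str],
                         f1_over: Callable[[Sequence[str], float, float], float],
                         es_grid: Iterable[float] = (0.1, 0.2, 0.3, 0.4, 0.5),
                         ec_grid: Iterable[float] = (1, 2, 4, 8, 16, 32, 64, 128)
                         ) -> Tuple[Tuple[float, float], float]:
        """Sketch: pick the (es, ec) pair that maximizes the average F1-measure
        over the training databases (e.g., the union of W1 and W2)."""
        best_pair, best_f1 = (0.0, 0.0), -1.0
        for es in es_grid:
            for ec in ec_grid:
                f1 = f1_over(train_dbs, es, ec)   # classify and score vs. the Ideal
                if f1 > best_f1:
                    best_pair, best_f1 = (es, ec), f1
        return best_pair, best_f1

    # The learned pair is then validated on the held-out subset (e.g., W3),
    # and the process is repeated for the other two folds (Table 2).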

Effect of Depth of Hierarchy in Accuracy: We also tested our method for hierarchical classification schemes of various depths using es = 0.3 and ec = 8. The F1-measure was 1, 0.89, 0.8, and 0.75 for hierarchies of depth zero, one, two, and three, respectively. We can see that the F1-measure drops smoothly as the hierarchy depth increases, which leads us to believe that our method can scale to even larger classification schemes without significant penalties in accuracy.

Efficiency of the Classification Method: The cost of the classification for the different combinations of the thresholds is depicted in Figure 9. As the thresholds increase, the number of queries sent decreases, as expected, since it is more difficult to push a database down a subcategory and trigger another probing phase. The cost is generally low: only a few hundred queries suffice on average to classify a database with high accuracy. Specifically, for the best setting of thresholds (s = 0.3 and c = 8), the Probe and Count method sends on average only 185 query probes to each database in the Web set. As we mentioned, the average query probe consists of only 1.5 words.

6. RELATED WORK

While work in text database classification is relatively new, there has been substantial ongoing research in text document classification. Such research includes the application of a number of learning algorithms to categorizing text documents. In addition to the RIPPER-based rule classifiers used in our work, other methods for learning classification rules from text documents have been explored [1]. Furthermore, many other formalisms for document classifiers have been the subject of previous work, including the Rocchio algorithm based on the vector space model for document retrieval [23], linear classification algorithms [17], Bayesian networks [18], and, most recently, support vector machines [13], to name just a few. Moreover, extensive comparative studies of text classifiers have also been performed [26, 7, 31], reflecting the relative strengths and weaknesses of these various methods.

Orthogonally, a large body of work has been devoted to the interaction with searchable databases, mainly in the form of metasearchers [9, 19, 30]. A metasearcher receives a query from a user, selects the best databases to which to send the query, translates the query into the proper form for each search interface, and merges the results from the different sources.

Query probing has been used in this context mainly for the problem of database selection. Specifically, Callan et al. [2] probe text databases with random queries to determine an approximation of their vocabulary and associated statistics (language model). (We adapted this technique for the task of database classification to define the Document Sampling technique of Section 4.2.) Craswell et al. [5] compared the performance of different database selection algorithms in the presence of such language models. Hawking and Thistlewaite [11] used query probing to perform database selection by ranking databases by similarity to a given query. Their algorithm assumed that the query interface can handle normal queries and query probes differently, and that the cost to handle query probes is smaller than that for normal queries. Recently, Sugiura and Etzioni [27] used query probing for query expansion to route web queries to the appropriate search engines.

Query probing has also been used for other tasks. Meng et al. [20] used guided query probing to determine sources of heterogeneity in the algorithms used to index and search locally at each text database. Query probing has been used by Perkowitz et al. [22] to automatically understand query forms and extract information from web databases to build a comparative shopping agent. In [10], query probing was employed to determine the use of different languages on the web.

For the task of database classification, Gauch et al. [8] manually construct query probes to facilitate the classification of text databases. In [12] we presented preliminary work on database classification through query probes, on which this paper builds. Wang et al. [29] presented the Title-based Querying technique that we described in Section 4.2. Our experimental evaluation showed that our technique significantly outperforms theirs, both in terms of efficiency and effectiveness. Our technique also outperforms our adaptation of the random document sampling technique in [2].

7. CONCLUSIONS AND FUTURE WORK

We have presented a novel and efficient method for the hierarchical classification of text databases on the web. After providing a more formal description of this task, we presented a scalable algorithm for the automatic classification of such databases into a topic hierarchy. This algorithm employs a number of steps, including the learning of a rule-based classifier that is used as the foundation for generating query probes, a method for adjusting the result counts


returned by the queried database, and finally a decision criterion for making classification assignments based on this adjusted count information. Our technique does not require retrieving any documents from the database. Experimental results show that the method proposed here is both more accurate and efficient than existing methods for database classification.

While in our work we focus on rule-based approaches, we note that other learning algorithms are directly applicable in our approach. A full discussion of such transformations is beyond the scope of this paper. We simply point out that the database classification scheme we present is not bound to a single learning method, and may be improved by simultaneous advances from the realm of document classification.

A further step toward completely automating the classification process is to eliminate the need for a human to construct the simple wrapper for each database to classify, by automatically learning how to parse the result pages of each database. Perkowitz et al. [22] have studied how to automatically characterize and understand web forms, and we plan to apply some of these results to automate the interaction with search interfaces. Our technique is particularly well suited for this automation, since it needs only very simple information from result pages (i.e., the number of matches for a query). Furthermore, the patterns used by the search engines and tools that are popular on the web to report the number of matches for queries are quite similar. For example, one representative pattern is the appearance of the word "of" before the actual number of matches for a query (e.g., "30 out of 1024 matches displayed"). Of the 130 web databases in the Web set, 76 use this pattern to report the number of matches, and of course there are other common patterns as well. Based on this anecdotal information, it seems realistic to envision a completely automatic classification system.
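As an indication of how little such a wrapper needs to do, the sketch below extracts the reported match count using the "of" pattern just mentioned; the regular expression and function name are ours, for illustration only, and a production wrapper would keep a short list of such per-engine patterns.

    import re

    # Illustrative pattern only: capture the number that follows "of", as in
    # "30 out of 1024 matches displayed" or "Results 1-10 of 1,024".
    _MATCH_COUNT = re.compile(r"\bof\s+([\d,]+)", re.IGNORECASE)

    def number_of_matches(results_page: str) -> int:
        """Return the number of matches reported on a results page, or 0."""
        m = _MATCH_COUNT.search(results_page)
        return int(m.group(1).replace(",", "")) if m else 0

    # Example: number_of_matches("30 out of 1024 matches displayed") == 1024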

    Acknowledgments

Panagiotis G. Ipeirotis is partially supported by Empeirikeio Foundation and he thanks the Trustees of Empeirikeio Foundation for their support. We also thank Pedro Falcao Goncalves for his contributions during the initial stages of this project. This material is based upon work supported by the National Science Foundation under Grants No. IIS-97-33880 and IIS-98-17434. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

8. REFERENCES

[1] C. Apte, F. Damerau, and S. M. Weiss. Automated learning of decision rules for text categorization. Transactions of Office Information Systems, 12(3), 1994.
[2] J. P. Callan, M. Connell, and A. Du. Automatic discovery of language models for text databases. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, pages 479–490, 1999.
[3] C. W. Cleverdon and J. Mills. The testing of index language devices. Aslib Proceedings, 15(4):106–130, 1963.
[4] W. W. Cohen. Learning trees and rules with set-valued features. In Proceedings of AAAI-96, IAAI-96, volume 1, pages 709–716. AAAI, 1996.
[5] N. Craswell, P. Bailey, and D. Hawking. Server selection on the World Wide Web. In Proceedings of the Fifth ACM Conference on Digital Libraries, pages 37–46. ACM, 2000.
[6] The Deep Web: Surfacing Hidden Value. Accessible at http://www.completeplanet.com/Tutorials/DeepWeb/index.asp.
[7] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of CIKM-98, 1998.
[8] S. Gauch, G. Wang, and M. Gomez. ProFusion*: Intelligent fusion from multiple, distributed search engines. The Journal of Universal Computer Science, 2(9):637–649, Sept. 1996.
[9] L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: Text-source discovery over the Internet. ACM Transactions on Database Systems, 24(2):229–264, June 1999.
[10] G. Grefenstette and J. Nioche. Estimation of English and non-English language use on the WWW. In RIAO 2000, 2000.
[11] D. Hawking and P. B. Thistlewaite. Methods for information server selection. ACM Transactions on Information Systems, 17(1):40–76, Jan. 1999.
[12] P. G. Ipeirotis, L. Gravano, and M. Sahami. Automatic classification of text databases through query probing. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB 2000), May 2000.
[13] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML-98, 10th European Conference on Machine Learning, pages 137–142. Springer, 1998.
[14] R. L. Johnston. Gershgorin theorems for partitioned matrices. Linear Algebra and its Applications, 4:205–220, 1971.
[15] D. Koller and M. Sahami. Toward optimal feature selection. In Machine Learning, Proceedings of the Thirteenth International Conference (ICML '96), pages 284–292, 1996.
[16] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Machine Learning, Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97), pages 170–178, 1997.
[17] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. In Proceedings of ACM SIGIR, 1996.
[18] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In Learning for Text Categorization: Papers from the 1998 AAAI Workshop, 1998.
[19] W. Meng, K.-L. Liu, C. T. Yu, X. Wang, Y. Chang, and N. Rishe. Determining text databases to search in the Internet. In VLDB '98, Proceedings of the 24th International Conference on Very Large Data Bases, pages 14–25, 1998.
[20] W. Meng, C. T. Yu, and K.-L. Liu. Detection of heterogeneities in a multiple text database environment. In Proceedings of the Fourth IFCIS International Conference on Cooperative Information Systems, pages 22–33, 1999.
[21] T. Mitchell. Machine Learning. McGraw Hill, 1997.
[22] M. Perkowitz, R. B. Doorenbos, O. Etzioni, and D. S. Weld. Learning to understand information on the Internet: An example-based approach. Journal of Intelligent Information Systems, 8(2):133–153, Mar. 1997.
[23] J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Information Retrieval System, pages 313–323. Prentice Hall, Englewood Cliffs, NJ, 1971.
[24] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24:513–523, 1988.
[25] G. Salton and M. J. McGill. The SMART and SIRE experimental retrieval systems. In K. S. Jones and P. Willett, editors, Readings in Information Retrieval, pages 381–399. Morgan Kaufmann, 1997.
[26] H. Schuetze, D. Hull, and J. Pedersen. A comparison of document representations and classifiers for the routing problem. In Proceedings of the 18th Annual ACM SIGIR Conference, pages 229–237, 1995.
[27] A. Sugiura and O. Etzioni. Query routing for web search engines: Architecture and experiments. In Proceedings of the Ninth International World-Wide Web Conference, 2000.
[28] K. van Rijsbergen. Information Retrieval (2nd edition). Butterworths, London, 1979.
[29] W. Wang, W. Meng, and C. Yu. Concept hierarchy based text database categorization in a metasearch engine environment. In Proceedings of the First International Conference on Web Information Systems Engineering (WISE 2000), June 2000.
[30] J. Xu and J. P. Callan. Effective retrieval with distributed collections. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 112–120, 1998.
[31] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual ACM SIGIR Conference, pages 42–49, 1999.
[32] G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.