a scalable comparison-shopping agent fot the www

Upload: davidmorenoazua

Post on 05-Apr-2018

223 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    1/10

    A Scalable Comparison-Shopping Agentfor the World-Wide Web

    Robert Bo Doorenbos, Oren Etzioni, and Daniel So WeldDepartment of Computer Science and Engineering

    University of WashingtonSeattle, WA 98195

    {bobd, etzioni, weld}~cs.washington.edu

    AbstractThe World- Wide- Web is less agent-friendly than wemight hope. Most information on the Web is presentedin loosely structured natural language text with noagent-readable semantics. HTML annotations struc-ture the display of Web pages, but provide virtually noinsight into their content. Thus, the designersof intel-ligent Web agents need to address he following ques-tions: (1) To what extent can an agent understandinformation published at Web sites? (2) Is the agent'sunderstanding sufficient to provide genuinely usefulassistance to users? (3) Is site-specific hand-codingnecessary,or can the agent automatically extract in-formation from unfamiliar Web sites? (4) What as-pects of the Web facilitate this competence?In this paper we investigate these issues with a casestudy using ShopBot, a fully-implemented, domain-independent comparison-shopping agent. Given thehome pages of several online stores, ShopBot au-tonomously learns how to shop at those vendors. Afterlearning, it is able to speedily visit over a dozen soft-ware and CD vendors, extract product information,and summarize the results for the user. Preliminarystudies show that ShopBot enables users to both findsuperior prices and substantially reduce Web shoppingtime.Remarkably, ShopBot achieves his performance with-out sophisticated natural language processing, and re-quires only minimal knowledge about different prod-uct domains. Instead, ShopBot relies on a combinationof heuristic search, pattern matching, and inductivelearning techniques.

    PERMISSION TO COPY WITHOUT FEE ALL OR OR PARTOF THIS MATERIAL IS GRANTED PROVIDED THAT THECOPIES ARE NOT MADE OR DISTRIBUTED FOR DI-RECT COMMERCIAL ADVANTAGE, THE ACM copy-RIGHT NOTICE AND THE TITLE OF THE PUBLICATIONAND ITS DATE APPEAR, AND NOTICE IS GIVEN THATCOPYING IS BY PERMISSION OF ACM. To COPY OTH-ERWISE, OR TO REPUBLISH, REQUIRES A FEE AND/ORSPECIFIC PERMISSION. AGENTS '97 CONFERENCEPROCEEDINGS, COPYRIGHT 1997 ACM.

    1 IntroductionIn recent years, AI researchers have created severalprototype software agents that help users with emailand netnews filtering (Maes & Kozierok 1993), Webbrowsing (Armstrong et al. 1995; Lieberman 1995),meeting scheduling (Dent et al. 1992; Mitchell et al.1994; Maes 1994), and internet-related tasks (Etzioni& Weld 1994). Increasingly, the information suchagents need to access is available on the World- WideWeb. Unfortunately, the Web is less agent-friendlythan we might hope. Although Web pages are writtenin HTML, this language only defines how informationis to be displayed, not what it means. There has beensome talk of semantic markup of Web pages, but it isdifficult to imagine a semantic markup language thatis expressive enough to cover the diversity of informa-tion on the Web, yet simple enough to be universallyadopted.Thus, the advent of the Web raises several funda-mental questions for the designers of intelligent soft-ware agents:.Ability: To what extent can intelligent agents un-derstand information published at Web sites?.Utility: Is an agent's ability great enough to pro-vide substantial added value over a sophisticatedWeb browser coupled with directories and indicessuch as Yahoo and Lycos?.Scalability: Existing agents rely on a hand-codedinterface to Internet services and Web sites (Krul-wich 1996; "Etzioni & Weld 1994; Arens et al. 1993;Perkowitz et al. 1996; Levy, Srivastava, & Kirk1995). Is it possible for an agent to approach anunfamiliar Web site and automatically extract in-formation from the site?.Environmental Constraint: What properties ofWeb sites underlie the agent's competence? Issophisticated natural language understanding nec-essary? How much domain-specific knowledge isneeded?While we cannot answer all of the above questions con-clusively in a single conference paper, we investigate

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    2/10

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    3/10

    shopping agent called ShopBot. ShopBot operates intwo phases: in the learning phase, an omine learnercreates a vendor description for each merchant; in thecomparison-shopping phase, a real-time shopper usesthese descriptions to help a person decide which storeoffers the best price for a given product.The learning phase, illustrated in Figure 1 (a), ana-lyzes online vendor sites to learn a symbolic descriptionof each site. This phase is moderately computation-ally expensive, but is performed omine, and needs tobe done only once per store.2 Table 1 summarizes theproblem tackled by the learner for each vendor. Thelearner's job is essentially to find a procedure for ex-tracting appropriate information from an online ven-dor.

    these issues by means of a case study in the domain ofelectronic commerce.This paper introduces ShopBot, a fully implementedcomparison-shopping agent.l We demonstrate the util-ity of ShopBot by comparing people's ability to findcheap prices for a suite of computer software productswith and without the ShopBot. ShopBot is able toparse product descriptions and identify several prod-uct attributes, including price and operating system,for the products. It achieves this performance with-out sophisticated natural language processing, and re-quires only minimal knowledge about different productdomains. Instead, it extracts information from onlinevendors via a combination of heuristic search, patternmatching, and inductive learning techniques -withsurprising effectiveness. Our experiments demonstratethe generality of ShopBot's architecture both within adomain (we test it on a suite of online software shops)and across domains (we test it on another domain, on-line music CD stores).The rest of this paper is organized as follows. Webegin with a brief description of the online shoppingtask in Section 2. Section 3 provides a detailed de-scription of the ShopBot prototype and the principlesupon which it is built. In Section 4 we present experi-ments that demonstrate ShopBot's usefulness and gen-erality. Finally, we discuss related work in Section 5,and conclude with a critique of ShopBot and directionsfor future work.

    Table I: The Extraction Procedure Learning Problem

    2 The Online-Shopping TaskOur long-term goal is to design, implement, and ana-lyze shopping agents that can help users with all as-pects of online shopping. The capabilities of a sophis-ticated shopping assistant would include: 1) helpingthe user decide what product to buy, e.g., by listingwhat products of a certain type are available, 2) findingspecifications and reviews of them, 3) making recom-mendations, 4) comparison shopping to find the bestprice for the desired product, 5) monitoring "What'snew" lists and other sources to discover new relevantonline information sources, 6) and watching for specialoffers and discounts.In the remainder of this paper, we discuss our fully-implemented ShopBot prototype. As a first step,we have focused on comparison shopping. Whileother shopping subtasks remain topics for future work,ShopBot is already demonstrably useful (see Section 4).ShopBot's capabilities ( and limitations) form a baselinefor future work in this area.

    The comparison-shopping phase, illustrated in Fig-ure I (b), uses the learned vendor descriptions to shopat each site and find the best price for a specific productdesired by the user. It simply executes the extractionprocedures found by the learner for a variety of vendorsand presents the user with a summary of the results.This phase executes very quickly, with network delaysdominating ShopBot computation time.The ShopBot architecture is product-independent -to shop in a new product domain, it simply needs adescription of that domain. To date, we have testedShopBot in the domains of computer software and mu-sic CDs. The domain description consists of the in-formation listed in Table I, plus some domain-specificheuristics used for filling out HTML search forms, aswe describe below. Supplying a domain descriptionis beyond the capability of the average user; in fact,it is difficult if not impossible for an expert to pro-vide the necessary information without some investi-

    3 ShopBot: A Comparison-ShoppingAgentOur initial research focus has been the design, con-struction, and evaluation of a scalable comparison- 2If a vendor "remodels" the store, providing differentsearchable ndices, or a different search result page format,then this phase must be repeated for that vendorlThe current version of ShopBot s publicly accessibleathttp://www.cs.washington.edu/research/shopbot.

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    4/10

    the combination that will yield successful comparisonshopping.

    Table 2: A vendor description.The learner's basic method is to first find a set of

    candidate forms -possibilities for the first decision.For each form Fi, it computes an estimate Ei of howsuccessful the comparison-shopping phase would be ifform Fi were chosen by the learner. To estimate this,the learner determines how to fill in the form (thisis the second decision), and then makes several "testqueries" using the form to search for several popularproducts. The results of these test queries are usedfor two things. They provide training examples fromwhich the learner induces the format of product de-scriptions in the result pages from form Fi (this is thethird decision). The results of the test queries are alsoused to compute Ei -the learner's success in findingthese popular products provides an estimate of howwell the comparison-shopping phase will do for users'desired products. Once estimates have been obtainedfor all the forms, the learner picks the form with thebest estimate, and records a vendor description com-prising this form's URL and the corresponding secondand third decisions that were made for it.In the rest of Section 3.2, we elaborate on this pro-cedure. We do not claim to have developed an optimalprocedure; indeed, the optimal one will change as ven-dor sites evolve. Consequently, our emphasis is on thearchitecture and basic techniques rather than low-Ieveldetails.Finding and Analyzing Candidate Forms. Thelearner begins by finding potential search forms. Itstarts at the vendor's home page and follows URLlinks, performing a heuristic search looking for anyHTML forms at the vendor's site. (To avoid putting anexcessive load on the site, we limit the number of pagesthe learner is allowed to fetch.) Since most vendorshave more than one HTML form, this procedure usu-ally results in multiple candidate forms. Some simpleheuristics are used to discard forms that are clearly not

    be easy to identify because they would always con-tain the product name, but this is not always the case;moreover, the product name often appears in otherplaces on the result page, not just in product descrip-tions. We also suspected that the presence of a pricewould serve as a clue to identifying product descrip-tions, but this intuition also proved false -for somevendors the product description does not contain aprice, and for others it is necessary to follow a URLlink to get the price. In fact, the format of productdescriptions varied widely and no simple rule workedrobustly across different products and different ven-dors.However, the regularities we observed above sug-gested a learning approach to the problem. We con-sidered using standard grammar inference algorithms( e.g. (Berwick & Pilato 1987; Schlimmer & Hermens1993)) to learn regular expressions that capture prod-uct descriptions, but such algorithms require large setsof labeled example product descriptions -preciselywhat our ShopBot lacks when it encounters a new ven-dor. We don't want to require a human to look at thevendor's web site and label a set of example productdescriptions for the learner. In short, standard gram-mar inference is inappropriate for our task because itis data intensive and relies on supervised learning. In-stead, we adopted an unsupervised learning algorithmthat induces what the product descriptions are, givenseveral example pages.Based on the uniformity regularity, we assume allproduct descriptions (at a given site) have the sameformat at a certain level of abstraction. The basic ideaof our algorithm is to search through a space of possibleabstract formats and pick the best one. Our algorithmtakes advantage of the vertical separation regularity togreatly reduce the size of the search space. We discuss

    this in greater detail below.Overview. The learner automatically generatesa vendor description for an unfamiliar online mer-chant. Together with the domain description, a ven-dor description contains all the knowledge required bythe comparison-shopping phase for finding products atthat vendor. Table 2 shows the information containedin a vendor description. The problem of learning sucha vendor description has three components:.Identifying an appropriate search form,.Determining how to fill in the form, and.Discerning the format of product descriptions in

    pages returned from the form.These components represent three decisions the learnermust make. The three decisions are strongly interde-pendent, of course -e.g., the learner cannot be surethat a certain search form is the appropriate one untilit knows it can fill it in and understand the result-ing pages. In essence, the ShopBot learner searchesthrough a space of possible decisions, trying to pick

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    5/10

    ular products given in the domain description. Itmatches each result page for one of these productsagainst the failure template; any page that does notmatch the template is assumed to represent a success-ful search. If the majority of the test queries are fail-ures rather than successes, the learner assumes thatthis is not the appropriate search form to use for thevendor. Otherwise, the learner records generalizedtemplates for the header and tailer of success pages,by abstracting out references to product attributes andthen finding the longest matching prefixes and suffixesof the success pages obtained from the test queries.The learner now uses the bodies of these pages fromsuccessful searches as training examples from which toinduce the format of product descriptions in the resultpages for this form. Each such page contains one ormore product descriptions, each containing informa-tion about a particular product (or version of a prod-uct) that matched the query parameters. However, asdiscussed above, extracting these product descriptionsis difficult, because their format varies widely acrossvendors.

    We use an unsupervised learning algorithm that in-duces what the product descriptions are, given thepages. Our algorithm requires only a handful of train-ing examples, because it employs a very strong biasbased on the uniformity and vertical separation reg-ularities described in Section 3.1. Based on the uni-formity regularity, we assume all product descriptionshave the same format at a certain level of abstraction.6The algorithm searches through a space of possible ab-stract formats and picks the best one. Our abstrac-tion language consists of strings of HTML tags and/orthe keyword text. The abstract form of a fragment ofHTML is obtained by removing the arguments fromHTML tags and replacing all occurrences of interven-ing free-form text with text. For example, the HTMLsource:Click

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    6/10

    through a graphical user interface (GUI) based on thedomain description. The operation of the shopper isfairly simple. Once it has received a request from theuser via the GUI, it goes in parallel to each onlinevendor's searchable index, and fills out and submitsthe forms. For each resulting page not matching thevendor's failure template, it strips off the header andtailer, and looks in the remaining HTML code for anyresults -any logical lines matching the learned prod-uct description format. It then sorts the results byascending order of price,7 and generates a summaryfor the user .

    ShopBot: Summary of Results

    SI-. from ht!D:/lwwwJnt=&- used"". ..-'onnODd gotbaa~St-. from ht!D:/lwwwovbout=m/ovb.",n""" used "' ..~rn fo~ ODd otbaa ~SI-. from ht!D:/lwwwavolan.nl." used "' ..-'orm ODd o,baa ~

    Mac:

    WlndOM:~~~';~~~'i:i NN.. V1nIW1NnnW\"-"""""""CD-ROM"""J 34"'2

    Figure 2: A snapshot of ShopBot shopping for Quicken.

    4 Empirical ResultsIn this section we consider the overall usefulness ofthe current ShopBot prototype, the ease with whichShopBot can learn new vendors in the software domain,and its degree of domain independence.

    So the algorithm breaks the body of each result pageinto logical lines representing vertical-space-delimitedtext, and then only considers abstract formats thatcorrespond to at least one of the logical lines in oneof the result pages. Thus, instead of being linear inthe size of the original hypothesis space, the learningalgorithm is linear in the amount of training data, i.e.,the number and sizes of result pages.The bodies of success pages typically contain logi-callines with a wide variety of abstract formats, onlyone of which corresponds to product descriptions. (See(Doorenbos, Etzioni, & Weld 1996) for some exam-pIes. The learner uses a heuristic ranking process tochoose which format is most likely to be the one thestore uses for product descriptions. Our current rank-ing function is the sum of the number of lines of thatformat in which some text (not just whitespace) wasfound, plus the number in which a price was found,plus the number in which one or more of the requiredattributes were found. This heuristic exploits the factthat since the test queries are for popular products,vendors tend to stock multiple versions of each prod-

    uct, leading to an abundance of product descriptionson a successful page. Different vendors have very dif-ferent product formats, but this algorithm is broadlysuccessful, as we show in Section 4.Generating the Vendor Description. TheShopBot learner repeats the procedure just describedfor each candidate form. The final step is to decidewhich form is the best one to use for comparison shop-ping. As mentioned above, this choice is based on mak-ing an estimate Ei for each form Pi of how successfulthe comparison-shopping phase would be if form Piwere chosen by the learner. The Ei used is simply thevalue of the heuristic ranking function for the winningabstract format. This function reflects both the num-ber of the popular products that were found and theamount of information present about each one. Theexact details of the heuristic ranking function do notappear to be crucial, since there is typically a largedisparity between the rankings of the "right" form andalternative "wrong" forms.Once the learner has chosen a form, it records a ven-dor description (Table 2) for future use by the ShopBotshopper described in the next section. If the learnercan't find any form that yields a successful search on amajority of the popular products, then ShopBot aban-dons this vendor.The ShopBot learner runs offline, once per merchant.Its running time is linear in the number of vendors, thenumber of forms at a vendor's site, the number of "testqueries," and the number of lines on the result pages.The learner typically takes 5-15 minutes per vendor.3.3 Real-Time Comparison ShoppingLearned vendor descriptions are used by the ShopBotshopper to do comparison shopping, as illustrated inFigure 1 (b). The shopper interacts with the user

    4.1 Evaluating ShopBot UtilityWe conjectured that there are two components respon-sible for the ShopBot's utility. First, the ShopBot shop-per acts as a repository of knowledge about the web:since the user interacts with the shopper after thelearning phase has been completed, ShopBot is able toimmediately access online vendors and search for theuser's desired product. Second, the ShopBot shopper is

    7Prices are extracted using special hand-codedtechniques.

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    7/10

    slightly lower than the users in group 1 found above,because the data in Table 4 was obtained at a laterdate.9) This demonstrates the generality of ShopBot'sarchitecture and learning algorithm within the soft-ware domain. The table also shows the variability inboth price and availability across vendors, which mo-tivates comparison shopping in the first place.4.3 Generality Across Product DomainsWe have created a new domain description that en-ables ShopBot to shop for pop/rock CD's. We chosethe CD domain, first used by the hand-crafted agentBargainFinder (Krulwich 1996), to demonstrate the ver-satility and scope of ShopBot's architecture. With oneday's work on describing the CD domain, we were ableto get ShopBot to shop successfully at four CD stores.BargainFinder currently shops successfully at three.l0So with a day's work, we were able to get ShopBot intothe same ballpark as a domain-specific hand-craftedagent.Of course, we do not claim our approach will workwith every online vendor. In fact, we know of severalvendors where it currently fails, because its learningalgorithm uses such a strong bias that it cannot cor-rectly learn their formats. Nevertheless, the fact thatit works on all ten software vendors found by an in-dependent source strongly suggests that sites where itfails are not abundant.

    both methodical and effective at actually finding prod-ucts at a given vendor; most users are too impatientto perform a manual exhaustive search. For our firstexperiment, we attempted to measure the usefulness ofthe current prototype ShopBot and to determine whichcomponent was responsible for the utility. We enlistedseven subjects who were novices at electronic shopping,but who did have experience using Netscape. We di-vided the subjects into three groups:

    1. Those who used ShopBot (3 subjects),2. Those who used Netscape's search tools and werealso given the URLs of twelve software stores used

    by ShopBot (2 subjects), and3. Those who were limited to Netscape's search tools

    (2 subjects).Two independent parties suggested popular software

    items, yielding descriptions of four products: NetscapeNavigatior and Hummingbird eXceed for Windows,and Microsoft Word and Intuit Quicken for the Mac-intosh. We asked all subjects to try to find the bestprice for these products and to report how long it tookthem. Table 3 presents the mean time and prices foreach group.It is perhaps unsurprising that ShopBot users com-pleted their task much faster than the other subjects,but there are several interesting observations to drawfrom Table 3. First, subjects limited to Netscape'ssearch methods never found a lower price than ShopBotusers. Second, although we thought the list of storeURLs might make group 2 subjects more effective thanShopBot users, the URLs actually slowed the subjectsdown. We suspect the tedium of checking stores re-peatedly caused group 2 subjects to make mistakes aswell. For example, one group 2 subject failed to finda price for eXceed ( he other found a low price on aninappropriate version). It seems clear that ShopBot'sutility is due to both its knowledge and its painstakingsearch.4.2 Acquisition of New Software VendorsTo assess he generality of the ShopBot architecture, weasked an independent person not familiar with ShopBotto find online vendors that sell popular software prod-ucts and that have a search index at their Web site.The subject found ten such vendors, and ShopBot isable to shop at all of them.8 ShopBot currently shopsat twelve software vendors: the aforementioned tenplus two more we found ourselves and used in the orig-inal design of the system. Table 4 shows the pricesit found for each of four test products chosen by inde-pendent subjects at each store. (Some of the prices are

    BOf these ten, four were sites we had studied while de-signing ShopBot, while six were new. ShopBot requires spe-cia! assistanceon one site, where the only way to get to thesearch orm is to go through an image map; since ShopBotcannot understand images,we gave t the URL of the searchform page instead of the URL of the home page.

    5 Related WorkWe can view related agent work as populating a three-dimensional space where the axes are the agent's task,the extent to which the agent tolerates unstructuredinformation, and whether its interface to external re-sources is hand-coded. In this section, we contrastShopBot with related agents along one or more of thesedimensions. ShopBot is unique in its ability to learn toextract information from the semi-structured text pub-lished by Web vendors.Much of the related agent work requires structuredinformation of the sort found in a relational database(e.g., (Levy, Srivastava, & Kirk 1995; Arens et al.1993)). The Internet Softbot (Etzioni & Weld 1994)is also able to extract information from the rigidly for-matted output of UNIX commands such as Is andInternet services such as netfind. There are agentsthat analyze unstructured Web pages, but they do so

    gOne subject in group 2 managed to find a lower pricefor Quicken than ShopBot, by going to a web site ShopBotdidn't know about.lOIn our informal experiments, BargainFinder tried toshop at eight stores. Three of them blocked out its access.It was successfulat three others. At the remaining two, ittried but failed to find products. We were able to find prod-ucts at these two stores "by hand," so BargainFinder's ail-ure may be the result of buggy or out-of-date hand-craftedcode written just for these two stores (which motivates ouruse of a learning algorithm in the first place).

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    8/10

    Navigator eXceed Word Quicken1 13:20 30.71 373.06 282.71 42.952 112:30 38.21 (not found) 282.71 41.50 I3 58:30 40.95 610.00 294.97 42.95 I

    Table 3: Subjects using the ShopBot performed the task much faster and generally found lower prices.

    Home Page URL Navigator eXceed Word Quickenhttp : / /www .internet. net/ $ 28.57 $ 282.71~43~o6http://www.cybout.com/cyberian.html 36.95 289.95 42.95http://necxdirect.necx.com/ 31.95 329.95 42.95http://www.sparco.com/ 35.00 312.00 49.00http://www.warehouse.com/ 39.95http://www.cexpress.com/ ? ? ?http://www.avalon.nf.ca/ 44.95http://www.azteq.com/ ? ?http://www.cdw.com/ 289.52http://www.insight.com/web/zdad.html 315.00http://www.applied-computer.com/twelcome.htmlhttp://www.sidea.com/

    ?

    $ 349.56 43.4759.00Table 4: Prices found by ShopBot for the four test products at twelve software stores. A space left blank indicatesthat ShopBot successfully recognized that the vendor was not selling this product; "?" indicates ShopBot foundthe product but did not determine the price; "-,, indicates that ShopBot failed to find the product even though thevendor was selling it.

    lied on hand-coded "wrappers" to parse the responsefrom each Web site into a small, ordered list of rele-vant tokens. Thus, ShopBot is solving a different learn-ing problem than ILA: instead of trying to interpreteach of a list of relevant tokens, ShopBot attemptsto identify the relevant tokens and learn the formatin which they are presented. ShopBot replaces ILA'shand-coded wrappers with an inductive learning algo-rithm biased to take advantage of the regularities inWeb store fronts.

    Along the task dimension, BargainFinder (Krul-wich 1996) is the closest agent to ShopBot. Indeed,ShopBot's task was inspired by BargainFinder's feasi-bility demonstration and popularity. However, thereare major technical differences between BargainFinderand ShopBot. Whereas BargainFinder must be hand-tailored for each store it shops at, the only informationShopBot requires about a store is its URL -ShopBotuses AI techniques to learn how to extract informationfrom the store.

    6 Summary, Critique, & Future WorkAlthough the Web is an appealing testbed for the de-signers of intelligent agents, its sheer size, lack of or-ganization, and ubiquitous use of unstructured natu-ral language make it a formidable challenge or theseagents. In this paper we presented ShopBot, a fully-implemented comparison-shoppingagent that operateson the Web with surprising effectiveness.ShopBot au-

    only in the context of the assisted browsing task (Arm-strong et at. 1995; Lieberman 1995), in which theagent attempts to identify promising links by infer-ring the user's interests from her past browsing behav-ior. Finally, there have been attempts to process semi-structured information, but again in a very differentcontext than ShopBot. For example, FAQ-Finder (Ham-mond et at. 1995) relies on the special format of FAQfiles to map natural language queries to the appropri-ate answers.In contrast with ShopBot, virtually all learning soft-ware agents ( e.g., (Maes & Kozierok 1993; Maes 1994;Dent et at. 1992; Knoblock & Levy 1995)) learn abouttheir user's interests, instead of learning about the ex-ternal resources they access. The key exception is theInternet Learning Agent, ILA (Perkowitz et at. 1996).I LA learns to understand external information sourcesby explaining their output in terms of internal cate-gories. ILA learns by querying an information sourcewith familiar objects and analyzing the relationship ofoutput tokens to the query. For example, it queriesthe University of Washington personnel directory withEtzioni and explains the output token 685-3035 ashis phone number .ShopBot borrows from ILA the idea of learning byquerying with familiar objects. However, ShopBotovercomes one of ILA's major limitations. ILA focusedexclusively on the problem of category translation andexplicitly finessed the problem of locating and extract-ing relevant information from a Web site -ILA re-

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    9/10

    to vendors it considers likely to stock the product ata good price..ShopBot relies heavily on HTML. If a vendor pro-vides information exclusively by embedding it ingraphics or using Java, ShopBot will be unable tohandle the vendor. However, future versions ofShopBot should be able to run Java applets and at-tempt to analyze their output, just as ShopBot cur-

    rently does with HTML. We acknowledge that insome cases the output will be too complex or toographical to permit analysis, but hope that the prob-lem will be lessened by the fact that vendors tend toinclude "Iow-technology" formats for the benefit ofusers on slow network links and users whose browsersare not Java-compliant. Finally, in the more dis-tant future, agents may have sufficient value to usersthat they will clamor for vendors to provide agent-friendly interfaces to their stores.All these issues need to be addressed in future re-search. We also plan additional tests of the ShopBotlearner to demonstrate scalability to more domains( e.g., books, consumer electronics, etc.). Each of thesedomains consists of products that can be conciselydescribed with a small number of attributes, so itshould be feasible to develop domain descriptions forthem. We hope to endow ShopBot with more knowl-edge about the various product domains. For example,we plan to provide ShopBot with rough price expecta-tions for different products. A $1.00 price for Encartais probably an error of some sort, not a bargain.We believe that the basic ideas behind the learn-ing algorithm of Section 3.2 are not limited to cre-ating descriptions of product catalogs. We are plan-ning to extend the algorithm to generate "wrappers"( . e. interface functions) for accessing databases whosecontents can be described with relational schemataand whose search forms can be interpreted as re-lational operations restricted with the use of bind-

    ing templates (Rajaraman, Sagiv, & Ullman 1995;Kwok & Weld 1996). For example, we are general-izing our approach to learn the contents of Web-basedYellow Pages services.More generally, we conjecture that the vendor reg-ularities that facilitate ShopBot's success are far fromunique. ShopBot is a case study suggesting that manyWeb sites are semi-structured and thus amenable toautomatic analysis via AI techniques. We anticipatethat regularities will be discovered in other classes ofWeb sites, which will enable intelligent Web agents tothrive. Although the Web is less agent-friendly thanwe might hope, it is less random than we might fear .

    tomatically learns how to shop at online vendors andhow to extract product descriptions from their Webpages. It achieves this performance without sophisti-cated natural language processing, and requires onlyminimal knowledge about different product domains.Instead, it uses heuristic search, pattern matching, andinductive learning techniques which take advantage ofregularities at vendor sites. The most important regu-larity we observed empirically is that vendors structuretheir store fronts for easy navigation and use a uniformformat for product descriptions. Hence, with a modestamount of effort ShopBot can learn to shop at a Webstore.The experiments of Section 4 demonstrate thatShopBot is a useful agent which successfully navigates avariety of stores and extracts the relevant information.The first experiment showed that ShopBot providedsignificant benefit to its users, who were able to findbetter prices in dramatically less time than subjectswithout ShopBot. The second and third experimentsshowed that ShopBot scales to multiple stores and mul-tiple product domains. It shops successfully at all tensoftware stores found by an independent person. Andalthough it was originally designed for software, a newdomain description enabled it to shop for CD's as well,with coverage comparable to that of BargainFinder, anagent custom built for this domain.While our experiments have shown that the ShopBotprototype is remarkably successful, they have also re-vealed a number of limitations. Some of these apply toShopBot as it stands now, and can probably be fixedwith fairly straightforward extensions:.ShopBot needs to do a more detailed analysis ofproduct descriptions. It does not distinguish be-tween upgrades to a product and the product itself.Because the upgrades tend to be cheaper than the

    product, they appear higher in ShopBot's sorted list..ShopBot relies on a very strong bias, which oughtto be weakened somewhat. In particular, ShopBotassumes that product descriptions reside on a singleline, and that product description lines outnumberother line types. A more sophisticated learning al-gorithm would check whether these assumptions areviolated, and if so, resort to a more subtle analysisof the vendor's product descriptions.Other concerns may impact the ShopBot's basic archi-tecture:.ShopBot is limited to stores that provide a search-able index. Some online stores, especially ones withsmaller inventories, provide no index, but use a hier-archical organization instead. ShopBot needs to beable to navigate such hierarchies-.The ShopBot shopper's performance is linear in thenumber of vendors it accesses ( except for the negli-gible cost of sorting the final results). Once an orderof magnitude more merchants populate the Web, itwill be important for ShopBot to restrict its search

    7 AcknowledgementsThanks to Nick Kushmerick and Marc Friedman forhelpful comments on drafts of this paper. Thanks alsoto Brad Chamberlain, Marc Friedman, Keith Golden,Dylan McNamee, Xiaohan Qin, Jonathan Shakes, Al-

  • 8/2/2019 A Scalable Comparison-Shopping Agent Fot the Www

    10/10

    icen Smith, Geoff Voelker, and Jeanette Yee for theirhelp with the experiments. This research was funded inpart by ARPA / Rome Labs grant F30602-95-1-0024,by Office of Naval Research Grant NOOOI4-94-1-0060,by Office of Naval Research grant 92-J-1946, by Na-tional Science Foundation Grant IRI-9303461, by Na-tional Science Foundation grant IRI-9357772, and bya gift from Rockwell International Palo Alto Research.

    Krulwich, B. 1996. The bargainfinder agent: Compar-ison price shopping on the internet. In Williams, J .,ed., Sots and Other Internet Seasties. SAMS.NET .http://bf.cstar.ac.com/bf/.Kwok, C., and Weld, D. 1996. Planning to gatherinformation. In Proc. 14th Nat. Conf on AI.Levy, A. Y., Srivastava, D., and Kirk, T. 1995.Data model and query evaluation in global informa-tion systems. Journal of Intelligent Information Sys-tems, Special Issue on Networked Information Dis-covery and Retrieval5 (2).Lieberman, H. 1995. Letizia: An agent that assistsweb browsing. In Proc. 15th Int. Joint Conf on AI,924-929.Maes, P., and Kozierok, R. 1993. Learning interfaceagents. In Proceedings of AAAI-93.Maes, P. 1994. Agents that reduce work and infor-mation overload. Comm. of the ACM 37(7):31-40,146.Mitchell, T ., Caruana, R., Freitag, D., McDermott,J., and Zabowski, D. 1994. Experience with a learningpersonal assistant. Comm. of the ACM 37(7):81-91.Perkowitz, M., Doorenbos, R., Etzioni, 0., and Weld,D. 1996. Learning to understand information on theinternet: An example-based approach. Journal of In-telligent Information Systems. (to appear).Rajaraman, A., Sagiv, Y., and Ullman, J. 1995.Answering queries using templates with binding pat-terns. In Proceedings of the ACM Symposium onPrinciples of Database Systems.Schlimmer, J., and Hermens, L. 1993. Softwareagents: Completing patterns and constructing userinterfaces. Journal of Artificial Intelligence Research61-89.

    ReferencesAgre, P., and Chapman, D. 1987. Pengi: An imple-mentation of a theory of activity. In Proc. 6th Nat.ConI on AI.Agre, P., and Horswill, I. 1992. Cultural supportfor improvisation. In Proc. lOth Nat. ConI on AI,363-368.Arens, Y., Chee, C. Y., Hsu, C.-N., and Knoblock,C. A. 1993. Retrieving and integrating data frommultiple information sources. International Journalon Intelligent and Cooperative Information Systems2(2):127-158.Armstrong, R., Freitag, D., Joachims, T., andMitchell, T. 1995. Webwatcher: A learning appren-tice for the world wide web. In Working Notes ofthe AAAI Spring Symposium: Information Gatheringfrom Heterogeneous, Distributed Environments, 6-12.Stanford University: AAAI Press. To order a copy,contact [email protected], R. C., and Pilato, S. 1987. Learning syntaxby automata induction. Machine Learning 2:9-38.Dent, L., Boticario, J., McDermott, J., Mitchell, T.,and Zabowski, D. 1992. A personal learning appren-tice. In Proc. lOth Nat. ConI on AI, 96-103.Doorenbos, R. B., Etzioni, 0., and Weld, D. S. 1996.A scalable comparison-shopping agent for the world-wide web. Technical Report 96-01-03, University ofWashington, Department of Computer Science andEngineering. Available via FTP from pub/ai/ atftp.cs.washington.edu.Etzioni, 0., and Weld, D. 1994. A softbot-basedinterface to the Internet. CACM 37(7):72-76.Hammond, K., Burke, R., Martin, C., and Lytinen,S. 1995. FAQ finder: A case-based approach toknowledge navigation. In Working Notes of the AAAISpring Symposium: Information Gathering from Het-erogeneous, Distributed Environments, 69-73. stan-ford University: AAAI Press. To order a copy, [email protected], I. 1995. Analysis of adaptation and envi-ronment. Artificial Intelligence 73(1-2):1-30.Knoblock, C., and Levy, A., eds. 1995. Working Notesof the AAAI Spring Symposium on Information Gath-ering from Heterogeneous, Distributed Environments.Stanford University: AAAI Press. To order a copy,contact [email protected].