Search in Vain: Challenges for Internet Search



Page 1: Search in vain, challenges for Internet search

January 2003 115

WEB TECHNOLOGIES

Search, originally a simple keyword-lookup functionality for files, has become a fundamental Internet service. Yet its failings are well known. Easy Web access to billions of pages of information has a downside. The pages are titled mostly according to their authors’ whims and use subtly different terminology that can fool a simple keyword search, sometimes intentionally, sometimes not.

In addition, even the best general-purpose search engines do not reach the “invisible Web” of back-end databases. Subject-specific search sites can help this situation but take time to maintain, rarely include a sophisticated interface, and seldom provide good coverage even in their topic area. Citeseer (http://citeseer.nj.nec.com/cs), a reference source for computer science research papers, is one successful exception.

KEYWORD SEARCH IS NOT ENOUGH

Maintaining good topic coverage is only one part of the technical problems that Internet search engines face. A more fundamental problem is how to specify the search.

We have all played the new game of guessing good keywords to specify what we actually want from a Web search. However, intelligent communication requires more interaction than keywords alone can support. Anybody would admit that searching a traditional library for a good introduction to traveling abroad solely by the criterion that the source has the word “travel” somewhere in its text would not be very effective. When we ask a librarian for help, the specification is typically much more sophisticated, using genres, brief content descriptions, and some indication of a publishing timeframe.

Even if we are in principle successful in the keyword specification game, we must still find the information we want somewhere in the thousands of documents matching our query. Search engines use various syntactic scoring methods to order the results, such as “page rank” to reflect some measure of a document’s importance or credibility, but they offer very little help for refining the search with stylistic specifications (“a survey”), topics (“science”), or temporal constraints (“very recent”).
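As a rough illustration of the refinements this paragraph finds missing, the sketch below narrows keyword matches by genre, topic, and publication year. The `Page` record, its fields, the `refine` function, and the sample data are all hypothetical; the article does not describe a concrete data model, and a real engine would have to infer such metadata rather than read it from a record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    # Hypothetical metadata; a real engine would have to infer these fields.
    title: str
    text: str
    genre: str
    topic: str
    year: int


def refine(pages: List[Page], keyword: str, genre: Optional[str] = None,
           topic: Optional[str] = None, min_year: Optional[int] = None) -> List[Page]:
    """Keep pages that contain the keyword and satisfy every constraint given."""
    hits = []
    for page in pages:
        if keyword.lower() not in page.text.lower():
            continue
        if genre is not None and page.genre != genre:
            continue
        if topic is not None and page.topic != topic:
            continue
        if min_year is not None and page.year < min_year:
            continue
        hits.append(page)
    return hits


PAGES = [
    Page("Travel abroad: a survey", "A survey of travel guides.", "survey", "travel", 2002),
    Page("My trip", "Notes on travel in 1995.", "diary", "travel", 1995),
    Page("Physics of flight", "Air travel and lift.", "survey", "science", 2001),
]

# "A survey", "very recent": stylistic and temporal constraints on top of the keyword.
recent_surveys = refine(PAGES, "travel", genre="survey", min_year=2000)
print([p.title for p in recent_surveys])
```

The keyword alone matches all three sample pages; the genre and year constraints do the narrowing that the keyword cannot.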

For more sophisticated searches, it is not enough to improve only the analysis of Web page topics, stylistic variations, and terminology choices. We also need automated mechanisms to extract information from query specifications that include more than just keywords.

ONTOLOGIES ARE NOT ENOUGH

The Semantic Web comprises recent work to improve the relevance of search results by integrating semantics into Web pages. This work focuses primarily on defining standards based on an ontology or a metadata model and syntax like the World Wide Web Consortium’s Resource Description Framework (http://www.w3.org/RDF/). Semantic frameworks are arguably an interesting approach for a single organization or a limited topic, especially where particular functionality exists to restrict the ontology and thus allow semiautomatic support for query processing. The approach has also proven useful for navigating subject-specific sites.

A related approach uses machine learning to extract specific information to populate a knowledge base or semantic tags attached to pages. However, neither of these approaches supports the general search with the arbitrary queries that the Web and other large multitopic document collections require.

The real challenge for Web technologies is scaling the semantic description techniques. In fact, scaling is what the Internet and Internet searches are all about. While a universal ontology would solve the scaling problem, the design of such ontologies is not a simple task, and their integration and maintenance are tedious and error prone. Considering the problems AI researchers have faced in their attempts to build such an ontology, for example, the

Search in Vain: Challenges for Internet Search
Henry Tirri, University of Helsinki

The next generation of Internet search facilities must support more complex vehicles for interaction than keywords.

Page 2: Search in vain, challenges for Internet search

NEXT-GENERATION TECHNIQUES

Large-scale information-retrieval tasks that organize pages hierarchically by topic improve both the relevance of results and the efficient use of computational resources. Returning results by topic enhances the end-user experience and also supports distributed search techniques. In restricted research settings, integrating synonyms and topics into the search system can improve result quality. But such an approach has many pitfalls, and naïve integration can in fact decrease result quality.

One promising way to avoid these problems involves a network of subject-specific nodes in a distributed, hierarchical system (http://cosco.hiit.fi/search). Each “subject node” automatically builds its own hierarchies for terminology, genres, and topics in a topic map. Such a map is only a construct, not necessarily a semantic ontology; it exists to power the search and navigation process at a subject-specific site. Eventually, such a “search network” could evolve into a peer-to-peer system with highly distributed query processing using personalized query histories.
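One way to picture such a search network is a router that forwards a query only to the subject nodes whose terminology overlaps with it. The sketch below is a toy under stated assumptions: the node names, their vocabularies, and the overlap-count routing rule are invented for illustration and are not taken from the system the article cites.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical subject nodes, each summarized by a small vocabulary.
SUBJECT_NODES: Dict[str, Set[str]] = {
    "history": {"roman", "empire", "medieval", "war"},
    "computing": {"search", "engine", "algorithm", "web"},
    "travel": {"hotel", "flight", "abroad", "visa"},
}


def route(query: str, nodes: Dict[str, Set[str]],
          max_nodes: int = 2) -> List[Tuple[str, int]]:
    """Return up to max_nodes (node, overlap) pairs, best match first."""
    terms = set(query.lower().split())
    scored = [(name, len(terms & vocab)) for name, vocab in nodes.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    # Sort by overlap (descending), then by name so ties are deterministic.
    matching.sort(key=lambda pair: (-pair[1], pair[0]))
    return matching[:max_nodes]


print(route("roman empire web sites", SUBJECT_NODES))
```

Each selected node would then answer the query from its own topic map; nodes with no overlap never see the query, which is where the savings in computational resources would come from.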

The “Next-Generation Search Example” sidebar describes how these techniques could evolve to improve search results. In addition, Search Engine Watch (http://searchenginewatch.com) is a good site for information, reviews, and links to new developments in search engines.

Henry Tirri is a professor of computer science at the University of Helsinki and a visiting professor at Stanford University. Contact him at [email protected].

116 Computer


CYC project, we will likely find ways to automatically generate enough semantic information to yield more meaningful search results long before we define a universal ontology.

GENERIC RELEVANCE IS NOT ENOUGH

Today’s Internet search engines work at a lexical level, performing string pattern matching against stored documents and augmenting the results with link analysis, frequency counting, and elementary structural analysis such as weighted title words. These engines thus provide first-level information filtering. The user, however, must still perform most of the relevance filtering.
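The lexical scoring described here can be sketched in a few lines. The example below combines frequency counting with weighted title words; link analysis is left out. The `TITLE_WEIGHT` value, the function names, and the sample documents are assumptions made for illustration, not details of any real engine.

```python
import re
from typing import Dict, List, Tuple

TITLE_WEIGHT = 3.0  # Assumed boost for title matches; not a value from the article.


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_score(query: str, title: str, body: str) -> float:
    """Sum body term frequencies plus a weighted count of title matches."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    score = 0.0
    for term in tokenize(query):
        score += body_tokens.count(term)
        score += TITLE_WEIGHT * title_tokens.count(term)
    return score


def rank(query: str, docs: Dict[str, Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Score every (title, body) document and drop those with no match."""
    scored = [(doc_id, lexical_score(query, title, body))
              for doc_id, (title, body) in docs.items()]
    scored = [item for item in scored if item[1] > 0]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


DOCS = {
    "a": ("Roman roads", "Roads built by the Roman army crossed the empire."),
    "b": ("Cooking pasta", "A Roman recipe for pasta."),
    "c": ("Gardening", "How to grow tomatoes."),
}

print(rank("roman empire", DOCS))
```

The recipe page still appears in the results because it contains the string "Roman". The score measures string overlap, not relevance, so that judgment is left to the user.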

Experience with the current Web demonstrates the problems of gaining meaningful access to a universal hypertext environment where “anything can link to anything.” Semantic Web ontologies represent a step toward semiautomated relevance filtering, but they employ a general notion of relevance that applies to the entire user population.

Relevance, on the other hand, is typically person-dependent, so personalization will become important in future search engines. Most current personalization approaches place the burden of constructing a user profile on the user. This approach is viable in some restricted domains, but it does not scale for Internet search problems. In addition, users may be vocal in demanding what they want, but they often have difficulty defining what they actually need or how they behave.

Using modeling techniques to automatically construct a profile based on user behavior is a scalable approach to personalization. It offers a value-added filtering service that requires no extra work from the user. In fact, users need not even know that an adaptive client-server mechanism exists, though other concerns such as privacy issues might be a reason for revealing it.
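A minimal sketch of behavior-based personalization follows, assuming the only signal available is the topic of each page a user has clicked. The profile format, the multiplicative boost, and the sample results are illustrative assumptions; the article does not specify a modeling technique.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_profile(clicked_topics: List[str]) -> Dict[str, float]:
    """Turn a click history into topic weights that sum to 1."""
    counts = Counter(clicked_topics)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()} if total else {}


def personalize(results: List[Tuple[str, str, float]],
                profile: Dict[str, float], strength: float = 1.0) -> List[str]:
    """Re-rank (page, topic, base_score) results by boosting preferred topics."""
    def adjusted(item: Tuple[str, str, float]) -> float:
        _, topic, base = item
        return base * (1.0 + strength * profile.get(topic, 0.0))

    ranked = sorted(results, key=adjusted, reverse=True)
    return [page for page, _, _ in ranked]


# The profile is derived from observed clicks; the user supplies nothing.
history = ["religion", "religion", "military", "religion"]
profile = build_profile(history)

results = [
    ("legions.html", "military", 0.9),
    ("temples.html", "religion", 0.8),
    ("senate.html", "politics", 0.85),
]
print(personalize(results, profile))
```

With an empty profile the generic ranking is returned unchanged, so the same engine can serve new users and established ones.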

Editor: Sumi Helal, Computer and Information Science and Engineering Dept., University of Florida, P.O. Box 116125, Gainesville, FL, 32611-6120, [email protected]

Next-Generation Search Example

What could next-generation search facilities do? Consider a general query of a history database, such as “Roman empire.” The search engine might suggest specializing the query with a topic such as “military achievements,” “politics,” or “Christianity,” or with a genre such as “discussion groups,” “introductions,” or “book reviews.” At a general level, these specifications reflect suggestions that a knowledgeable librarian might make. A more specific query might identify the topic as free text (for example, “I would like to find out about the daily life of people in the Roman empire”) and select a genre from a pulldown menu (for example, “Research”).

The search engine might then order the results under headings such as “entertainment,” “moral norms,” and “religious ceremonies.” Some pages might contain only a subsection relevant to the query. The results list would link directly to this subsection. This presentation reflects analysis of the pages in a collection at a detail level that we could not expect a librarian to master.

Of course, keywords will always remain important, so the search could also include specializations such as “keyword = Bacchus.”

Suppose a user selects a page that details the religious practices of first-century B.C.E. priestesses and also considers their portrayal in recent books and films. Then an option to “find more pages like this” could use the page’s major semantic terms to index from the full set of 10 million pages in the topic-specific “history engine” into a subset of, say, 400 related pages. The search engine would then rank these 400 using a full-similarity metric and would again break out the results by topic. The approximate full-text similarity matching would then reflect multiple aspects of the selected page’s content.
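The two-stage idea in this example, a cheap term-based cut followed by a full similarity ranking, can be sketched as follows. This is an illustration under loud assumptions: the seed page's most frequent words stand in for its "major semantic terms," cosine similarity over raw word counts stands in for the "full-similarity metric," and there is no stop-word handling. None of these choices comes from the article.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like_this(seed: str, pages: Dict[str, str],
                   n_terms: int = 3) -> List[Tuple[str, float]]:
    seed_vec = Counter(tokenize(seed))
    # Stage 1: the seed's most frequent terms select a small candidate subset.
    major = {term for term, _ in seed_vec.most_common(n_terms)}
    candidates = {pid: Counter(tokenize(text)) for pid, text in pages.items()
                  if major & set(tokenize(text))}
    # Stage 2: rank only the candidates by full-text similarity to the seed.
    scored = [(pid, cosine(seed_vec, vec)) for pid, vec in candidates.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


SEED = "priestesses temple ritual priestesses temple priestesses"
PAGES = {
    "p1": "temple ritual of the priestesses",
    "p2": "temple architecture",
    "p3": "roman army logistics",
}

ranked = more_like_this(SEED, PAGES, n_terms=2)
print([pid for pid, _ in ranked])
```

In the sidebar's terms, stage 1 plays the role of indexing from 10 million pages into about 400, and stage 2 is the more expensive ranking that only those 400 receive.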