seal-ii — the soft spot between richly structured and ...seal-ii — the soft spot between richly...

SEAL-II — The Soft Spot between Richly Structuredand Unstructured Knowledge

Andreas Hotho(Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

[email protected])

Alexander Maedche(FZI Research Center for Information Technologies

at the University of Karlsruhe, 76131 Karlsruhe, [email protected])

Steffen Staab(Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

andOntoprise GmbH, Haid-und-Neu Straße 7, 76131 Karlsruhe, Germany

[email protected])

Rudi Studer(Institute AIFB, University of Karlsruhe, 76128 Karlsruhe, Germany

andFZI Research Center for Information Technologies

at the University of Karlsruhe, 76131 Karlsruhe, Germanyand

Ontoprise GmbH, Haid-und-Neu Straße 7, 76131 Karlsruhe, [email protected])

Abstract: Recently, the idea of semantic portals on the Web or on the intranet has gained pop-ularity. Their key concern is to allow a community of users to present and share knowledge in aparticular (set of) domain(s) via semantic methods. Thus, semantic portals aim at creating high-quality access — in contrast to methods like information retrieval or document clustering thatdo not exploit any semantic background knowledge at all. However, by way of this constructionsemantic portals may easily suffer from a typical knowledge management problem. Their initialvalue is low, because only little richly structured knowledge is available. Hence the motivation ofits potential users to extend the knowledge pool is small, too.

We here present SEAL-II, a methodology for semantic portals that extends its previous version,by providing a range of ontology-based means for hitting the soft spot between unstructuredknowledge, which virtually comes for free, but which is of little use, and richly structured knowl-edge, which is expensive to gain, but of tremendous possible value. Thus, we give the portalbuilder tools and techniques in an overall framework to start the knowledge process at a semanticportal. SEAL-II takes advantage of the ontology in order to initiate the portal with knowledge,which is more usable than unstructured knowledge, but cheaper than richly structured knowledge.

Key Words: Knowledge Portal, OntologyCategory: H.0

Journal of Universal Computer Science, vol. 7, no. 7 (2001), 566-590submitted: 1/5/01, accepted: 6/6/01, appeared: 28/7/01 Springer Pub. Co.

1 Introduction

With SEAL (Staab et al., 2000; Staab & Maedche, 2001; Maedche et al., 2001) we havepresented a comprehensive architecture for a semantic portal offering a broad range oftools for improving the benefit–effort ratio of semantic portals. This technology, viz.the easy and adequate presentation and exchange of information based on ontologies inconjunction with little additional editing effort, offers itself to knowledge managementtasks like corporate history analysis (Angele et al., 2000) or skill management (Sureet al., 2000).

The life cycle of such a semantic portal spans three intertwined subcycles (cf. (Staabet al., 2001)): First, in a so-calledknowledge meta processthe domain of the applicationis modelled in an ontology. Second, in an initialknowledge instantiation phaseone triesto scrape whatever knowledge is available from legacy systems, such as database con-tents. Thus, the portal may be started with sufficient knowledge to motivate employeesto use the system. Third, in theknowledge processproper, i.e. the process undertakenby all users of the system, knowledge on the portal is used and new knowledge is con-tributed to the portal.

With SEAL-II we aim at a substantial extension of SEAL working on two importantissues:

1. Motivation: People tend to use systems on a tit-for-tat basis. They tend not toinvest work when they cannot recognize immediate benefit from it. Thus, semanticportals, or KM systems in general, that really start from scratch are easily ignored,as no one leaves the trap in this prisoners’ dilemma.

2. Grey-shades along the benefit–effort scale:Knowledge items in SEAL are forcedinto one of the two categoriesvaluableor useless. There is little balance betweennot having knowledge items at all (you could still do information retrieval) andinvesting efforts to have them in the ontology-based KM system. In addition, how-ever, one would like to offer means that allow to trade-off more finely on the scaleof benefit divided by effort.

SEAL-II tackles the soft spot between unstructured knowledge and richly structuredknowledge. First, it provides new possibilities to create knowledge when instantiatingthe semantic portal while taking full advantage of the ontology created in the knowledgemeta process. Thus, one may motivate people to actually use the system and contributeto it. Second, it accounts for the gray shades of knowledge regions that live too fast(e.g. are updated too often) to be solidified into ontology-based facts. The latter typeof facts should not only be accessible by information retrieval techniques, but also byexploratory means.

Thus, on a scale between richly and unstructured knowledge we place a set of (par-tially new) techniques between the two extremes (cf. Figure 1). At the unstructured end,one finds techniques such as information retrieval, document clustering and keyword

567Hotho A., Maedche A., Staab S., Studer R.: Seal-II ...

matches. At the other extreme, we have developed a set of techniques for providingrichly structured knowledge (Staab et al., 2000; Staab & Maedche, 2001). This paper isabout techniques for filling the void between the two extremes and about an architecturethat allows for flexible scaling on the degree of structure of information. The overall ap-proach presented here has been instantiated asFZI-Broker , a knowledge broker for theFZI knowledge management research group.

asdfef

aedsfj

aiwew

keakaf

eew

OntologyOntology

Texts,

IR,

Keyword

matching

...

Relational (Meta-)Data

Logic

Querying

...

O-based Crawling

O-based Clustering

Conceptual Open

Hypermedia

...

http://myuri/root

http://hisuri/root

pred://know

s

http://myuri/root

http://hisuri/root

pred://know

s

http://myuri/root

http://hisuri/root

pred://know

s

Figure 1: Coping with different needs: KM techniques for structured and unstructuredinformation

In the following, we will first describe the architecture of SEAL-II — modifyingslightly the original proposal. Thereafter, we sketch the initialization process of SEAL-II, i.e. the setup of the general approach into a particular application (Section 3). Sec-tion 4 outlines the underlying representations in the knowledge warehouse. We continuewith a description of the ontology-based crawler, a tool that allows to grow the knowl-edge base from a number of seed HTML pages by crawling HTML texts and metadata.Section 6 elaborates ontology-based clustering, a technique that we have developed inorder to provide subjective, concept-based views onto crawled documents. Conceptualopen hypermedia extends conventional linking by ontology-construed relations (Sec-tion 7). Before we conclude we shortly describe our instantiations of SEAL-II, the FZI-Broker and the HR-TopicBroker, and give a short survey of related work.

568 Hotho A., Maedche A., Staab S., Studer R.: Seal-II ...

2 Architecture

We here show how the SEAL architecture has been extended to tackle the “soft spotbetween unstructured knowledge and richly structured knowledge”. In this section, weelaborate on the general architecture ofSEAL-II that extends the SEAL architecturepresented in (Maedche et al., 2001) and explain the functionalities of its core compo-nents in detail in the subsequent sections. Figure 2 depicts the overall architecture thatunderliesSEAL-II. The following main components are contained in the SEAL-II ar-chitecture:

– Knowledge Warehouse:The components ofSEAL-II are built around the on-tology that is stored in the knowledge warehouse. In general, the function of theknowledge warehouse is to store a variety of different types of structured and un-structured knowledge. The major parts of the warehouse are ontologies, fact knowl-edge, and document representations. The knowledge warehouse does not distin-guish between schema and non-schema data as commonly known from typicalrelational databases. A detailed description of the different kinds of data that arestored in the knowledge warehouse is given in section 4.

– Inference Engine: The Ontobroker system (Decker et al., 1999) is a deductive,object-oriented database system. It is based on F-Logic allowing to describe ontolo-gies, rules and facts. Beside other usage, it is used as an inference engine to derivenew knowledge based on the existing fact knowledge contained in the knowledgewarehouse.

– Extractor: The extractor analyzes natural language texts in order to recognize con-cepts. In our current implementation we use a very simple processing mechanismfor recognizing concepts. The porter stemming algorithm (Porter, 1980) is used tocompute word stems from a given document. The lexicon (cf. section 4) maps wordstems to concepts. Thus, a look-up in the lexicon retrieves the concepts that are re-ferred by a specific word stem. The reader may note that we are aware that a morecomplex processing strategy based on shallow linguistic processing techniques1

may improve the overall quality of our system.

– Ontology-focused Crawler: An important component in the architecture is theontology-focused document and (meta-)data crawler. The crawler takes the ontol-ogy from the knowledge warehouse and uses it for judging relevancy of documentsin order to control the search in the web space. Thus, each document is representedby so-called concept vectors. Additionally, the crawler is able to collect relational

1 The GETESS (German Text Exploitation and Search System) project pursues the tight integra-tion between linguistic processing and ontological background knowledge (cf. (Staab et al.,1999)).


Knowledge Warehouse

Clustering

Presentation EnginePresentation EngineCrawler

Extractor

BrowserWWW / Intranet

Template Navigation Semantic

Query

Person-

alization

Inference

Engine

Semantic

Ranking

Figure 2: Architecture

metadata that have been generated by community members, e.g. by using a seman-tic annotation tool (cf. (Staab et al., 2000)). The ontology-focused crawler is furtherdiscussed in section 5.

– Ontology-based Clustering:A specific feature of the document part of SEAL-IIis the ontology-based clustering component described in section 6. Ontology-basedclustering computes structure for knowledge available in unstructured documents.It clusters similar documents into groups while taking advantage of backgroundknowledge from the ontology.

– Presentation:At the front end we use a simple web application that is based on theidea of conceptual hypermedia. The interface is automatically generated by exploit-ing facts from the knowledge warehouse (cf. section 7). Thereby, the presentationcomponent accesses the following modules available in SEAL-II:

� Template: As mentioned earlier SEAL-II allows the contribution of informa-tion by users. The template module generates an HTML form for each con-cept that a user may instantiate and relate with other concepts. In general,users may add information in the following different ways:First, they may adddocuments (refering to them by URLs) to the knowledge warehouse. Thesedocument serve as input for the relevancy search for further documents us-


ing the crawler.Second, they may define formal instances of concepts con-tained in the ontology, e.g. they may define a concrete PROJECTlike Onto-

Knowledge as fact knowledge described in a given document.Third, theymay define attributes of and relations between instances, e.g. it may be inter-esting to define the attributeURL of the instanceOntoKnowledge, namelyhttp://www.ontoknowledge.org or to define the relationPARTICIPATES

between the instancesAIFB andOntoKnowledge.

� Semantic Query & Navigation: Beside the hierarchical, tree-based hyperlinkstructure which corresponds to the hierarchical decomposition of the domain,the navigation module enables complex graph-based semantic hyperlinking,based on ontological relations between concepts in the domain. The concep-tual approach to hyperlinking is based on the assumption that semanticallyrelevant hyperlinks from a web page correspond to conceptual relations, suchasmemberOf or hasPart, or to attributes, likehasName. Thus, instances inthe knowledge base may be presented by automatically generating links to allrelated instances.

� In addition to the presentation modules described above, SEAL-II accesses thetwo componentspersonalizationandsemantic ranking that are described indetail in (Maedche et al., 2001).

3 Make Application

The previous section shows how the different ontology-based modules contained inSEAL-II interact. In this section, we want to sketch how one may configure and in-stantiate the SEAL-II architecture. The basic idea of SEAL-II is to extend basic ser-vices such as given, e.g., with ZOPE2. ZOPE is an web application server that providesbasic mechanisms for web site management including, e.g., user administration, undofunctions, database integration, user administration, HTML programming environment,plug-in API, etc. This is an extensible list, which may be extended by programmers thatpackage their modules into so-called “products”. People that instantiate the applicationserver simply select the appropriate products and add dynamic HTML pages, in par-ticular HTML layout. The ultimate goal of SEAL-II is to provide such configurationfacilities not only on the “product” level, i.e. the level of standard website processes,but also on the content level. For the latter, the configuration and selection of ontolo-gies play the crucial role that determines the structuring of contentwithin the different“products”.2 cf. http://www.zope.org


3.1 Engineer Ontology

The conceptual backbone of our SEAL-II approach is the ontology. For instantiatingSEAL-II, one has to model the concepts and relations relevant in a specific domain. AsSEAL-II has been maturing, we have developed a methodology for setting up ontology-based knowledge systems (cf. (Staab et al., 2001)). This methodology is supported to alarge extent by our ontology engineering environment ONTOEDIT (cf. Figure 3) and itssubmodule ONTOKICK.

Figure 3: OntoEdit Screenshot with SWRC ontology

ONTOEDIT provides the basic means for building concept and relation hierarchiesas well as describing inference rules. For instance, in our snapshot one may recognizethat in the Semantic Web Research Community Ontology3, which serves as the basisfor our FZI-Broker, ACADEMICSTAFF is a subclass of EMPLOYEE. ACADEMICSTAFF

has, e.g., the relationCOOPERATESWITH. Some of its relations — the ones in the greyshade — are inherited from EMPLOYEE. The submodule ONTOKICK allows to alignthe ontology specification documents with their corresponding concepts and relations.It allows to investigate domain texts in order to facilitate the discovery of concepts and3 cf. http://ontobroker.semanticweb.org/ontos/swrc.html


relations from the texts by the ontology engineer.

3.2 Make Ontology-based Applications

Essentially, we aim at promoting the shift from programming towards modeling andreuse of existing software. In particular on semantic portals many of the core processes— knowledge contribution, knowledge sharing, etc. — are nearly standard. Similarto knowledge acquisition methodologies and tools like Protege (Eriksson et al., 1994;Grosso et al., 1999), which have been used to (semi-)automatically generate fact ac-quisition interfaces from ontologies, ontologies may be used to (semi-)automaticallypresent to and acquire facts from the users of a semantic portal. In addition, one mustonly add:

– Metadata about the ontology: In order to select important concepts for differentviews one may describe meta information about ontology concepts and relations,e.g. the relative importance of a concept for presentation purposes. Alternatively,one may have several ontologies for different products or different views onto thesame ontology for varying products.

– Layout: The more that knowledge processes and configuration steps become stan-dard, the larger is the share of overall efforts dedicated to layout concerns.

Some part of this overall vision has already been realized. Though, we still have togo a considerable distance for a ready-to-go solution that incorporates the full potentialof “ontology application servers”, we may now rather quickly instantiate SEAL-II —just by engineering a new ontology and by instantiating basic parameters like seed pagesfor crawling.

4 Knowledge Warehouse

The function of the knowledge warehouse is to store a variety of different types ofstructured and unstructured knowledge. The major parts of the warehouse are:

– Ontologies: The transparent storing of ontologies and corresponding data allows tocombine both in intelligent ways such as outlined below.

– Fact knowledge: Fact knowledge is available in our warehouse like in an object-oriented database with the ontology as the corresponding schema. In addition, how-ever, we allow facts to have dual nature, as fact knowledge may also be sometimesused as ontology knowledge. This is helpful, e.g. for describing concepts on the on-tological level, but also to assign to them meta-statements that describe, e.g., theirrelative importance for presentation to the user.

– A Document representationis factual knowledge arranged in a manner suitable forretrieval or other kinds of processing. We employ three major types of documentrepresentations, as described below.


4.1 Ontology

Following Tom Gruber, we understand ontologies as a formal specification of a sharedconceptualization of a domain of interest to a group of users (cf. (Gruber, 1993)). Simi-lar to the term “information system in a wider sense” this notion of ontologies containsmany aspects, which either cannot be formalized at all or which cannot be formalizedfor practical reasons. For instance, it might be too cumbersome to model the interestsof all the persons involved. Concerning the formal part of ontologies (or “ontology inthe narrow sense”), we employ a two-part structure. The first part (cf. Definition 1) de-scribes the structural properties that virtually all ontology languages exhibit. Note thatwe do not define any syntax here, but simply refer to these structures as a least commondenominator for many logical languages, such as OIL (Fensel et al., 2001) or F-Logic(Kifer et al., 1995):

Definition 1. LetL be a logical language having a formal semantics in which inferencerules can be expressed. Anabstract ontologyis a structureO := (C;�C ; R; �;�R; IR)

consisting of

– two disjoint setsC andR whose elements are calledconceptsandrelations, resp.,

– a partial order�C onC, calledconcept hierarchyor taxonomy,

– a function�:R! C � C calledsignature,

– a partial order�R onR wherer1 �R r2 implies�(r1) �C�C �(r2) , for r1; r2 2R, calledrelation hierarchy.

– and a setIR of inference rules expressed in the logical languageL.

The functiondom:R ! C with dom(r) := �1(�(r)) gives thedomainof r, and thefunctionrange:R! C with range(r) := �2(�(r)) gives itsrange.

At the interface level we use an explicit representation of the lexical level.4 There-fore, we define a lexicon for our abstract ontologyO as follows:

Definition 2. A lexiconfor an abstract ontologyO := (C;�C ; R; �;�R; IR) is a struc-tureLex := (SC ; SR;Ref C ;Ref R) consisting of

– two setsSC andSR whose elements are calledsigns (lexical entries) for conceptsandrelations, resp.,

– and two relationsRef C � SC � C andRef R � SR � R calledlexical referenceassignments for concepts/relations, resp.

4 Our distinction of lexical entry and concept is similar to the distinction of word form andsynset used in WordNet (Fellbaum, 1998). WordNet has been conceived as a mixed linguistic/ psychological model about how people associate words with their meaning.


Based onRefC

, we define, fors 2 SC ,

Ref C(s) := fc 2 C j (s; c) 2 Ref Cg

and, forc 2 C,Ref �1

C(c) := fs 2 S j (s; c) 2 Ref Cg :

Ref R andRef �1R

are defined analogously.

The abstract ontology is made concrete through naming. Thus:

Definition 3. A (concrete)ontology(in the narrow sense) is a pair(O;Lex ) whereOis an abstract ontology andLex is a lexicon forO.

4.2 Fact knowledge

Fact knowledge may be represented in a variety of ways, such as by tuples from re-lational database tables, by F-Logic statements, or in the resource description format(RDF)5, the W3C standard for describing metadata on the WWW. For actual storage,we have the possibility to either store them directly in conventional object-relationaldatabases or in a reified format which builds a layer on top of relational databases.

4.3 Document Representation

We employ three types of document representations distinguishing between term vec-tors, concept vectors and metadata representations of documents.

4.3.1 Term vectors

Term vectors (frequently called “bag of words”) describe a document like shown inFigure 4 as a bag of document terms. I.e. one represents the terms that appear in a doc-ument together with their frequency in this document. Given the document in Figure 4,the corresponding vectort := (1; 2; 0 : : :)T means that the terms “distributed”, “organi-zation”, and “publication” appear once, twice, and zero times, respectively, in the givendocument. Thereby, as is standard, stop terms like “the” and “and” or HTML markuplike “<title>” or “<p>” are filtered out.

4.3.2 Concept vectors

Term vectors are ideally suited for standard information retrieval methods, such asknown from search engines like AltaVista. In restricted domains, however, ontologiesmay give additional power to retrieval and other tasks by employing available back-ground knowledge. For this purpose, the lexicon is used to map terms to concepts.5 http://www.w3.org/RDF/


Figure 4: Sample of crawled website

Thereby, synonyms may be resolved to refer to a common unique concept. However,word sense ambiguities may make it necessary to represent multiple options, e.g. onesthat let “school” stand for the organization vs. the building. Word sense disambigua-tion methods may be used, however, the current state-of-the-art tools are still ratherimprecise.

4.3.3 Metadata representations of documents

Instead of representing document contents, one may gather metadata thatdescribethecontents. The standard format for metadata on the Web is the resource descriptionframework (RDF). Essentially, metadata may simply be stored as fact knowledge thatis indexed by a document identifier.

5 Ontology-Focused Crawling

A crawler is a program that retrieves Web pages, commonly for use by a search engine(Pinkerton, 1994) or a Web cache. Roughly, a crawler starts off with the URL for aninitial pageP0. It retrievesP0, extracts any URLs in it, and adds them to a queue ofURLs to be scanned. Then the crawler gets URLs from the queue (in some order), andrepeats the process. Every page that is scanned is given to a client that saves the pages,creates an index for the pages, or summarizes or analyzes the content of the pages.


Our application heavily depends on the detection of relevant data on the web oran intranet. With the rapid growth of the world-wide web new challenges for general-purpose crawlers are given (cf. (Chakrabarti et al., 1999) for a comprehensive frame-work). Therefore, we have extended classical general-purpose crawlers as currentlyused in two directions:

1. We follow a focused crawling approach similar to (Chakrabarti et al., 1999). Theexplicit focus for crawling is given by the ontology.

2. We extend the document crawlers with a component that allows the crawling ofrelational metadata that may be defined in documents following a given ontology.

5.1 Ontology-focused Document Crawling

The ontology-focused document crawler builds on the general crawling mechanism de-scribed above. It extends general crawling by using ontological background knowledgeto focus the search in the web space. It takes as input a user-given setA of seed docu-ments (in the form of URLs), a core ontologyO, a maximum depth leveldmax to crawland a minimal document relevance valuermin. The resulting output of the crawler is aset of focused documentsD. The crawler downloads each document contained in the setA of start documents. Based on the results of the extraction mechanisms we computefor each document a relevancy measurer(d). In its current implementation this rele-vancy measure is equal to the overall number of concepts referenced in one document,defined as follows:

Definition 4. Let Ld := fl 2 SC jl 2 dg andCd := fc 2 C j9l 2 Ld : (l; c) 2

Refcg: The document relevance value for a documentd 2 D is given by

r(d) = jCdj: (1)

If the relevancyr(d) exceeds the user defined thresholdrmin, the specific docu-ment will be added to the set of focused documentsD 6. All hyperlinks starting from adocumentd are recursively analyzed. In addition the crawling process is restricted witha maximum depth leveldmax for a given start documentd, i.e. from a seed documentmaximallydmax recursions are followed for crawling.

5.2 Crawling relational metadata

As already mentioned above, our crawler should also extract fact knowledge in theform of relational metadata if available on web pages. Therefore, we have developed6 The reader may note that this strategy for measuring relevancy may be further refined, e.g. with

normalized counts or with the inclusion of ontological background knowledge, e.g. containedin the concept taxonomy.


the RDF Crawler7, a basic tool that gathers interconnected fragments of RDF from theWeb and builds a local knowledge base from this data. The RDF crawler builds on RDF.In general, RDF data may appear in Web documents in several ways. We distinguishbetween(i) pure RDF (files that have an extension like ”*.rdf”),(ii) RDF embedded inHTML and (iii) RDF embedded in XML. Our RDF Crawler relys on Melnik’s RDF-API8 that can deal with the different embeddings of RDF described above.

6 Ontology-based Clustering

The objective of document clustering is to present the user a high-level structured viewfor navigation through mostly unknown terrain. Thus, he may find associations betweenseemingly unrelated documents and get an intuition about the structure of the documentrepository that has not been crafted manually. Thereby, the document clustering algo-rithms work by determining (dis-)similarity of documents, e.g. based on the Euclideanor based on the cosine distance of document term vectors or document concept vec-tors. More similar documents are put into the same cluster and documents that appearin different clusters are considered to be rather dissimilar. Together, these features, i.e.unsupervised structuring of a document repository, which in our case has been crawledfrom the Web, are highly desirable to acquaint the user of the portal with the ratherunstructured document knowledge.

Though standard mechanism for text clustering are well known, they typically sufferfrom several inherent problems. First, the term/concept vectors are too large. Thereforeclustering takes place within a high-dimensional vector space leading to undesirablemathematical consequences, viz. all document pairs are similarly (dis-)similar. Thus,clustering becomes impossible and yields no recognizable results (Beyer et al., 1999).Second, it is hard for the user to understand the differences between clusters. Third,background knowledge does not influence the interpretation of structures found by theclustering algorithms. Therefore, we have developed an ontology-based clustering ap-proach that (partially) solves these problems (Hotho et al., 2001).

Document # 1 (“OTK”) 2 (“AIFB Publications”)3 (“IICM Publications”)

PUBLICATION 0 1 1KNOWLEDGEMANAGEMENT 2 2 1

DISTRIBUTED ORGANIZATION 1 0 1

Figure 5: Concept vector representations for 3 sample documents (cf. Figs. 4 and 6)

7 RDF Crawler is freely available for download at:http://ontobroker.semanticweb.org/rdfcrawler.

8 http://www-db.stanford.edu/�melnik/rdf/api.html


Figure 6: Sample web pages

In the following we give a detailed example. In Figure 5 you find a sample of (ab-breviated) concept vectors representing the web pages depicted in Figures 4 and 6. InFigure 7 you see the corresponding concepts highlighted in an excerpt of the FZI-Brokerontology (also cf. section 8 on FZI-Broker). Our simplifying example shows the prin-cipal problem of vector representations of documents: The tendency that spurious ap-pearance of concepts (or terms) rather strongly affects the clustering of documents.The reader may bear in mind that our simplification is so extensive that practically itdoes not appear in such tiny settings, but only when one works with large representa-tions and large document sets. In our simplifying example the appearance of conceptsPUBLICATION, KNOWLEDGE MANAGEMENT, and DISTRIBUTED ORGANIZATION isspread so evenly across the different documents that all document pairs exhibit (moreor less) the same similarity. Corresponding squared Euclidian distances for the exampledocument pairs (1,2), (2,3), (1,3) leads to values of 2, 2, and 2, respectively, and, hence,to no clustering structure at all.

When one reduces the size of the representation of our documents, e.g. by pro-jecting into an subspace, one focuses on particular concepts and one may focus onthe significant differences that documents exhibit with regard to these concepts. For


PERSON

THESIS

KNOWLEDGE

MANAGEMENT

ROOT

EVENT

........

TOPIC

RESEARCH

TOPIC

DISTRIBUTED

ORGANIZATION

........

PUBLICATION

JOURNAL ........

........

Figure 7: A sample ontology

instance, when we project into a document vector representation that only considersthe two dimensionsPUBLICATIONS and KNOWLEDGE MANAGEMENT, we will findthat document pairs (1,2), (2,3), (1,3) have squared Euclidean distances of 1, 1, and2. Thus, axis-parallel projections like in this example may improve the clustering situ-ation. In addition, we may exploit the ontology. For instance, we select views from thetaxonomy, choosing, e.g., RESEARCHTOPIC instead of its subconcepts KNOWLEDGE

MANAGEMENT and DISTRIBUTED ORGANIZATION. Then, the entries for KNOWL-EDGE MANAGEMENT and DISTRIBUTED ORGANIZATION are added into one vectorentry resulting in squared Euclidean distances between pairs (1,2), (2,3), (1,3) of 2, 0,and 2, respectively. Thus, documents 2 and 3 can be clustered together, while document1 falls into a different cluster.

In (Hotho et al., 2001) we have developed an algorithm ”GenerateConceptViews”as a preprocessing step for clustering. GenerateConceptViews chooses a set of axis-parallel and ontology-based projections leading to modified document representations.Conventional clustering algorithms like K-Means may work on these modified repre-sentations producing improved clustering results. Because the size of the vector rep-resentation is reduced, it becomes easier for the user to track the decisions made bythe clustering algorithms. Because there are a variety of projections, the user maychoose between views. For instance, there are projections such that publication pagesare clustered together and the rest is set aside or projections such that web pages aboutKNOWLEDGE MANAGEMENT are clustered together and the rest is left in another clus-ter. The choice of concepts from the taxonomy thus determines the output of the clus-tering result and the user may use a view like Figure 7 in order to select and understanddifferences between clustering results.

Concluding this section, we may claim that the particular merits of ontology-basedclustering are the improvement of clustering results (as can be proved by standard evalu-ation measures), the explanation of results, and the consideration of background knowl-edge. The general merits of ontology-based clustering in SEAL-II are its capabilitiesto relate the ontology with unstructured knowledge. Thus, it provides the portal with


abilities for unsupervised, automatic structuring of portal contents. Users of SEAL-IImay approach large document sets more conveniently and efficiently select views thatcorrespond to concepts of actual concern to the user.

7 Open Conceptual Hypermedia

SEAL-II relies on ontologies to structure the available information and to define ma-chine-readable metadata for the information sources at hand. By exploiting the conceptsand relations being defined in the ontology, links between different knowledge elementsthat are presented to the user may be handled as first class objects. In that way, linksare managed separately from the contents, a characteristic of Open Hypermedia Sys-tems (Osterbye & Wiil, 1996). Furthermore, the relations of the ontology provide aconceptual underpinning of these links and thus pave the way to so-called ConceptualHypermedia Systems (cf. (Nanard & Nanard, 1991)).The conceptual model that is defined by the ontology may directly be transformed intofunctionalities of the portal user interface:

– Conceptual navigation: In contrast to conventional hypermedia systems, i.e. hyper-media systems that just provide syntactic links between hypermedia documents,links within Conceptual Hypermedia Systems come with a clearly defined seman-tics, as specified by the corresponding relation of the ontology. In that way, usersmay navigate along conceptual links that relate knowledge elements to each other.Thus, semantic navigation in the knowledge space is achieved.

– Semantic querying: Based on the concepts of the ontology and the associated re-lations users may define semantic queries. i.e. queries that come with a clearly de-fined semantics. As a consequence, the portal delivers an exact answer containingthe knowledge elements the user is interested in.

A second advantage of an ontology-based approach is the exploitation of the ontol-ogy for the generation of the portal. This is a crucial aspect since the manual construc-tion of a portal is a time- and resource-consuming task. The automatic generation of theportal has to address the following aspects (cf. Figure 8):

– Navigation structure: The concepts of the ontology and their embedding into a tax-onomy provide a basic conceptual navigation structure for the portal. E.g. in Figure8 we can see on the left hand side the concept taxonomy. The user may browse inthis concept hierarchy until she has found the concept she is interested in. In ourexample, the user has selected the concept ORGANIZATION.

– Concept instances: In the middle part of the screen, the portal displays the instancesbeing available in the portal for the selected concept. In our example, we see e.g.the instance of organizationUniversitat Karlsruhe.


Figure 8: Screenshot FZI-Broker

– Relational (Meta-)data: The conceptual navigation structure is supplemented withrelations being defined in the ontology for the selected concept. I.e. the portalpresents exactly those relations that are available in the current state of the usernavigation. E.g. in Figure 8 we see all the relations being defined for the conceptORGANIZATION. By selecting one of the displayed relations the user is able tospecify instances of this relation for the selected concept instance. In our example,the user could e.g. specify a (meta-)data value for theCARRIES OUT-relation for theprojectOntoKnowledge. An obvious advantage of this approach is the fact thatthe user is only able to specify relational (meta-)data facts that are consistent withthe conceptual model, i.e. the ontology.

As can be seen from the example in Figure 8, SEAL-II supports the generation ofportals with limited complexity from a given ontology. Since the SEAL-II frameworkis domain independent, the generation of a new portal just requires the constructionof a new ontology. Obviously, one might think of more sophisticated means, e.g. oftailoring the interface to specific user profiles or of defining a subset of the ontology asthe conceptual navigation ontology. Such aspects are discussed in the conclusion.


8 Instantiations of SEAL-II

In its current version, the SEAL-II architecture has been instantiated in two applica-tions: FZI-Broker9 and HR-TopicBroker. HR-TopicBroker is a system that supports thelocation of human resource (HR) topics in relevant web pages. Additionally, it allowsthe joint definition of a knowledge base by human resource managers sharing relevantinformation (e.g. contact addresses).

Along the same lines FZI-Broker as an realization of SEAL-II is a system that isinternally used by the FZI knowledge management research group. A screenshot ofthe FZI-Broker application is depicted in Figure 8. The FZI-Broker ontology is basedon the Semantic Web Research Community (SWRC) ontology. The ontology containsinformation like ACADEMICSTAFF is a subclass of EMPLOYEE and ACADEMICSTAFF

has, e.g., the relationCOOPERATESWITH.The underlying idea of FZI-Broker is that members of the research group share a

common ontology with common interests. Thus, FZI-Broker supports the instantiationof these common interests. First, it supports a document-centric view on the ontology. Ituses the focused document crawler and the clustering mechanism to automatically offerviews on web documents. Additionally, as depicted in Figure 8 it allows the manualdefinition of fact knowledge in the form of concept and relation instantiations followingthe ontology definitions.

9 Related Work

This section positions our work, the SEAL-II approach, in the context of existing webportals and also relates our work to other basic methods and tools that are or could bedeployed for the construction of semantic community web portals.

Related Work on Portals.One of the well-established web portals is Yahoo10. In con-trast to our approach Yahoo only utilizes a very light-weight ontology that solely con-sists of categories manually arranged in a hierarchical manner. Yahoo offers keywordsearch in addition to hierarchical navigation. SEAL-II offers a much wider range oftechnology for access to documentsand facts.

The COHSE system (Carr et al., 2001) is a system integrating notions from Openand Conceptual Hypermedia Systems. COHSE eploits ontologies to generate concep-tual links between documents and to associate metadata with documents. Currently,ontologies are restricted to thesauri offering broader-term, narrower-term and related-term relations. By enriching documents with conceptual links COHSE provides extrainformation and linking for existing web pages, i.e. COHSE puts emphasis on the link-age and navigation aspects between web pages. The link generator module of COHSE9 A demo version of FZI-Broker is available at http://panther.fzi.de:2000/demofzibroker.10 http://www.yahoo.com


and the concept vector representation of SEAL-II use similar techniques for associatingterms in documents with concepts from the ontology. In contrast to COHSE, we alsoput emphasis on the querying aspects of a portal. Furthermore, we rely on a “heavy-weight” ontology being exploited by our Ontobroker inference engine (Decker et al.,1999).

The Broker’s Lounge approach (Jarke et al., 2001) provides methods and tools foruser-adaptive knowledge management putting emphasis on contextualization and con-ceptualization. Personalization is defined by interest profiles that are based on the con-cepts of the domain model and their categorization into different types. Whereas theBroker’s Lounge puts special emphasis on contextualization of knowledge, SEAL-IIsets up an overall portal framework offering a smooth integration of ontology-basedand information retrieval based techniques.

The Ontobroker system and the knowledge warehouse (Decker et al., 1999) lay thesemantical foundations for the SEAL-II approach. The approach closest to Ontobrokeris SHOE (Heflin & Hendler, 2000). In SHOE, HTML pages are annotated via ontolo-gies to support information retrieval based on semantic information. Besides the use ofontologies and the annotation of web pages the underlying techniques of both systemsdiffer significantly: SHOE offers only very limited inferencing capabilities, whereasOntobroker relies on Frame-Logic and thus supports complex inferencing for query an-swering. A more detailed comparison to other portal approaches and underlying meth-ods may be found in (Staab et al., 2000).

Related Work on Focused Crawling.The need for focused crawling in general hasrecently been realized by several researchers. The main target of all of these approachesis to focus the search of the crawler and to enable goal-directed crawling. In generala focused crawler takes a set of well-selected web pages exemplifying the user inter-est. (Chakrabarti et al., 1999) present a generic architecture of a focused crawler. Thecrawler uses a set of predefined documents associated with topics in a Yahoo like tax-onomy to build a focused crawler. Two hypertext mining algorithms build the core oftheir approach: a classifier that evaluates the relevance of a hypertext document withrespect to the focus topics, and a destiller, that identifies hypertext nodes that are accesspoints to many relevant pages within a few links. The approach presented in (Diligentiet al., 2000) uses so-called context graphs as a means to model the paths leading to rel-evant web pages. Context graphs in their sense represent link hierarchies within whichrelevant web pages occur together in the context of such pages. (Rennie & McCal-lum, 1999) propose a machine learning oriented approach for focused crawling. Theircrawler uses reinforcement learning to learn to choose the next link such that rewardover time is maximized. A problem of their approach is that the method requires largecollections of already visited web pages.

In contrast to our approach these crawlers do not include relevancy that exploitslexical & ontological information for focusing the search. Additionally, none of the


crawlers above supports the combined crawling of documents and metadata.

Related Work on Clustering.All clustering approaches based on frequencies of terms/concepts and similarities of data points suffer from the same mathematical properties ofthe underlying spaces (cf. (Beyer et al., 1999; Hinneburg et al., 2000)). These propertiesimply that even when “good” clusters with relatively small mean squared errors can bebuilt, these clusters do not exhibit significant structural information as their data pointsare not really more similar to each other than to many other data points. Therefore, wederive the high-level requirement for text clustering approaches: either they should relyon much more background knowledge (and thus can come up with new measures forsimilarity) or they should cluster in subspaces of the original high dimensional repre-sentation.

In general, existing approaches (e.g., (Agrawal et al., 1998; Hinneburg & Keim,1999)) on subspace clustering face the dual nature of “good quality”. On the one hand,there are sound statistical measures for judging quality. State-of-the-art methods usethem in order to produce “good” projections and, hence, “good” clustering results, forinstance:

– Hinneburg & Keim (Hinneburg & Keim, 1999) show how projections improve theeffectiveness and efficiency of the clustering process. Their work shows that pro-jections are important for improving the performance of clustering algorithms. Incontrast to our work, they do not focus on cluster quality with respect to the internalstructures contained in the clustering.

– The problem of clustering high-dimensional data sets has been researched byAgrawal et al. (Agrawal et al., 1998): They present a clustering algorithm calledCLIQUE that identifies dense clusters in subspaces of maximum dimensionality.Cluster descriptions are generated in the form of minimized DNF expressions.

– A straightforward preprocessing strategy may be derived from multivariate statisti-cal data analysis known under the name principal component analysis (PCA). PCAreduces the number of features by replacing a set of features by a new feature rep-resenting their combination.

– In (Schuetze & Silverstein, 1997), Schuetze and Silverstein have researched andevaluated projection techniques for efficient document clustering. They show howdifferent projection techniques significantly improve performance for clustering,that is not accompanied by a loss of cluster quality. They distinguish between localand global projection, where local projection maps each document onto a differentsubspace, and, global projection selects the relevant terms for all documents usinglatent semantic indexing.

Now, on the other hand, in real-world applications the statistically optimal projec-tion, such as used in the approaches just cited, often does not coincide with the pro-


jection most suitable for humans to solve a particular task, such as finding the rightpiece of knowledge in a large set of documents. Users typically prefer explicit back-ground knowledge that indicates the foundations on which a clustering result has beenachieved.

Hinneburg et al. (Hinneburg et al., 1999) consider this general problem as a domainspecific optimization task. Therefore, they propose to use a visual and interactive envi-ronment that involves the user to derive meaningful projections. Our approach automat-ically solves some part of the task they assign to the user environment automatically:we give the user some first means to explore the result space interactively in order toselect the projection most relevant for her particular objectives.

Finally, we want to mention an interesting proposal for feature selection made in(Devaney & Ram, 1998). Devaney and Ram describe feature selection for an unsu-pervised learning task, namely conceptual clustering. They discuss a sequential featureselection strategy based on an existing COBWEB conceptual clustering system. In theirevaluation they show that feature selection significantly improves the results of COB-WEB. The drawback that Devaney and Ram face, however, is that COBWEB is notscalable like K-Means. Hence, for practical purposes of clustering in large documentrepositories, our approach seems better suited.

10 Conclusion

We have shown in this paper how to extend our methodology for semantic portals,SEAL, into SEAL-II, such that the range of possible uses is extended in several ways:First, the initial start-up process of the portal becomes easier. Second, the overall ap-proach is made more flexible creating synergy between unstructured resources and on-tology available as background knowledge. Third, we have started to investigate thearchitecture of an overarching framework concerning the use of ontologies.

Let us briefly elaborate on the latter. In a typical application the ontology has mostoften played only a single role. We, however, see ontologies as a means to structurecontent that may be engaged in a plethora of means even within one single application.The arrival of readily configurable software modules (like “products” in the applica-tion server ZOPE) makes itpractically much more feasible to exploit the ontology forsuch broad variations as content structuring, content selection, content presentation,and content providing. We envision a situation where we will have a collection of var-ious ontologies that are clearly aligned to each other and that are used to support thesefunctionalities.

The technology that we have discussed here is very general. Obvious uses exist forknowledge management. In fact, its very initial conception was triggered by a KM ap-plication for DaimlerChrysler Human Resource Management, i.e. the HR-TopicBroker.However, many more and very different applications are probably just appearing on thehorizon right now.


Acknowledgements.

Research reported in this paper has been partially financed by BMBF in the projectGETESS (01IN901C0), by US AirForce in the DARPA DAML OntoAgents project, byEU in the IST project On-To-Knowledge (IST-1999-10132) and by DaimlerChrysler inthe project “HR-TopicBroker”. We thank our student Lars Kuehn and our colleagues,in particular Kalvis Apsitis, Nenad Stojanovic, Gerd Stumme, York Sure and RaphaelVolz for discussions and work that influenced parts of this paper, e.g. our notion of on-tologies, the RDF crawler (written by Kalvis), and the overall architecture. We thankKlaus Goetz from DaimlerChrysler for discussions that influenced our thinking on in-formation brokering.

References

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspaceclustering of high dimensional data for data mining applications. InProceedings ofthe ACM SIGMOD Int’l Conference on Management of Data, Seattle, Washington,pages 94–105. ACM Press.

Angele, J., Schnurr, H.-P., Staab, S., & Studer, R. (2000). The times they area-changin’ — the corporate history analyzer. In Mahling, D. & Reimer,U. (Eds.), Proceedings of the Third International Conference on PracticalAspects of Knowledge Management, Basel, Switzerland, pages 1–1 – 1–11.http://research.swisslife.ch/pakm2000/.

Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is ‘nearest neigh-bor’ meaningful. InProceedings of ICDT-1999, Jerusalem, Israel, pages 217–235.ACM Press.

Carr, L., Bechhofer, S., Goble, C., & Hall, W. (2001). Conceptual linking: Ontology-based open hypermedia. InWWW10 — Proceedings of the 10th InternationalWorld Wide Web Conference, Hong Kong, China, pages 334–342. ACM Press.

Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused crawling: a new ap-proach to topic-specific web resource discovery. InWWW8 - Proceedings of the8th World Wide Web Conference, Toronto, Canada, pages 545–562. Elsevier.

Decker, S., Erdmann, M., Fensel, D., & Studer, R. (1999). Ontobroker: Ontology basedaccess to distributed and semi-structured information. In Meersman, R. et al.(Eds.),Database Semantics: Semantic Issues in Multimedia Systems, Boston, USA,pages 351–369. Kluwer Academic Publisher.

Devaney, M. & Ram, A. (1998). Efficient feature selection in conceptual clustering. InProceedings 14th International Conference on Machine Learning, Nashville, TN,pages 92–97. Morgan Kaufmann.


Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C. L., & Gori, M. (2000). Focusedcrawling using context graphs. InProceedings of the 26th International Confer-ence on Very Large Databases (VLDB-00), Cairo, Egypt, pages 527–534. MorganKaufmann.

Eriksson, H., Puerta, A., & Musen, M. A. (1994). Generation of knowledge-acquisitiontools from domain ontologies.International Journal of Human Computer Stud-ies(IJHCS), 41:425–453.

Fellbaum, C. (1998).WordNet – An electronic lexical database. MIT Press, Cambridge,Massachusetts and London, England.

Fensel, D., van Harmelen, F., Horrocks, I., McGuinness, D. L., & Patel-Schneider, P. F.(2001). OIL: An ontology infrastructure for the semantic web.IEEE IntelligentSystems, 16(2):38–44.

Grosso, W. E., Eriksson, H., Fergerson, R. W., Gennari, J. H., Tu, S. W., & Musen,M. A. (1999). Knowledge modeling at the millennium (the design and evolution ofProtege-2000). InProceedings of the 12th Workshop for Knowledge Acquisition,Modeling and Management (KAW’99), Banff, Canada.

Gruber, T. (1993). A translation approach to portable ontology specifications.Knowl-edge Acquisition, 5:199–220.

Heflin, J. & Hendler, J. (2000). Searching the web with SHOE. InArtificial Intelligencefor Web Search. Papers from the AAAI Workshop. WS-00-01, Menlo Park, CA,pages 35–40. AAAI Press.

Hinneburg, A., Aggarwal, C., & Keim, D. A. (2000). What is the nearest neighbor inhigh dimensional spaces? InProceedings of the 26th International Conference onVery Large Databases (VLDB-00), Cairo, Egypt, pages 506–515. Morgan Kauf-mann.

Hinneburg, A. & Keim, D. A. (1999). Optimal grid-clustering: Towards breaking thecurse of dimensionality in high-dimensional clustering. InProceedings of the 25thInternational Conference on Very Large Databases (VLDB-99), Edinburgh, Scot-land, pages 506–517. Morgan Kaufmann.

Hinneburg, A., Wawryniuk, M., & Keim, D. A. (1999). Hd-eye: visual mining of high-dimensional data.IEEE Computer Graphics and Applications, 19(5):22–31.

Hotho, A., Maedche, A., & Staab, S. (2001). Ontology-based text clustering. InPro-ceedings of the IJCAI-2001 Workshop “Text Learning: Beyond Supervision”, Au-gust, Seattle, USA.


Jarke, M., Klemke, R., & Nick, A. (2001). Broker’s lounge — an environment formulti-dimensional user-adaptive knowledge management. InHICSS-34: 34thHawaii International Conference on System Siences, Maui, Hawaii, pages 83–83.http://fit.gmd.de/ klemke/.

Kifer, M., Lausen, G., & Wu, J. (1995). Logical foundations of object-oriented andframe-based languages.Journal of the ACM, 42:741–843.

Maedche, A., Staab, S., Stojanovic, N., & Studer, R. (2001). SEAL - A Frameworkfor Developing SEmantic portALs. InProceedings of the 18th British NationalConference on Databases, July, Oxford, UK, LNCS. Springer.

Nanard, J. & Nanard, M. (1991). Using structured types to incorporate knowledge inhypertext. InProceedings of the ACM Hypertext Conference, San Antonio, Texas,USA, pages 329–343. ACM Press.

Osterbye, K. & Wiil, U. (1996). The flag taxonomy of open hypermedia systems. InProceedings of the ACM Conference on Hypertext, Washington, DC, pages 129–139. ACM Press.

Pinkerton, B. (1994). Finding what people want: Experiences with the webcrawler.In WWW2 — Proceedings of the 2nd International World Wide Web Conference,Chicago, USA.

Porter, M. F. (1980). An algorithm for suffix stripping.Program, 14(3):130–137.

Rennie, J. & McCallum, A. (1999). Using reinforcement learning to spider the web effi-ciently. InProceedings of the 16th International Conference on Machine Learning(ICML-99), Bled, Slovenia, pages 335–343.

Schuetze, H. & Silverstein, C. (1997). Projections for efficient document clustering. InProceedings of SIGIR-1997, Philadelphia, PA, pages 74–81. Morgan Kaufmann.

Staab, S., Angele, J., Decker, S., Erdmann, M., Hotho, A., Maedche, A., Schnurr, H.-P., Studer, R., & Sure, Y. (2000). Semantic community web portals. InWWW9— Proceedings of the 9th International World Wide Web Conference, Amsterdam,The Netherlands, pages 473–491. Elsevier.

Staab, S., Braun, C., D¨usterhoft, A., Heuer, A., Klettke, M., Melzig, S., Neumann, G.,Prager, B., Pretzel, J., Schnurr, H.-P., Studer, R., Uszkoreit, H., & Wrenger, B.(1999). GETESS — searching the web exploiting german texts. InProceedings ofthe 3rd Workshop on Cooperative Information Agents, Uppsala, Sweden, LNCS,pages 113–124. Springer.

Staab, S., Schnurr, H.-P., Studer, R., & Sure, Y. (2001). Knowledge processes andontologies.IEEE Intelligent Systems, 16(1):26–34.


Staab, S. & Maedche, A. (2001). Knowledge portals — ontologies at work.AI Maga-zine, 21(2).

Sure, Y., Maedche, A., & Staab, S. (2000). Leveraging corporate skill knowledge - FromProPer to OntoProper. In Mahling, D. & Reimer, U. (Eds.),Proceedings of theThird International Conference on Practical Aspects of Knowledge Management,Basel, Switzerland, pages 22–1 – 22–9. http://research.swisslife.ch/pakm2000/.


seal-ii — the soft spot between richly structured and ...seal-ii — the soft spot between richly...

Documents