web information tracking using ontologies

Copyright © 2004 John Wiley & Sons, Ltd. Intell. Sys. Acc. Fin. Mgmt. 12, 215–225 (2004)

INTELLIGENT SYSTEMS IN ACCOUNTING, FINANCE AND MANAGEMENT
Intell. Sys. Acc. Fin. Mgmt. 12, 215–225 (2004)
Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/isaf.230


WEB INFORMATION TRACKING USING ONTOLOGIES

ALEXANDER MAEDCHE
FZI Research Center for Information Technologies, University of Karlsruhe, Germany

SUMMARY

Bringing knowledge management into practice, one typically has to focus on concrete problems that exist in the daily work of the knowledge worker. We consider the task of tracking relevant information on the Web to be important and time consuming, and thus a concrete problem. In this paper we introduce an integrated approach for Web information tracking using ontologies. The overall approach has been implemented within a case study carried out at DaimlerChrysler AG.

1. INTRODUCTION

It is well known and widely agreed that when introducing and trying to establish knowledge management one has to focus on concrete problems and provide support for knowledge-intensive tasks that appear in the daily work of the knowledge workers. We consider tracking relevant information on the Web as a typical knowledge-intensive and time-consuming task, which currently lacks any knowledge management support. At the moment this is mainly done in an ad hoc fashion via browsing the Web by looking up URLs and pursuing hyperlinks, via querying search engines, registering to mailing lists and via inspecting specialized Web portals.

In this paper we present a new approach to support Web information tracking using ontologies. The overall approach is composed of the following ingredients. First of all, we use ontologies as predefined knowledge models describing the information needs of a specific knowledge worker or a group of knowledge workers, e.g. a department within a company. Second, we identify different kinds of information available on the Web providing relevant tracking sources. Third, we provide means to connect the different forms of Web information with the ontology. Finally, interaction with the user is supported via ontology-driven Web sites.

The organization of this paper is as follows. The next section introduces our notion of ontologies as a basis for knowledge modeling using a simple example. The third section introduces the tracking framework. The fourth section discusses our case study that has been implemented on the basis of the aforementioned framework. Before we conclude we also provide a short overview on related work in the penultimate section.

2. ONTOLOGIES

The application of ontologies, i.e. conceptual models shared between autonomous agents in a specific domain, is increasingly seen as key to enabling semantics-driven information access. There are many applications of such an approach, e.g. automated information processing, information integration or knowledge management, to name just a few.

Figure 1. Example OI-Model

* Correspondence to: A. Maedche, FZI Research Center for Information Technologies, University of Karlsruhe, D-76131 Karlsruhe, Germany. E-mail: [email protected]
Contract/grant sponsor: European Commission; Contract/grant number: IST-2000-28293.
Contract/grant sponsor: DaimlerChrysler AG.

As mentioned earlier, ontologies within our tracking approach are seen as predefined knowledge models describing the information needs of a specific knowledge worker or a group of knowledge workers, e.g. a specialized department. We use the notion of so-called ontology-instance models (OI-Models) within our approach. We refer the interested reader to Motik et al. (2002), where a more detailed description, a mathematical model and a denotational semantics for OI-Models are provided. Figure 1 provides an example of an OI-Model.

An important aspect of OI-Models is that they provide flexible means of modeling concepts and instances, as well as properties between them. They consider concepts and instances on the same level, providing means for modeling meta-concepts and meta-properties. Thus, we do not make any explicit distinction between concepts and instances. Additionally, means for modularization of OI-Models are provided, e.g. in our example the overall OI-Model is constructed out of a Root node, an HR model, a document model and a topic model. Furthermore, OI-Models include lightweight inferences, like transitive and symmetric properties, as well as inverse properties.
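To make these modeling features concrete, the following is a minimal Python sketch of an OI-Model-like structure. It is an illustration under our own assumptions and not the KAON implementation: the class `OIModel`, its methods, and the entity and property names (`MetaConcept`, `worksOn`, `Alice`, etc.) are all hypothetical. Entities are plain names, so the same name can act as a concept and as an instance, and the query method applies inverse, symmetric and transitive inference.

```python
from collections import defaultdict


class OIModel:
    """Toy OI-Model: concepts and instances are both plain entity names."""

    def __init__(self):
        self.facts = defaultdict(set)   # property -> {(subject, object)}
        self.symmetric = set()          # names of symmetric properties
        self.transitive = set()         # names of transitive properties
        self.inverse = {}               # property -> its inverse property

    def set_inverse(self, prop, inv):
        self.inverse[prop] = inv
        self.inverse[inv] = prop

    def add(self, subj, prop, obj):
        self.facts[prop].add((subj, obj))

    def query(self, prop):
        """Return all (subject, object) pairs of prop, asserted or inferred."""
        pairs = set(self.facts[prop])
        inv = self.inverse.get(prop)
        if inv is not None:
            pairs |= {(o, s) for s, o in self.facts[inv]}
        if prop in self.symmetric:
            pairs |= {(o, s) for s, o in pairs}
        if prop in self.transitive:
            # Naive fixpoint computation of the transitive closure.
            while True:
                new = {(a, d) for a, b in pairs for c, d in pairs if b == c}
                if new <= pairs:
                    break
                pairs |= new
        return pairs


model = OIModel()
model.transitive.add("subTopicOf")
model.set_inverse("worksOn", "workedOnBy")
# "Topic" is used as a concept and, at the same time, as an instance.
model.add("E-Learning", "instanceOf", "Topic")
model.add("Topic", "instanceOf", "MetaConcept")
model.add("E-Learning", "subTopicOf", "Training")
model.add("Training", "subTopicOf", "HR Strategy")
model.add("Alice", "worksOn", "E-Learning")

print(model.query("subTopicOf"))   # includes ("E-Learning", "HR Strategy")
print(model.query("workedOnBy"))   # {("E-Learning", "Alice")}
```

The closure is computed on demand for simplicity; a real system would more likely materialize or index inferred facts.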

Furthermore, a lexical layer may be associated with each entity of an OI-Model. This is an important aspect with respect to the tracking functionality described within this paper. Figure 2 depicts a screenshot of an OI-Model within the KAON OI-Modeler¹ including the lexical layer definition for the concept ‘Topic’.

1 http://kaon.semanticweb.org

Figure 2. Screenshot of the OI-Modeler

The reader may note that an m:n relation between the lexical layer of an ontology and its entities has to be described, i.e. one entity may have several lexical entries, and one lexical entry may refer to several entities. We also allow one to distinguish between different types of so-called lexical entry, e.g. labels, stems, etc. This aspect is quite important when using natural language-processing facilities within the tracking process.
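A lexical layer with an m:n relation between lexical entries and entities can be sketched as two lookup tables, one per direction. The Python class below is a hypothetical illustration (the name `Lexicon`, its methods and the sample entries are our assumptions, not part of KAON); typed entries such as labels and stems are stored as `(type, text)` pairs.

```python
from collections import defaultdict


class Lexicon:
    """Toy lexical layer: many entries per entity, many entities per entry."""

    def __init__(self):
        self.by_entity = defaultdict(set)   # entity -> {(entry type, text)}
        self.by_text = defaultdict(set)     # lowercased text -> {entity}

    def add(self, entity, entry_type, text):
        self.by_entity[entity].add((entry_type, text))
        self.by_text[text.lower()].add(entity)

    def entities_for(self, text):
        """All entities that a given lexical entry may refer to."""
        return set(self.by_text.get(text.lower(), ()))

    def entries_for(self, entity, entry_type=None):
        """All lexical entries of an entity, optionally filtered by type."""
        return {text for kind, text in self.by_entity.get(entity, ())
                if entry_type in (None, kind)}


lexicon = Lexicon()
lexicon.add("Topic", "label", "Topic")
lexicon.add("Topic", "label", "Subject")
lexicon.add("Topic", "stem", "topic")
lexicon.add("DocumentSubject", "label", "Subject")   # same entry, other entity

print(lexicon.entities_for("subject"))       # both Topic and DocumentSubject
print(lexicon.entries_for("Topic", "label"))
```

Keeping both directions indexed makes it cheap to go from text found on the Web to candidate entities, and from an entity to the terms used to search for it.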

3. A FRAMEWORK FOR WEB INFORMATION TRACKING

Knowledge workers typically deal with different kinds of information that are available on the Web. Within our approach we distinguish between the following two different information classes:

Core Information. The first we consider is the core information that is provided by the World Wide Web. Roughly speaking, this core information is restricted to documents and hyperlinks between these documents. This information is typically accessed directly via browsing the Web.

Information Containers. Within the second class we consider so-called information ‘containers’ that already contain preprocessed information (manually by humans or automatically by machines) based on the core information introduced above. Within this second class of information we distinguish between the following information containers:

• Free Text information containers, e.g. document search engines (like Google, Altavista), classification directories (Yahoo!), etc.

• Semi-structured information containers (e.g. Web portals that provide news, etc.)

• Structured information containers (e.g. Web portals that serve as interfaces to structured and relational databases, etc.)



Based on this classification, we introduce the overall framework for Web information tracking, where ontologies play a central role. It is important to emphasize that, in general, connecting ontologies with existing Web information is a challenging task. Both core information and the preprocessed information provided by information containers, as introduced above, require specific means that allow for flexible access and connection to ontologies.

3.1. Web Information Tracking Framework

Recently, the vision of having a Semantic Web in which information is given a machine-processable and -interpretable meaning has been coined. This vision is centrally based on ontologies and information that is described via ontology-based metadata. Unfortunately, there is not much metadata available on the current Web, so that connecting ontologies with existing Web information is not so straightforward. In this paper we present a pragmatic approach to connect ontologies with available Web information. Within this approach the following points are of importance:

• Means for discovering ontology-relevant information from the Web are required. In general, one has to keep in mind that one should reuse existing information providers such as search engines, news providers, Web portals, etc. as much as possible. In this sense, the ontology only provides a meta-index on top of the information containers.

• Information sources have to be indexed and integrated on the basis of the ontology. It is important to emphasize that within our approach we aim to reduce human intervention as much as possible. There is an obvious trade-off between the quality of the connection between ontologies and Web information and the time that is invested in developing it.

• Based on indexed and integrated information one should provide value-adding analysis methods that allow one to discover implicit relationships and patterns.

• Means for browsing and querying have to be provided to the end user. Both a document-driven view and a resource-centered view have to be provided to the user.

Figure 3 depicts the overall picture of our framework. In this architecture, the OI-Model forms the central backbone. It serves as input to the focused crawler, it can be connected to Web Services, it indexes documents, etc. Furthermore, it allows one to integrate and store materialized instances from arbitrary information containers. Finally, it supports presentation in the form of browsing and querying.

3.2. Connecting Web Information with Ontologies

In the following we list and describe several connectors to be used for core information on the Web, as well as for free text, semi-structured and structured information containers. The overall goal is to provide a tight connection between existing information and the OI-Model. In the optimal case this means that instances and properties between instances are extracted from the existing information sources according to the predefined OI-Model.

Figure 3. Web information tracking architecture

Focused Crawler. We use a focused crawler for discovering relevant core information on the Web (Maedche et al., 2002a). The crawler simulates human Web browsing behavior by measuring the relevancy of a given Web page according to the defined ontology. If a Web page is considered relevant, then its outgoing links are pursued. Thus, a focused search for relevant information on the Web is carried out. The ontology serves as a reference model for measuring the relevancy of information.
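The crawling strategy just described (score a page against the ontology, and follow its links only if it is relevant) can be sketched as follows. This is a simplified illustration, not the crawler of Maedche et al. (2002a): the relevance measure (fraction of ontology terms occurring in the page), the threshold, and the in-memory `pages` dictionary standing in for the Web are all assumptions made for the example.

```python
from collections import deque


def relevance(text, terms):
    """Fraction of (lowercase) ontology terms that occur as words in the text."""
    words = set(text.lower().split())
    return sum(1 for term in terms if term in words) / len(terms)


def focused_crawl(pages, seeds, terms, threshold=0.3):
    """pages: url -> (text, outgoing urls). Returns relevant urls in visit order."""
    queue, seen, relevant = deque(seeds), set(seeds), []
    while queue:
        url = queue.popleft()
        text, links = pages.get(url, ("", []))
        if relevance(text, terms) >= threshold:
            relevant.append(url)
            # Only the links of relevant pages are pursued.
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return relevant


# A tiny hypothetical link graph standing in for the Web.
pages = {
    "a": ("e-learning strategy for hr training", ["b", "c"]),
    "b": ("football results", ["d"]),
    "c": ("corporate training and e-learning trends", ["d"]),
    "d": ("e-learning platforms", []),
}
terms = ["e-learning", "training", "strategy"]

visited = focused_crawl(pages, ["a"], terms)
print(visited)   # ['a', 'c', 'd']
```

Page "b" is fetched but judged irrelevant, so its links are not followed; page "d" is still reached through the relevant page "c".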

Web Service Interfaces. For information containers containing already preprocessed information we use the different Web Service interfaces that have recently been made available:

• Google Web Service Interface. This Web Service provides a simple service allowing one to query Google with a set of key words and retrieve the top ten Web pages for these key words. We use the lexicon of the OI-Model to instantiate this Web Service.

• RSS (RDF Site Summary) News Syndication Format. This Web Service may be considered as a content service. RSS is a simple metadata-driven news description scheme.² It provides means to describe news channels and to publish news items within these channels.

In general, one may adapt to arbitrary Web Service interfaces that may be provided by Web portals on top of their underlying databases. Clearly, this is a question of the underlying property rights and the associated business model of the Web portal provider.
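As an illustration of how an RSS 1.0 channel could be read before its items are connected to the ontology, the following sketch parses a feed with Python's standard `xml.etree.ElementTree` module. The function name `parse_rss10`, the sample feed and its `example.org` URLs are hypothetical; only the two namespace URIs are taken from the RDF and RSS 1.0 specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces used by RSS 1.0 (RDF Site Summary) documents.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
}


def parse_rss10(xml_text):
    """Return the news items of an RSS 1.0 document as a list of dicts."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.findall("rss:item", NS):
        items.append({
            "title": item.findtext("rss:title", default="", namespaces=NS),
            "link": item.findtext("rss:link", default="", namespaces=NS),
        })
    return items


# Hypothetical feed content used only for this example.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/news">
    <title>Example HR News</title>
    <link>http://example.org/news</link>
    <description>Hypothetical channel</description>
  </channel>
  <item rdf:about="http://example.org/news/1">
    <title>E-Learning trends</title>
    <link>http://example.org/news/1</link>
  </item>
</rdf:RDF>"""

news = parse_rss10(SAMPLE)
print(news)
```

Each resulting dictionary already carries basic metadata (title and link), which is what makes news items easier to represent as ontology instances than plain documents.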

Wrappers. The old-fashioned and well-known way to connect information containers with ontologies is to use wrappers. Wrappers allow one to translate the data from the underlying data model of the associated information container into the common data model, namely the ontology in our tracking approach. Within our approach, we distinguish between two different wrappers:

• Database Wrapper. The database wrapper REVERSE³ (Stojanovic et al., 2002) allows one to lift an arbitrary relational database onto an ontology. It pursues a materialized integration strategy, and thus the content of the database is copied into the ontology.

• Web page Wrapper. Recently, several Web page wrappers have been developed, e.g. the WysiWyg Web Wrapper Factory W4F.⁴ Typically, these tools allow one to define extraction rules to extract content from Web pages.

2 http://www.purl.org/rss/1.0/
3 http://kaon.semanticweb.org/REVERSE
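The materialized integration strategy of a database wrapper can be illustrated with a small sketch that copies every row of a relational table into instance statements. This is not the REVERSE algorithm: the function `lift_table`, the triple representation and the sample `employee` table are assumptions made for the example, which uses Python's standard `sqlite3` module and an in-memory database.

```python
import sqlite3


def lift_table(conn, table, concept, key):
    """Copy each row of `table` into (subject, property, object) triples.

    The table name is interpolated directly into the SQL text, so it must
    come from a trusted mapping definition, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"{concept}/{record[key]}"
        triples.append((subject, "instanceOf", concept))
        for column, value in record.items():
            if column != key and value is not None:
                triples.append((subject, column, value))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("INSERT INTO employee VALUES (1, 'Alice', 'HR')")

triples = lift_table(conn, "employee", "Employee", "id")
print(triples)
```

Because the content is copied rather than queried on demand, the ontology can be browsed and analyzed without access to the source database, at the cost of having to refresh the copy when the database changes.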

3.3. Ontology-based Indexing, Integration and Analysis

The indexing and integration layer works on top of the connected information sources. This layer has to deal with different kinds of information:

• documents (discovered by the focused crawler, or retrieved by the Google Web Service)

• news (already includes some basic metadata that may be represented in the form of instances of ontologies)

• instances of the ontology (e.g. as provided from a database or Web page wrapper or obtained by calling a specialized Web Service)

Indexing of textual data (documents, news summarizations, etc.) is done on the basis of the lexical layer of our OI-Models in combination with shallow text-processing techniques. Indexing in this sense can be seen as automatically adding ‘subjects’ to the documents, where subjects are taken from the concept and instance entities contained in the OI-Models. Thus, we build a materialized index associating the objects with the textual data. Furthermore, we create instances of the ontology. The instances represent materialized objects obtained by the different connectors, e.g. the database and the HTML wrapper. A detailed description of the overall integration approach is provided by Maedche et al. (2002b). In the reference implementation and case study section we will see that users may also want to define instances manually, e.g. users may add information to the OI-Model and therefore support knowledge sharing.
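A minimal sketch of such lexicon-based indexing is given below. It stands in for the shallow text processing with a deliberately crude normalization (lowercasing and stripping a plural "s"), handles single-word lexical entries only, and uses hypothetical names (`normalize`, `build_index`) and sample data; it is not the indexing component of the reference implementation.

```python
import re
from collections import defaultdict


def normalize(word):
    """Very crude stand-in for stemming: lowercase and strip a plural 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def build_index(documents, lexicon):
    """documents: id -> text; lexicon: lexical entry -> set of entities.

    Returns a materialized index: entity -> set of document ids.
    """
    stems = defaultdict(set)
    for entry, entities in lexicon.items():
        stems[normalize(entry)] |= set(entities)
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z][a-z-]*", text.lower()):
            for entity in stems.get(normalize(word), ()):
                index[entity].add(doc_id)
    return index


documents = {
    "d1": "New e-learning courses for managers",
    "d2": "Recruiting trends 2004",
    "d3": "Manager training and course design",
}
lexicon = {
    "E-Learning": {"E-Learning"},
    "course": {"Training"},
    "training": {"Training"},
    "manager": {"Manager"},
}

index = build_index(documents, lexicon)
print(dict(index))
```

Here the entity `Training` is attached as a subject to both `d1` and `d3`, even though `d1` only mentions "courses", which is the effect of matching through the lexical layer rather than on entity names.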

Additionally, the indexing and integration layer contains several value-adding analysis components that work on top of the index and the materialized instances. Three examples of analysis components that we use are the following:

• Ontology-based Document Clustering. A classical unsupervised text-mining task is document clustering. Hotho et al. (2001) have described how ontologies can be used to approach the matters of subjectivity and explainability within clustering.

• Concept-based Association Rules. Association rules have been established in the area of data mining as a means of finding interesting association relationships among a large set of data items. The algorithm described by Maedche and Staab (2000) learns generalized association rules between concepts.

• Instance Clustering. The instance clustering approach described by Maedche and Zacharias (2002) takes instances and instance relations as higher level input for clustering objects. It provides means for similarity-based clustering of objects according to their semantic characteristics.
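To illustrate the idea behind generalized association rules between concepts, the sketch below extends each transaction (e.g. the set of concepts indexed for one document) with the ancestors of its concepts in a taxonomy and then computes support and confidence for concept pairs. It is a simplified illustration, not the algorithm of Maedche and Staab (2000); the function names, thresholds and sample data are assumptions.

```python
from itertools import permutations


def ancestors(concept, parent):
    """All super-concepts of `concept` in a single-parent taxonomy."""
    found = set()
    while concept in parent:
        concept = parent[concept]
        found.add(concept)
    return found


def association_rules(transactions, parent, min_support=0.5, min_conf=0.6):
    """Return rules (x, y, support, confidence) meaning 'x implies y'."""
    # Generalize: add every ancestor of every concept in a transaction.
    extended = [set(t) | {a for c in t for a in ancestors(c, parent)}
                for t in transactions]
    n = len(extended)
    items = set().union(*extended)
    rules = []
    for x, y in permutations(sorted(items), 2):
        if y in ancestors(x, parent) or x in ancestors(y, parent):
            continue  # skip rules that merely restate the taxonomy
        both = sum(1 for t in extended if x in t and y in t)
        has_x = sum(1 for t in extended if x in t)
        if has_x and both / n >= min_support and both / has_x >= min_conf:
            rules.append((x, y, both / n, both / has_x))
    return rules


parent = {"E-Learning": "Training", "Coaching": "Training"}
transactions = [
    {"E-Learning", "Volkswagen"},
    {"Coaching", "Volkswagen"},
    {"E-Learning"},
    {"Volkswagen"},
]

rules = association_rules(transactions, parent)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}  support={support:.2f}  confidence={confidence:.2f}")
```

In this toy data neither `E-Learning` nor `Coaching` co-occurs with `Volkswagen` often enough on its own, but their common super-concept `Training` does, which is the kind of relationship that only becomes visible at a more general level of the ontology.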

3.4. Presentation

This section is about how the end user may interact with the discovered and ontology-enhanced Web information. Typically, one can distinguish between the following usage scenarios. First, the user may just browse for relevant information along the ontology and its relations. Results obtained by applying an analysis method as described earlier may also be inspected on the basis of browsing. Second, users may also define concrete queries ranging from key-word-based queries to SQL-like queries. Furthermore, they may be interested in executing a specific analysis method with a selected set of integrated information sources (Figure 4).

Figure 4. KAON Portal

4 http://db.cis.upenn.edu/DL/WWW8/

Browsing. Browsing allows knowledge workers to explore the information provided by the integrated information and information containers. Browsing, however, in our framework is not just clicking on hyperlinks.

We consider browsing in the context of the Semantic Web vision, and thus have a resource-centered view onto the data. Resources in this sense can be everything, e.g. documents, people, topics, projects, ideas. Users explore these resources by following the properties that exist between the different resources.

Querying and Execution. Queries may range from simple Boolean key-word queries, through queries by example, to complex queries. Simple key-word-based querying is realized by mapping the key-word string onto the OI-Model as described earlier. All entities of the OI-Model that are referred to by the key-word string are retrieved. By selecting one of the retrieved entities, the user switches to the browsing mode and explores associated entities by analyzing the associated properties.
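The two steps of key-word-based querying (retrieve entities for a key-word string, then switch to browsing the properties of a selected entity) might look as follows in a minimal sketch. The functions `search` and `neighbours`, the dictionary-based lexicon and the sample statements are hypothetical and only serve to illustrate the interaction.

```python
def search(query, lexicon):
    """Return all entities referred to by any key word of the query string."""
    hits = set()
    for word in query.lower().split():
        hits |= lexicon.get(word, set())
    return hits


def neighbours(entity, triples):
    """Return (property, other entity) pairs reachable from `entity` in one step."""
    outgoing = [(p, o) for s, p, o in triples if s == entity]
    incoming = [(f"inverse of {p}", s) for s, p, o in triples if o == entity]
    return outgoing + incoming


# Hypothetical lexicon (lowercase entry -> entities) and statements.
lexicon = {
    "e-learning": {"E-Learning"},
    "elearning": {"E-Learning"},
    "training": {"Training"},
}
triples = [
    ("E-Learning", "subTopicOf", "Training"),
    ("Institute X", "contactFor", "E-Learning"),
    ("doc42", "hasSubject", "E-Learning"),
]

hits = search("elearning", lexicon)
print(hits)                                  # {'E-Learning'}
print(neighbours("E-Learning", triples))
```

Following incoming as well as outgoing properties gives the resource-centered view: from the topic one reaches its super-topic, a contact and an indexed document in a single step.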

The next layer of query complexity is supporting queries by example. Query by example (QBE) is a well-known technique to reduce the complexity of defining a query. QBE interfaces are derived directly from the ontology, its entities and associated properties. A detailed explanation of how this is done is given by Maedche et al. (2002b).



Additionally, the execution interfaces allow users to define analysis tasks. For example, Hotho et al. (2001) have described how the overall document clustering task may be influenced by selecting a predefined set of concepts along which clustering should be performed.

4. THE HR-TOPICBROKER: A CASE STUDY

The HR-TopicBroker is a simple tracking system that has been developed for DaimlerChrysler AG. The HR-TopicBroker is a system that supports the location of relevant human resource (HR) topics (strategies, trends, etc.) on the Web. Thus, the underlying idea of this system is that the HR strategy of the DaimlerChrysler group is modeled in the form of an ontology. This explicit representation of relevant tracking topics serves as input for the underlying information containers and for the presentation module.

For searching and tracking core information about the HR strategy we used the focused crawler as described earlier and initialized it with a set of predefined URLs (business schools, competitors, etc.). Additionally, we used the Google Web Service to get the top ten pages for entity labels and entity label combinations (e.g. Google was queried for ‘Volkswagen and E-Learning’). Thus, in the HR-TopicBroker system we only used documents as information providers. All collected documents are indexed using the HR strategy ontology.
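Generating search queries from entity labels and label combinations can be sketched in a few lines. Which combinations the HR-TopicBroker actually issued is not specified here, so the function `build_queries`, its `max_terms` parameter and the choice of all pairwise combinations are assumptions; the sketch only builds query strings and does not call any search service.

```python
from itertools import combinations


def build_queries(labels, max_terms=2):
    """Build query strings for single labels and label combinations."""
    queries = []
    for size in range(1, max_terms + 1):
        for combo in combinations(labels, size):
            queries.append(" and ".join(combo))
    return queries


queries = build_queries(["Volkswagen", "E-Learning"])
print(queries)   # ['Volkswagen', 'E-Learning', 'Volkswagen and E-Learning']
```

Since the number of combinations grows quickly with the number of labels, a practical setup would probably restrict combinations, for example to labels taken from different branches of the ontology.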

The HR-TopicBroker user interface has been embedded in the DaimlerChrysler Intranet and was made accessible for HR managers. On the presentation side we selected a Yahoo!-like, hierarchical presentation that allows for browsing, supporting a document-centric view on the ontology. Figure 5 shows a screenshot of the running application. By clicking on ‘E-Learning’ one gets a list of relevant documents.

Figure 5. HR-TOPICBROKER



The system also provides means for defining additional information and for information sharing. We considered the case that people discover a relevant Web page while browsing along the proposed Web pages. Therefore, we offered users the possibility of defining links for entities contained in the ontology. This approach also has the advantage that the focused crawler gets new starting points for its focused search for relevant information.

Additionally, we also allowed the joint definition of a ‘knowledge base’ on top of the discovered and tracked documents. The knowledge base in the HR-TopicBroker application consisted of instances and instance relationships manually added to the OI-Model. For example, it is possible to define contact partners for specific topics, e.g. the definition of a relation between the entity E-Learning and a research institute in this field. We used a template-based approach within the HR-TopicBroker system to collect these kinds of instances and instance relationships. Thus, the underlying complexity of defining an instance and an instance relationship was not shown to the HR managers. Browsing of the contents contained in the knowledge base is again done along the Yahoo!-like topic hierarchy. However, when clicking on a specific topic the related instances are shown to the user. This feature is inspired by the resource-driven browsing implemented by the KAON portal as described earlier.

5. RELATED WORK

There is an active research field called ‘competitive intelligence’ (Vedder et al., 1999). ‘Competitive intelligence’ is to be considered as a systematic and ethical program for gathering, analyzing, and managing external information that can affect your company’s plans, decisions, and operations.⁵ Ong et al. (2001) presented a tool called FOCI for flexible organization for competitive intelligence. FOCI allows a user to define and personalize the organization of the information clusters according to their needs and preferences into portfolios. Predefined sections for organizing information in specific domains are also supported. The personalized portfolios created can be saved and subsequently tracked and shared with other users. In contrast to our work, FOCI only focuses on pure information as provided in the form of documents. It does not consider any kind of information containers as input.

Work similar to that presented in this paper has been done by Kalfoglou et al. (2001). In their work, they present myPlanet, an ontology-driven personalized Web-based service. The existing infrastructure of the PlanetOnto news publishing system is extended with ontology-based functionality focusing on the easy access to repositories of news items, a rich resource for information sharing. In contrast to our work, their approach is mainly focusing on news, whereas our approach provides a wider range of information containers as input.

With respect to using ontologies for document indexing and information retrieval, it has been shown that, in clearly defined domains, ontologies are adequate means for improving recall and precision values (Aitken and Reid, 2000). Recently, in the context of focused crawling, much work has been done; see Chakrabarti et al. (1999) and Diligenti et al. (2000). This work differs from ours in the sense that it only focuses on Web document discovery and not on connecting information containers. Furthermore, none of these approaches provides any kind of ontology-based indexing and integration on top of the results of the focused crawler.

5 http://www.scip.org/ci/index.asp



6. CONCLUSION

In this paper we presented a knowledge management module focusing on the concrete problem of tracking relevant information on the Web. Ontologies represented in the form of OI-Models form the backbone of our tracking framework. Within this framework we distinguish between core information available on the Web and information containers that already provide a higher-level access to the information available on the Web. The distinction between these two kinds of information provider has important effects on the way we connect them with ontologies. First, for the core information we provide a focused crawler that supports an ontology-driven search on the Web. Second, for the information containers we use different kinds of interface, such as Web Services and wrappers. In the middle layer of our framework we bring the different information sources together and allow one to use analysis methods for extracting implicit patterns contained in the distributed information pieces. Finally, we allow for browsing and querying the information using a document-centered and resource-centered approach in parallel.

In the future, we will further focus on research on how to generate ontologies automatically, because the need to define ontologies by hand is one of the main drawbacks of our current approach. In this context we have introduced the concept of ontology learning (Maedche, 2002), which allows for semi-automatic generation of ontologies from existing sources. Additionally, with respect to the costs associated with connecting existing information sources to ontologies, we consider Web services as a promising direction. However, there is still much work to be done to allow for automatic connection of available services.

ACKNOWLEDGEMENTS

Research for this paper was financed by the European Commission, IST, project ‘Ontologging’ (IST-2000-28293) and by DaimlerChrysler AG, Germany. Special thanks go to Lars Kuehn, who implemented the HR-TopicBroker application. Thanks to Klaus Goetz, DaimlerChrysler AG, for providing stimulating comments on the HR-TopicBroker application.

REFERENCES

Aitken S, Reid S. 2000. Evaluation of an ontology-based information retrieval tool. In Proceedings of the ECAI-2000 Workshop on Ontologies and PSMs, Berlin, Germany.

Chakrabarti S, van den Berg M, Dom B. 1999. Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of WWW-8.

Diligenti M, Coetzee FM, Lawrence S, Giles CL, Gori M. 2000. Focused crawling using context graphs. In Proceedings of the International Conference on Very Large Databases (VLDB-00), 2000; 527–534.

Hotho A, Maedche A, Staab S. 2001. Ontology-based text clustering. In Proceedings of the IJCAI-2001 Workshop ‘Text Learning: Beyond Supervision’, August, Seattle, USA.

Kalfoglou Y, Domingue J, Motta E, Vargas-Vera M, Buckingham Shum S. 2001. myPlanet: an ontology-driven Web-based personalised news service. In Proceedings of the IJCAI’01 Workshop on Ontologies and Information Sharing, Seattle, WA, USA, August.

Maedche A. 2002. Ontology Learning for the Semantic Web. Kluwer Academic Publishers.

Maedche A, Staab S. 2000. Discovering conceptual relations from text. In ECAI-2000: European Conference on Artificial Intelligence. Proceedings of the 13th European Conference on Artificial Intelligence. IOS Press: Amsterdam.

Maedche A, Zacharias V. 2002. Clustering ontology-based metadata in the Semantic Web. In Proceedings of the Joint Conferences 13th European Conference on Machine Learning (ECML’02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’02). Springer, LNAI: Helsinki, Finland.

Maedche A, Ehrig M, Handschuh S, Volz R, Stojanovic L. 2002a. Ontology-focused crawling of documents and relational metadata. In Proceedings of the Eleventh International World Wide Web Conference WWW-2002, May (Poster).

Maedche A, Staab S, Sure Y, Studer R, Volz R. 2002b. SEAL: tying up information integration and Web site management by ontologies. In IEEE Data Engineering Bulletin.

Motik B, Maedche A, Volz R. 2002. A conceptual modeling approach for semantics-driven enterprise applications. Internal Research Report, University of Karlsruhe, 2002. Available at http://kaon.aifb.uni-karlsruhe.de/conc-model.

Ong H, Tan A-H, Ng J, Pan H, Li Q-X. 2001. FOCI: flexible organizer for competitive intelligence. In Proceedings of the 2001 ACM CIKM International Conference on Information and Knowledge Management, Atlanta, Georgia, USA, 5–10 November, ACM.

Stojanovic L, Stojanovic N, Volz R. 2002. Migrating data-intensive Web sites into the Semantic Web. In Proceedings of the ACM Symposium on Applied Computing SAC-02, Madrid.

Vedder RG, Vanecek MT, Guynes CS, Cappel JJ. 1999. CEO and CIO perspectives on competitive intelligence. Communications of the ACM 42(8): 108–116.
