
POLITECNICO DI TORINO

Doctoral School

Ph.D. Program in Computer and Automation Engineering – XVIII cycle

Doctoral Thesis

Architectures and Algorithms for Intelligent Web Applications
How to bring more intelligence to the web and beyond

Dario Bonino

Advisor: prof. Fulvio Corno
Ph.D. Program Coordinator: prof. Pietro Laface

December 2005


Acknowledgements

Many people deserve a very grateful acknowledgment for their role in supporting me during these long and exciting 3 years. First of all I would cite my adviser, Fulvio Corno, who always supported and guided me toward the best decisions and solutions. Together with Fulvio I want to thank Laura Farinetti very much too: she was there any time I needed her help, for both insightful discussions and stupid questions. Thanks to all my colleagues in the e-Lite research group for their constant support, for their kindness and their ability to ignore my bad moments. Particular thanks shall go to Alessio for being not only my best colleague and competitor, but one of the best friends I've ever had. The same goes to Paolo: our railway discussions have been so interesting and useful!

Thanks to Mike, who introduced me to many Linux secrets, to Franco for being the calmest person I've ever known, and to Alessandro "the Eye tracker", the best surfer I've ever met.

Thanks to my parents Ercolino and Laura and to my sister Serena: they have always been my springboard and my unbreakable backbone.

Thank you to all the people I've met in these years who are not cited here; I am very glad to have been with you, even if only for a few moments. Thank you!


Contents

Acknowledgements

1 Introduction
    1.1 Motivation
    1.2 Domain
    1.3 Contribution
    1.4 Structure of the Thesis

2 The Semantic Web vision
    2.1 Semantic Web Technologies
        2.1.1 Explicit Metadata
        2.1.2 Ontologies
    2.2 Logic
        2.2.1 Agents
    2.3 Semantic Web Languages
        2.3.1 RDF and RDF-Schema
        2.3.2 OWL languages
        2.3.3 OWL in a nutshell

3 Web applications for Information Management
    3.1 The knowledge sharing and life cycle model
    3.2 Software tools for knowledge management
        3.2.1 Content Management Systems (CMS)
        3.2.2 Information Retrieval systems
        3.2.3 e-Learning systems

4 Requirements for Semantic Web Applications
    4.1 Functional requirements
    4.2 Use Cases for Functional Requirements
        4.2.1 The "Semantic what's related"
        4.2.2 The "Directory search"
        4.2.3 The "Semi-automatic classification"
    4.3 Non-functional requirements

5 The H-DOSE platform: logical architecture
    5.1 The basic components of the H-DOSE semantic platform
        5.1.1 Ontology
        5.1.2 Resources
        5.1.3 Annotations
    5.2 Principles of Semantic Resource Retrieval
        5.2.1 Searching for instances
        5.2.2 Dealing with annotations
        5.2.3 Queries
        5.2.4 Searching by conceptual spectra
    5.3 Bridging the gap between syntax and semantics
        5.3.1 Focus-based synset expansion
        5.3.2 Statistical integration
    5.4 Experimental evidence
        5.4.1 Multilingual approach results
        5.4.2 Conceptual spectra experiments
        5.4.3 Automatic learning of text-to-concept mappings

6 The H-DOSE platform
    6.1 A layered view of H-DOSE
        6.1.1 Service Layer
        6.1.2 Kernel Layer
        6.1.3 Data-access layer
        6.1.4 Management and maintenance sub-system
    6.2 Application scenarios
        6.2.1 Indexing
        6.2.2 Search
    6.3 Implementation issues

7 Case studies
    7.1 The Passepartout case study
        7.1.1 Results
    7.2 The Moodle case study
    7.3 The CABLE case study
        7.3.1 System architecture
        7.3.2 mH-DOSE
    7.4 The Shortbread case study
        7.4.1 System architecture
        7.4.2 Typical Operation Scenario
        7.4.3 Implementation

8 H-DOSE related tools and utilities
    8.1 Genetic refinement of semantic annotations
        8.1.1 Semantics powered annotation refinement
        8.1.2 Evolutionary refiner
    8.2 OntoSphere
        8.2.1 Proposed approach
        8.2.2 Implementation and preliminary results

9 Semantics beyond the Web
    9.1 Architecture
        9.1.1 Application Interfaces, Hardware and Appliances
        9.1.2 Device Drivers
        9.1.3 Communication Layer
        9.1.4 Event Handler
        9.1.5 House Model
        9.1.6 Domotic Intelligence System
        9.1.7 Event Logger
        9.1.8 Rule Miner
        9.1.9 Run-Time Engine
        9.1.10 User Interfaces
    9.2 Testing environment
        9.2.1 BTicino MyHome System
        9.2.2 Parallel port and LEDs
        9.2.3 Music Server (MServ)
        9.2.4 Experimental Setup
    9.3 Preliminary Results

10 Related Works
    10.1 Automatic annotation
    10.2 Multilingual issues
        10.2.1 Term-related issues
    10.3 Semantic search and retrieval
    10.4 Ontology visualization
    10.5 Domotics

11 Conclusions and Future works
    11.1 Conclusions
    11.2 Future Works

Bibliography

A Publications


Chapter 1

Introduction

The Semantic Web will be an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Tim Berners-Lee

The new generation of the Web will enable humans to gain wisdom of living, working, playing, and learning, in addition to information search and knowledge queries.

Ning Zhong

1.1 Motivation

Over the last decade the World Wide Web gained great momentum, rapidly becoming a fundamental part of our everyday life. In personal communication, as well as in business, the impact of the global network has completely changed the way people interact with each other and with machines. This revolution touches all aspects of people's lives and is gradually pushing the world toward a "Knowledge society" where the most valuable resources will no longer be material but informative.

The way we think of computers has also been influenced by this development: we are in fact evolving from thinking of computers as "calculus engines" to considering them as "gateways", or entry points, to the newly available information highways.

The popularity of the Web has led to the exponential growth of published pages and services observed in these years. Companies are now offering web pages to advertise and sell their products. Learning institutions are presenting teaching material and on-line training facilities. Governments provide web-accessible administrative


services to ease citizens' lives. Users build up communities to exchange any kind of information and/or to form more powerful market actors able to survive in this global ecosystem.

This stunning success is also the curse of the current web: most of today's Web content is only suitable for human consumption, and the huge amount of available information makes it increasingly difficult for users to find and access what they require. Under these conditions, keyword-based search engines, such as AltaVista, Yahoo, and Google, are the main tools for using today's Web. However, there are serious problems associated with their use:

• High recall, low precision: relevant pages are buried among thousands of pages of low interest.

• Low recall: although rarer, it sometimes happens that queries get no answers because they are formulated with the wrong words.

• Results are sensitive to vocabularies: for example, the adoption of different synonyms of the same keyword may lead to different results.

• Searches are for documents and not for information.

Even if the search process is successful, the result is only a "relevant" set of web pages that the user must scan to find the required information. In a sense, the term used to classify the involved technologies, Information Retrieval, is in this case rather misleading; Location Retrieval might be better.

The critical point is that, at present, machines are not able, without heuristics and tricks, to understand documents published on the web and to extract only the relevant information from pages. Of course there are tools that can retrieve text, split phrases, count words, etc. But when it comes to interpreting and extracting useful data for the users, the capabilities of current software are still limited.

One solution to this problem consists in keeping information as it currently is and in developing sophisticated tools that use artificial intelligence techniques and computational linguistics to "understand" what is written in web pages. This approach has been pursued for a while but, at present, still appears too ambitious.

Another approach is to define the web in a more machine-understandable fashion and to use intelligent techniques to take advantage of this representation.

This plan of revolutionizing the web is usually referred to as the Semantic Web initiative and is only a single aspect of the next evolution of the web, the Wisdom Web.

It is important to notice that the Semantic Web does not aim at being parallel to the World Wide Web; instead, it aims at evolving the Web into a new knowledge-centric, global network. Such a new network will be populated by intelligent web agents able


to act on behalf of their human counterparts, taking into account the semantics of information (meaning). Users will once more be the center of the Web, but they will be able to communicate and to use information with a more human-like interaction, and they will also be provided with ubiquitous access to such information.

1.2 Domain: the long road from today's Web to the Wisdom Web

Starting from the current web, the ongoing evolution aims at transforming today's syntactic World Wide Web into the future Wisdom Web. The fundamental capabilities of this new network include:

• Autonomic Web support, allowing the design of self-regulating systems able to cooperate autonomously with other available applications/information sources.

• Problem Solving capabilities, for specifying, identifying and solving roles, settings and relationships between services.

• Semantics, ensuring the "right" understanding of involved concepts and the right context for service interactions.

• Meta-Knowledge, for defining and/or addressing the spatial and temporal constraints or conflicts in planning and executing services.

• Planning, for enabling services or agents to autonomously reach their goals and subgoals.

• Personalization, for understanding recent encounters and for relating different episodes together.

• A sense of humor, so services on the Wisdom Web will be able to interact with users on a personal level.

These capabilities stem from several active research fields, including the Multi-Agent community, the Semantic Web community and the Ubiquitous Computing community, and they impact, or use, technologies developed for databases, computational grids, social networks, etc. The Wisdom Web, in a sense, is the place where most of the currently available technologies and their evolutions will join in a single scenario with a "devastating" impact on human society.

Many steps, however, still separate us from this future web, the Semantic Web initiative being the first serious and world-wide attempt to build the necessary infrastructure.


Semantics is one of the cornerstones of the Wisdom Web. It is founded on the formal definition of the "concepts" involved in web pages and in services available on the web. Such a formal definition needs two critical components: knowledge models and associations between knowledge and resources. The former are known as Ontologies while the latter are usually referred to as Semantic Annotations.

There are several issues related to the introduction of semantics on the web: how should knowledge be modeled? How should knowledge be associated with real-world entities or with real-world information?

The Semantic Web initiative builds on the evidence that creating a common, monolithic, omni-comprehensive knowledge model is infeasible; instead, it assumes that each actor playing a role on the web shall be able to define its own model according to its own view of the world. In the SW vision, the global knowledge model is the result of a shared effort in building and linking the single models developed all around the world, much as happened for the current Web. Of course, in such a process conflicts will arise, to be solved by proper tools for mapping "conceptually similar" entities in the different models.

The definition of knowledge models alone does not introduce semantics on the web; in order to get such a result another step is needed: resources must be "associated" with the models. The current trend is to perform this association by means of Semantic Annotations. A Semantic Annotation is basically a link between a resource on the web, be it a web page, a video or a music piece, and one or more concepts defined by an ontology. So, for example, the pages of a news site can be associated with the concept news in a given ontology by means of an annotation, in the form of a triple: about(site, news).
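As a minimal sketch (the namespace and site URL are illustrative assumptions, with ex denoting a hypothetical news ontology), such an annotation could be expressed in the RDF syntax introduced later in this thesis:

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:ex="http://www.example.org/news-ontology#">
 <!-- the news site (subject) is linked to the ontology concept news (object) -->
 <rdf:Description rdf:about="http://www.example-news.net">
  <ex:about rdf:resource="http://www.example.org/news-ontology#news"/>
 </rdf:Description>
</rdf:RDF>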

Semantic annotations are not only for documents or web pages: they can be used to associate semantics with nearly all kinds of informative, physical and meta-physical entities in the world.

Many issues are involved in the annotation process. Just to mention some of them: at which granularity shall these annotations be defined? How can we decide whether a given annotation is trusted or not? Who shall annotate resources? Shall it be the creator, or anyone on the web?

To answer these and the previous questions, standard languages and practices for semantic markup are needed, together with a formal logic for reasoning about the available knowledge and for turning implicit information into explicit facts. Around them, database engines for storing semantics-rich data and search engines offering new question-answering interfaces constitute the informatics backbone of the new information highway so defined.

Other pillars of the forthcoming Wisdom Web, which are more "technical", include autonomic systems, planning and problem solving, etc. For them, great improvements are currently being provided by the ever-active community of Intelligent


and Multi-Agent systems. In this case the problems involved are slightly different from the ones cited above, although semantics integration can be interesting in many respects. The main research stream is in fact about designing machines able to think and act rationally. In simple terms, the main concern in this field is to define systems able to autonomously pursue goals and subgoals defined either by humans or by other systems.

Meta-knowledge plays a crucial role in such a process, providing means for modeling spatial and temporal constraints, or conflicts, that may arise between agents' goals, and it can be, in turn, strongly based on semantics. Knowledge modeling can, in fact, support the definition and discovery of similarities and relationships between constraints, in a way that is independent of the dialects with which each single agent composes its world understanding.

Personalization, finally, is not a new discipline; it is historically concerned with interaction modalities between users and machines, defining methodologies and instruments to design usable interfaces for all people, be they "normal" users or differently able people. It impacts, or interacts with, many research fields, starting from Human Computer Interaction and encompassing Statistical Analysis of user preferences and Prediction Systems. Personalization is a key factor for the Wisdom Web exploitation: until usable and efficient interfaces become available, this new web, in which the available information will be many orders of magnitude wider than in the current one, will not be adopted.

The good news is that all the cited issues can be solved without requiring revolutionary scientific progress. We can in fact reasonably claim that the challenge is only in engineering and technological adoption, as partial solutions to all the relevant parts of the scenario already exist. At present, the greatest needs appear to be in the areas of integration, standardization, development of tools and adoption by the users.

1.3 Contribution

In this Thesis, methodologies and techniques for paving the way that leads from today's web applications to the Wisdom Web have been studied, with a particular focus on information retrieval systems, content management systems and e-Learning systems. A new platform for supporting the easy integration of semantic information into today's systems has been designed and developed, and has been applied to several case studies: a custom-made CMS, a publicly available e-Learning system (Moodle [1]), an intelligent proxy for web navigation (Muffin [2]) and a life-long learning system developed in the context of the CABLE project [3] (an EU-funded Minerva Project).

In addition, some extensions of the proposed system to environments sharing


with the Web the underlying infrastructure and the communication and interactionparadigms have been studied. A case study is provided for domotic systems.

Several contributions to the state of the art in semantic systems can be found in the components of the platform, including: an extension of T.R. Gruber's ontology definition, which allows transparent support of multilingual knowledge domains; a new annotation "expansion" system that leverages the information encoded in ontologies to extend semantic annotations; and a new "conceptual" search paradigm based on a compact representation of semantic annotations called the Conceptual Spectrum. The semantic platform discussed in this thesis is named H-DOSE (Holistic Distributed Semantic Elaboration Platform) and is currently available as an Open Source project on SourceForge: http://dose.sourceforge.net.

H-DOSE has been entirely developed in Java to allow better interoperability with existing web systems, and is currently deployed as a set of web services running on the Apache Tomcat servlet container. It is, at present, available in two different forms: one intended for micro enterprises, characterized by a small footprint on the server on which it runs, and one for small and medium enterprises, which adds the ability to distribute jobs on different machines by means of agents and includes principles of autonomic computing for keeping the underlying knowledge base constantly up to date. Rather than being an isolated attempt at semantics integration in the current web, H-DOSE is still a very active project and is undergoing several improvements and refinements for better supporting the indexing and retrieval of non-textual information such as video clips, audio pieces, etc. There is also ongoing work on the integration of H-DOSE into competitive intelligence systems, as done by IntelliSemantic: a start-up of the Turin Polytechnic that builds its business plan on the adoption of semantic techniques, and in particular of the H-DOSE platform, for patent discovery services.

Finally, several side issues related to semantics handling and deployment in web applications have been addressed during the H-DOSE design; some of them are also presented in this thesis. A newly designed ontology visualization tool based on multi-dimensional information spaces is an example.

1.4 Structure of the Thesis

The remainder of this thesis is organized as follows:

Chapter 2 introduces the vision of the Semantic Web and discusses the data model, standards, and technologies used to bring this vision into being. These building blocks are used in the design of H-DOSE, trying to maximize the reuse of already available and well-tested technologies, thus avoiding reinventing the wheel.

Chapter 3 moves in parallel with the preceding chapter, introducing an overview


of currently available web applications, with a particular focus on systems for information management such as Content Management Systems, indexing and retrieval systems, and e-Learning systems. For every category of application, the points where semantics can give substantial improvements, either in effectiveness (performance) or in user experience, are evidenced.

Chapter 4 defines the requirements for the H-DOSE semantic platform, as they emerge from interviews with web actors such as content publishers, site administrators and so on.

Chapter 5 introduces the H-DOSE logical architecture, and uses such architecture as a guide for discussing the basic principles and assumptions on which the platform is built. For every innovative principle the strong points are evidenced, together with the weaknesses that emerged either during the presentation of such elements at international conferences and workshops or during the H-DOSE design and development process.

Chapter 6 describes the H-DOSE platform in deep detail, focusing on the role of, and the interactions involving, every single component of the platform. The main concern of this chapter is to provide a complete view of the platform, in its more specific aspects, discussing the adopted solutions from a "software engineering" point of view.

Chapter 7 presents the case studies that constituted the benchmark of the H-DOSE platform. Each case study is addressed separately, starting from a brief description of requirements and going through the integration design process, the deployment of the H-DOSE platform, and the phase of results gathering and analysis.

Chapter 8 is about the H-DOSE related tools developed during the platform design and implementation. They include a new ontology visualization tool and a genetic algorithm for the refinement of semantic annotations.

Chapter 9 discusses the extension of H-DOSE principles and techniques to non-Web scenarios, with a particular focus on domotics. An ongoing project on semantics-rich house gateways is described, highlighting how the lessons learned in the design and development of H-DOSE can be applied in a completely different scenario, while still retaining their value.

Chapter 10 presents the related works in the fields of both the Semantic Web and Web Intelligence, with a particular focus on semantic platforms and semantics integration on the Web.

Chapter 11 eventually concludes the thesis and provides an overview of possible future works.


Chapter 2

The Semantic Web vision

This chapter introduces the vision of the Semantic Web and discusses the data model, standards, and technologies used to bring this vision into being. These building blocks are used in the design of H-DOSE, trying to maximize the reuse of already available and well-tested technologies, thus avoiding reinventing the wheel.

The Semantic Web is developed layer by layer; the pragmatic justification for such a procedure is that it is easier to achieve consensus on small steps, whereas it is much harder to make everyone agree on very wide proposals. In fact there are many research groups exploring different and sometimes conflicting solutions. After all, competition is one of the major driving forces of scientific development. Such competition makes it very hard to reach agreement on wide steps, and often only a partial consensus can be achieved. The Semantic Web builds upon the steps for which consensus can be reached, instead of waiting to see which alternative research line will be successful in the end.

The Semantic Web is such that companies, research groups and users must build tools, add content and use that content. It is certainly myopic to wait until the full vision materializes: it may take another ten years to realize the full extent of the SW, and many more years for the Wisdom Web.

In evolving from one layer to another, two principles are usually followed:

• Downward compatibility: applications, or agents, fully compliant with a layer shall also be aware of the lower layers, i.e., they shall be able to interpret and use information coming from those layers. As an example we can consider an application able to understand the OWL semantics. The same application shall also take full advantage of information encoded in RDF and RDF-S [4].

• Partial upward understanding: agents fully aware of a given layer should


take at least partial advantage of information from higher levels. So, an RDF-aware agent should also be able to use information encoded in OWL [5], ignoring those elements that go beyond RDF and RDF Schema.

Figure 2.1. The Semantic Web "cake".

The layered cake of the Semantic Web (due to Tim Berners-Lee) is shown in Figure 2.1 and describes the main components involved in the realization of the Semantic Web vision. At the bottom lies XML (eXtensible Markup Language), a language for writing well-structured documents according to a user-defined vocabulary. XML is a "de facto" standard for the exchange of information over the World Wide Web. On top of XML builds the RDF layer.

RDF is a simple data model for writing statements about Web objects. RDF is not XML; however, it has an XML-based syntax, so in the cake it is located above the XML layer.

RDF-Schema defines the vocabulary used in RDF data models. It can be seen as a very primitive language for defining ontologies, as it provides the basic building blocks for organizing Web objects into hierarchies. Supported constructs include: classes and properties, the subClass and subProperty relations, and the domain and range restrictions. RDF-Schema uses an RDF syntax.

The Logic layer is used to further enhance the ontology support offered by RDF-Schema, thus allowing the modeling of application-specific declarative knowledge.

The Proof layer, instead, involves the process of deductive reasoning as well as the process of providing and representing proofs in Web languages. Applications lying at the proof level shall be able to reason about the knowledge data defined in the lower layers and to provide conclusions together with "explanations" (proofs) of the deductive process leading to them.


The Trust layer, in the end, will emerge through the adoption of digital signatures and other kinds of knowledge, based on recommendations by trusted agents, by rating and certification agencies or even by consumer organizations. The expression "Web of Trust" means that trust over the Web will be organized in the same distributed and sometimes chaotic way as the WWW itself. Trust is crucial for the final exploitation of the Semantic Web vision: until users have trust in its operations (security) and in the quality of the information provided (relevance), the SW will not reach its full potential.

2.1 Semantic Web Technologies

The Semantic Web cake depicted above builds upon the so-called Semantic Web Technologies. These technologies empower the foundational components of the SW, which are introduced separately in the following subsections.

2.1.1 Explicit Metadata

At present, the World Wide Web is mainly formatted for human users rather than for programs. Pages, either static or dynamically built using information stored in databases, are written in HTML or XHTML. A typical web page of an ICT consultancy agency can look like this:

<html>
<head></head>
<body>
<h1> SpiderNet internet consultancy,
network applications and more </h1>
<p> Welcome to the SpiderNet web site, we offer
a wide variety of ICT services related to the net.
<br/> Adam Jenkins, our graphics designer has designed many
of the most famous web sites as you can see in
<a href="gallery.html">the gallery</a>.
Matt Kirkpatrick is our Java guru and is able to develop
any new kind of functionalities you may need.
<br/> If you are seeking a great new opportunity
for your business on the web, contact us at the
following e-mails:
<ul>
<li>[email protected]</li>
<li>[email protected]</li>
</ul>
Or you may visit us in the following opening hours
<ul>
<li>Mon 11am - 7pm</li>
<li>Tue 11am - 2pm</li>
<li>Wed 11am - 2pm</li>
<li>Thu 11am - 2pm</li>
<li>Fri 2pm - 9pm</li>
</ul>
Please note that we are closed every weekend and every festivity.
</p>
</body>
</html>

For people, the provided information is presented in a rather satisfactory way, but for machines this document is nearly incomprehensible. Keyword-based techniques might be able to identify the words web site, graphics designer and Java. And an intelligent agent could identify the email addresses and the personnel of the agency, and with a little bit of heuristics it might associate each employee with the correct e-mail address. But it would have trouble distinguishing who is the graphics designer and who is the Java developer, and even more difficulty capturing the opening hours (for which the agent would have to understand what festivities are celebrated during the year, and on which days, depending on the location of the agency, which in turn is not explicitly available in the web page). The Semantic Web tries to address these issues not by developing super-intelligent agents able to understand information as humans do. Instead it acts on the HTML side, trying to replace this language with more appropriate languages, so that web pages can carry their content in a machine-processable form while remaining visually appealing for the users. In addition to formatting information for human users, these new web pages will also carry information about their content, such as:

<company type="consultancy">
 <service>Web Consultancy</service>
 <products> Web pages, Web applications </products>
 <staff>
  <graphicsDesigner>Adam Jenkins</graphicsDesigner>
  <javaDeveloper>Matt Kirkpatrick</javaDeveloper>
 </staff>
</company>


This representation is much easier for machines to understand and is usually known as metadata, which means: data about data. Metadata encodes, in a sense, the meaning of data, thus defining the semantics of a web document (hence the term Semantic Web).

2.1.2 Ontologies

The term ontology stems from philosophy. In that context, it is used to name a subfield of philosophy, namely the study of the nature of existence (from the Greek ὀντολογία), the branch of metaphysics concerned with identifying, in general terms, the kinds of things that actually exist, and how to describe them. For example, the observation that the world is made up of specific entities that can be grouped into abstract classes based on shared properties is a typical ontological commitment.

For what concerns today's technologies, however, ontology has been given a specific meaning that is quite different from the original one. For the purposes of this thesis, T.R. Gruber's definition, later refined by R. Studer, can be adopted: An ontology is an explicit and formal specification of a conceptualization.

In other words, an ontology formally describes a knowledge domain. Typically, an ontology is composed of a finite list of terms and the relationships between these terms. The terms denote important concepts (classes of objects) of the domain.

Relationships include, among others, hierarchies of classes. A hierarchy specifies a class C to be a subclass of another class C′ if every object in C is also included in C′. Apart from the subclass relationship (also known as the "is A" relation), ontologies may include information such as:

• properties (X makes Y)

• value restrictions (only smiths can make iron tools; see the sketch after this list)

• disjointness statements (teachers and secretary staff are disjoint)

• specification of logical relationships between objects
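As a minimal sketch of the value restriction above, written in the OWL syntax presented later in this chapter (class and property names are illustrative assumptions):

<owl:Class rdf:ID="IronTool">
 <rdfs:subClassOf>
  <owl:Restriction>
   <!-- madeBy is assumed to be the inverse of the "makes" property -->
   <owl:onProperty rdf:resource="#madeBy"/>
   <!-- whoever makes an iron tool must be a smith -->
   <owl:allValuesFrom rdf:resource="#Smith"/>
  </owl:Restriction>
 </rdfs:subClassOf>
</owl:Class>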

In the context of the web, ontologies provide a shared understanding of a domain. Such an understanding is necessary to overcome differences in terminology. As an example, one web application may use the term "ZIP" for the same information that another denotes as "area code". Another problem arises when two applications use the same term with different meanings. Such differences can be overcome by associating a particular terminology with a shared ontology, and/or by defining mappings between different ontologies. In both cases, it is easy to notice that ontologies support semantic interoperability.

Ontologies are also useful for improving the results of Web searches. The search engine can look for pages that refer to a precise concept, or set of concepts, in


an ontology, instead of collecting all pages in which certain, possibly ambiguous, keywords occur. In the same way as above, ontologies allow differences in terminology between Web pages and queries to be overcome. In addition, when performing ontology-based searches it is possible to exploit generalization and specialization information. If a query fails to find any relevant documents (or provides too many results), the search engine can suggest a more general (more specific) query to the user [6]. It is even conceivable that the search engine runs such queries proactively, in order to reduce the reaction time in case the user adopts such a suggestion.

Ontologies can even be used to better organize Web sites and their navigation. Many of today's sites offer, on the left-hand side of their pages, the top levels of a concept hierarchy of terms. The user may click on them to expand the subcategories and to finally reach new pages in the same site.

In the Semantic Web layered approach, ontologies are located in between the third layer of RDF and RDF-S and the fourth level of abstraction, where the Web Ontology Language (OWL) resides.

2.2 Logic

Logic is the discipline that studies the principles of reasoning; in general, it offers formal languages for expressing knowledge and well-understood formal semantics. Logic usually works with so-called declarative knowledge, which describes what holds without caring about how it can be deduced.

Deduction can be performed by automated reasoners: software entities that have been extensively studied in Artificial Intelligence. Logical deduction (inference) allows implicit knowledge defined in a domain model (ontology) to be transformed into explicit knowledge. For example, if a knowledge base contains the following axioms in predicate logic,

human(X) → mammal(X)
PhDStudent(X) → human(X)
PhDStudent(Dario)

an automated inference engine can easily deduce that

human(Dario)
mammal(Dario)
PhDStudent(X) → mammal(X)

Logic can therefore be used to uncover ontological knowledge that is implicitly given and, by doing so, it can help reveal unexpected relationships and inconsistencies.


But logic is more general than ontologies and can also be used by agents formaking decisions and selecting courses of action, for example.

Generally there is a trade-off between expressive power and computational efficiency. The more expressive a logic is, the more computationally expensive it becomes to draw conclusions. And drawing conclusions can sometimes be impossible when non-computability barriers are encountered. Fortunately, a considerable part of the knowledge relevant to the Semantic Web seems to be of a relatively restricted form, and the required subset of logics is almost tractable and is supported by efficient reasoning tools.

Another important aspect of logic, especially in the context of the Semantic Web, is the ability to provide explanations (proofs) for the conclusions: the series of inferences can be retraced. Moreover, AI researchers have developed ways of presenting proofs in a human-friendly fashion, by organizing them as natural deductions and by grouping into a single element a number of small inference steps that a person would typically consider a single proof step.

Explanations are important for the Semantic Web because they increase users' confidence in Semantic Web agents. Even Tim Berners-Lee speaks of an "Oh yeah?" button that would ask for an explanation.

Of course, for logic to be useful on the Web, it must be usable in conjunction with other data, and it must be machine processable as well. From these requirements stem today's research efforts on representing logical knowledge and proofs in Web languages. Initial approaches work at the XML level, but in the future rules and proofs will need to be represented at the level of ontology languages such as OWL.

2.2.1 Agents

Agents are software entities that work autonomously and proactively. Conceptually they evolved out of the concepts of object-oriented programming and of component-based software development.

According to Tim Berners-Lee's article [7], a Semantic Web agent shall be able to receive some tasks and preferences from the user, seek information from Web sources, communicate with other agents, compare information about user requirements and preferences, select certain choices, and give answers back to the user. Agents will not replace human users on the Semantic Web, nor will they necessarily make decisions. In most cases their role will be to collect and organize information, and present choices for the users to select from.

Semantic Web agents will make use of all the outlined technologies, in particular:

• Metadata will be used to identify and extract information from Web Sources.


• Ontologies will be used to assist in Web searches, to interpret retrieved information, and to communicate with other agents.

• Logic will be used for processing retrieved information and for drawing conclusions.

2.3 Semantic Web Languages

2.3.1 RDF and RDF-Schema

RDF is essentially a data model whose basic building block is an object-attribute-value triple, called a statement. An example of a statement is: Kimba is a Lion.

This abstract data model needs a concrete syntax in order to be represented and exchanged, and RDF has been given an XML syntax. As a result, it inherits the advantages of the XML language. However, it is important to notice that other representations of RDF, not in XML syntax, are possible; N3 is an example.

RDF is, by itself, domain independent: no assumptions about a particular domain of application are made. It is up to each user to define the terminology to be used in his/her RDF data model using a schema language called RDF-Schema (RDF-S). RDF-Schema defines the terms that can be used in an RDF data model. In RDF-S we can specify which objects exist and which properties can be applied to them, and what values they can take. We can also describe the relationships between objects, so, for example, we can write: The lion is a carnivore.

This sentence means that all lions are carnivores. Clearly there is an intended meaning for the "is a" relation. It is not up to applications to interpret the "is a" term; its intended meaning shall be respected by all RDF processing software. By fixing the meaning of some elements, RDF-Schema enables developers to model specific knowledge domains.
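A minimal RDF-Schema sketch of such a domain model (class and property names are illustrative) could look like this:

<rdfs:Class rdf:ID="Lion">
 <!-- every Lion is also a Carnivore -->
 <rdfs:subClassOf rdf:resource="#Carnivore"/>
</rdfs:Class>
<rdf:Property rdf:ID="eats">
 <!-- eats applies to carnivores (domain) and takes animals as values (range) -->
 <rdfs:domain rdf:resource="#Carnivore"/>
 <rdfs:range rdf:resource="#Animal"/>
</rdf:Property>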

The principal elements of the RDF data model are: resources, properties and statements.

Resources are the objects we want to talk about. Resources may be authors, cities, hotels, places, people, etc. Every resource is identified by a sort of identity ID, called a URI. URI stands for Uniform Resource Identifier and provides a means to uniquely identify a resource, be it available on the web or not. URIs do not imply the actual accessibility of a resource and are therefore suitable not only for web resources but also for printed books, phone numbers, people and so on.

Properties are a special kind of resource that describe relations between the objects of the RDF data model, for example: "written by", "eats", "lives", "title", "color", "age", and so on. Properties are also identified by URIs. This choice allows, on one side, the adoption of a global, worldwide naming scheme, and on the other side the writing of statements having a property either as subject or as object. URIs also allow solving


the homonym problem that has been the plague of distributed data representation until now.

Statements assert the properties of resources. They are object-attribute-value triples consisting respectively of a resource, a property and a value. Values can either be resources or literals. Literals are atomic values (strings), which can have a specific XSD type, xsd:double as an example. A typical example of a statement is:

the H-DOSE website is hosted by www.sourceforge.net.

This statement can be rewritten in a triple form:

("H-DOSE web site", "hosted by", "www.sourceforge.net")

and in RDF it can be modeled as:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:mydomain="http://www.mydomain.net/my-rdf-ns">
 <rdf:Description rdf:about="http://dose.sourceforge.net">
  <mydomain:hostedBy>
   http://www.sourceforge.net
  </mydomain:hostedBy>
 </rdf:Description>
</rdf:RDF>

One of the major strong points of RDF is so-called reification: in RDF it is possible to make statements about statements, such as:

Mike thinks that Joy has stolen his diary

This kind of statement allows beliefs about, or trust in, other statements to be modeled, which is important for some kinds of applications. In addition, reification allows non-binary relations to be modeled using triples. The key idea, since RDF only supports triples, i.e., binary relationships, is to introduce an auxiliary object and relate it to each of the parts of the non-binary relation through the properties subject, predicate and object.
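As a minimal sketch (resource and property names are illustrative, with ex a hypothetical namespace), the belief statement above could be reified as:

<rdf:Statement rdf:ID="theft">
 <!-- the reified statement: Joy has stolen the diary -->
 <rdf:subject rdf:resource="#Joy"/>
 <rdf:predicate rdf:resource="#hasStolen"/>
 <rdf:object rdf:resource="#diary"/>
</rdf:Statement>
<rdf:Description rdf:about="#Mike">
 <!-- Mike's belief refers to the statement, not to Joy directly -->
 <ex:thinks rdf:resource="#theft"/>
</rdf:Description>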

So, for example, if we want to represent the ternary relationship referee(X,Y,Z), having the following well-defined meaning:

X is the referee in a tennis game between players Y and Z.


Figure 2.2. Representation of a tertiary predicate.

we have to break it into three binary relations, adding an auxiliary resource called tennisGame, as in Figure 2.2.
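In the spirit of Figure 2.2, the decomposition can be sketched as follows (property names are illustrative, with ex a hypothetical namespace):

<rdf:Description rdf:ID="tennisGame">
 <!-- the auxiliary resource carries one binary relation per argument -->
 <ex:referee rdf:resource="#X"/>
 <ex:firstPlayer rdf:resource="#Y"/>
 <ex:secondPlayer rdf:resource="#Z"/>
</rdf:Description>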

RDF critical view

As already mentioned, RDF only uses binary properties. This restriction could be a quite limiting factor, since we usually adopt predicates with more than two arguments. Fortunately, reification allows this issue to be overcome. However, some critical aspects arise from the adoption of the reification mechanism. First, although the solution is sound, the problem remains that non-binary predicates are more natural with more arguments. Secondly, reification is a quite complex and powerful technique, which may appear misplaced in a basic layer of the Semantic Web; it would have seemed more natural to include it in more powerful layers that provide richer representational capabilities.

In addition, the XML syntax of RDF is quite verbose and can easily become too cumbersome to be managed directly by users, especially for huge data models. Hence the adoption of user-friendly tools that automatically translate higher-level representations into RDF.

Finally, RDF is a standard format, therefore the benefits of drafting data in RDF can be seen as similar to those of drafting information in HTML in the early days of the Web.

From RDF/RDF-S to OWL

The expressiveness of RDF and RDF-Schema (described above) is very limited (and this is a deliberate choice): RDF is roughly limited to modeling binary relationships, and RDF-S is limited to sub-class hierarchies and property hierarchies, with restrictions on the domain and range of the latter.


However, a number of research groups have identified different characteristic use cases for the Semantic Web that would require much more expressiveness than RDF and RDF-S offer. Initiatives from both Europe and the United States came up with proposals for richer languages, respectively named OIL and DAML-ONT, whose merger, DAML+OIL, was taken by the W3C as the starting point for the Web Ontology Language OWL.

Ontology languages must allow users to write explicit, formal conceptualizations of domain knowledge; the main requirements are therefore:

• a well-defined syntax,

• a formal semantics,

• efficient reasoning support,

• sufficient expressive power,

• convenience of expression.

The importance of a well-defined syntax is clear, and known from the area of programming languages: it is a necessary condition for "machine understandability" and thus for machine processing of information. Both RDF/RDF-S and OWL have this kind of syntax. A formal semantics allows the meaning of knowledge to be described precisely. Precisely means that the semantics does not refer to subjective intuitions and is not open to different interpretations by different people (or different machines). The importance of a formal semantics is well known, for example, in the domain of mathematical logic. Formal semantics is needed to allow people to reason about knowledge. For ontologies, this means that we may reason about:

• Class membership. If x is an instance of a class C, and C is a subclass of D, we can infer that x is also an instance of D.

• Equivalence of classes. If a class A is equivalent to a class B, and B is equivalent to C, then A is equivalent to C, too.

• Consistency. Let x be an instance of A, and suppose that A is a subclass of B ∩ C and of D. Now suppose that B and D are disjoint. There is a clear inconsistency in our model, because A should be empty but has the instance x. Inconsistencies like this indicate errors in the ontology definition.

• Classification. If we have declared that certain property-value pairs are sufficient conditions for membership in a class A, then if an individual (instance) x satisfies such conditions, we can conclude that x must be an instance of A.


Semantics is a prerequisite for reasoning support. Derivations such as the preceding ones can be made by machines instead of being made by hand. Reasoning is important because it allows us to:

• check the consistency of the ontology and of the knowledge model,

• check for unintended relationships between classes,

• automatically classify instances.

Automatic reasoning allows many more cases to be checked than could be checked manually. Such checks become critical when developing large ontologies, where multiple authors are involved, as well as when integrating and sharing ontologies from various sources.

Formal semantics is obtained by defining an explicit mapping between an ontology language and a known logic formalism, and by using automated reasoners that already exist for that formalism. OWL, for instance, is (partially) mapped onto description logic, and makes use of existing reasoners such as FaCT, Pellet and RACER. Description logics are a subset of predicate logic for which efficient reasoning support is possible.

2.3.2 OWL languages

The full set of requirements for an ontology language (efficient reasoning support and convenience of expression, for a language as powerful as the combination of RDF-Schema with full logic) has been the main motivation for the W3C Web Ontology Working Group to split OWL into three different sublanguages, each targeted at different aspects of the full set of requirements.

OWL Full

The entire Web Ontology Language is called OWL Full and uses all the OWL language primitives. It also allows the combination of these primitives in arbitrary ways with RDF and RDF-Schema. This includes the possibility (already present in RDF) of changing the meaning of the predefined (RDF and OWL) primitives by applying the language primitives to each other. For example, in OWL Full it is possible to impose a cardinality constraint on the class of all classes, essentially limiting the number of classes that can be described in an ontology.

The advantage of OWL Full is that it is fully upward-compatible with RDF, both syntactically and semantically: any legal RDF document is also a legal OWL Full document, and any valid RDF/RDF-S conclusion is also a valid OWL Full conclusion. The disadvantage of OWL Full is that the language has become so powerful as to be undecidable, dashing any hope of complete (or efficient) reasoning support.

OWL DL

In order to regain computational efficiency, OWL DL (DL stands for Description Logic) is a sublanguage of OWL Full that restricts how the constructors from RDF and OWL may be used: applying OWL's constructors to each other is prohibited, so as to ensure that the language corresponds to a well-studied description logic.

The advantage of this is that it permits efficient reasoning support. The disadvantage is the loss of full compatibility with RDF: an RDF document will, in general, have to be extended in some ways and restricted in others before becoming a legal OWL DL document. Every OWL DL document is, in turn, a legal RDF document.

OWL Lite

An even further restriction limits OWL DL to a subset of the language constructors. For example, OWL Lite excludes enumerated classes, disjointness statements, and arbitrary cardinality.

The advantage of this is a language that is both easier to grasp for users and easier to implement for developers. The disadvantage is, of course, a restricted expressivity.

2.3.3 OWL in a nutshell

Header

OWL documents are usually called OWL ontologies and they are RDF documents. The root element of an ontology is an rdf:RDF element, which specifies a number of namespaces:

<rdf:RDF
  xmlns:owl ="http://www.w3.org/2002/07/owl#"
  xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:xsd ="http://www.w3.org/2001/XMLSchema#">

An OWL ontology can start with a set of assertions for housekeeping purposes. These assertions are grouped under an owl:Ontology element, which contains comments, version control, and inclusion of other ontologies.


<owl:Ontology rdf:about="">
  <rdfs:comment>A simple OWL ontology</rdfs:comment>
  <owl:priorVersion
    rdf:resource="http://www.domain.net/ontologyold"/>
  <owl:imports
    rdf:resource="http://www.domain2.org/savanna"/>
  <rdfs:label>Africa animals ontology</rdfs:label>
</owl:Ontology>

The most important of the above assertions is owl:imports, which lists other ontologies whose content is assumed to be part of the current ontology. It is important to be aware that owl:imports is a transitive property: if ontology A imports ontology B, and ontology B imports ontology C, then A also imports C.
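
The transitive behaviour of owl:imports is easy to mechanize. A small Python sketch (illustrative only: the ontology URIs are invented, and a real system would fetch and parse each imported document):

direct_imports = {
    "http://example.org/A": ["http://example.org/B"],
    "http://example.org/B": ["http://example.org/C"],
    "http://example.org/C": [],
}

def effective_imports(uri):
    # Every ontology whose content becomes part of `uri`
    seen, stack = set(), list(direct_imports.get(uri, []))
    while stack:
        imported = stack.pop()
        if imported not in seen:
            seen.add(imported)
            stack.extend(direct_imports.get(imported, []))
    return seen

# A imports B directly and C transitively, through B:
print(effective_imports("http://example.org/A"))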

Classes

Classes are defined using the owl:Class element and can be organized in hierarchies by means of the rdfs:subClassOf construct.

<owl:Class rdf:ID="Lion">
  <rdfs:subClassOf rdf:resource="#Carnivore"/>
</owl:Class>

It is also possible to indicate that two classes are completely disjoint, such as the herbivores and the carnivores, using the owl:disjointWith construct.

<owl:Class rdf:about="#carnivore">
  <owl:disjointWith rdf:resource="#herbivore"/>
  <owl:disjointWith rdf:resource="#omnivore"/>
</owl:Class>

Equivalence of classes may be defined using the owl:equivalentClass element. Finally, there are two predefined classes, owl:Thing and owl:Nothing, which indicate, respectively, the most general class, containing everything in an OWL document, and the empty class. As a consequence, every owl:Class is a subclass of owl:Thing and a superclass of owl:Nothing.
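
Since, as noted earlier, class equivalence is transitive (if A is equivalent to B and B to C, then A is equivalent to C), chains of owl:equivalentClass assertions partition classes into equivalence groups. A minimal sketch using union-find (the class names are hypothetical):

parent = {}

def find(c):
    # Representative of c's equivalence group, with path compression
    parent.setdefault(c, c)
    while parent[c] != c:
        parent[c] = parent[parent[c]]
        c = parent[c]
    return c

def declare_equivalent(a, b):
    parent[find(a)] = find(b)

declare_equivalent("Puma", "Cougar")
declare_equivalent("Cougar", "MountainLion")
# Puma and MountainLion end up in the same equivalence group:
print(find("Puma") == find("MountainLion"))  # True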

Properties

Two kinds of properties are defined in OWL:


• Object properties, which relate objects to other objects. An example, in the savanna ontology, is the relation eats.

• Datatype properties, which relate objects to datatype values. Examples are age, name, and so on. OWL has no predefined datatypes, nor does it provide special definition facilities; instead, it allows the use of XML Schema datatypes, making use of the layered architecture of the Semantic Web.

Here are two examples, the first of a datatype property and the second of an object property:

<owl:DatatypeProperty rdf:ID="age">
  <rdfs:range rdf:resource="&xsd;#nonNegativeInteger"/>
</owl:DatatypeProperty>

<owl:ObjectProperty rdf:ID="eats">
  <rdfs:domain rdf:resource="#animal"/>
</owl:ObjectProperty>

More than one domain and range can be declared; in such a case the intersection of the domains (ranges) is taken. OWL also allows “inverse properties” to be identified: a specific OWL element exists for them (owl:inverseOf), which relates a property to its inverse by interchanging the domain and range definitions.

<owl:ObjectProperty rdf:ID="eatenBy">
  <owl:inverseOf rdf:resource="#eats"/>
</owl:ObjectProperty>

Finally, equivalence of properties can be defined through the owl:equivalentProperty element.

Restrictions on properties

In RDFS it is possible to declare a class C as a subclass of a class C′; then every instance of C is also an instance of C′. OWL, in addition, allows one to specify classes whose instances all satisfy some precise conditions. This is done by defining C as a subclass of a class C″ which collects all the objects that satisfy the conditions; in general, C″ remains anonymous. OWL provides three specific elements for defining classes based on restrictions, namely owl:allValuesFrom, owl:someValuesFrom and owl:hasValue, which are always nested inside an owl:Restriction element. The owl:allValuesFrom element specifies a universal quantification (∀), while owl:someValuesFrom defines an existential quantification (∃).


<owl:Class rdf:about="#firstYearCourse">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#isTaughtBy"/>
      <owl:allValuesFrom
        rdf:resource="#Professor"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

This example requires every person who teaches an instance of “firstYearCourse”, e.g., a first-year subject, to be a professor (universal quantification).

<owl:Class rdf:about="#academicStaffMember">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#teaches"/>
      <owl:someValuesFrom
        rdf:resource="#undergraduateCourse"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

This second example, instead, requires that there exist an undergraduate course taught by an instance of the class of academic staff members (existential quantification).
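
What the two restriction types mean at the instance level can be sketched with a toy membership check in Python (all the data below is invented, and a real reasoner works on the ontology under open-world semantics, not on closed tables like these):

# owl:allValuesFrom is a universal condition, owl:someValuesFrom an
# existential one, over the values of the restricted property.
taught_by = {"course1": ["anna"], "course2": ["anna", "marco"]}
teaches = {"anna": ["course1", "course2"], "marco": ["course2"]}
is_professor = {"anna": True, "marco": False}
is_undergraduate = {"course1": True, "course2": False}

def satisfies_all_values_from(course):
    # firstYearCourse: *every* teacher of the course must be a Professor
    return all(is_professor[p] for p in taught_by[course])

def satisfies_some_values_from(person):
    # academicStaffMember: the person teaches *at least one*
    # undergraduate course
    return any(is_undergraduate[c] for c in teaches[person])

print(satisfies_all_values_from("course1"))  # True
print(satisfies_all_values_from("course2"))  # False: marco is not a professor
print(satisfies_some_values_from("marco"))   # False: no undergraduate course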

In general, an owl:Restriction element contains an owl:onProperty element and one or more restriction declarations. Restrictions defining the cardinality of a given class are also supported, through the elements:

• owl:minCardinality,

• owl:maxCardinality,

• owl:cardinality.

The latter is a shortcut for a cardinality definition in which owl:minCardinality and owl:maxCardinality assume the same value.

Special properties

Some characteristics of a property can be defined directly:


• owl:TransitiveProperty defines a transitive property, such as “has better grade than”, “is older than”, etc.

• owl:SymmetricProperty defines a symmetric property, such as “has same grade as” or “is sibling of”.

• owl:FunctionalProperty defines a property that has at most one value for each object, such as “age”, “height”, “directSupervisor”, etc.

• owl:InverseFunctionalProperty defines a property for which two different objects cannot have the same value, for example “is identity ID for”.

Instances

Instances of classes, in OWL, are declared as in RDF:

<rdf:Description rdf:ID="Kimba">
  <rdf:type rdf:resource="#Lion"/>
</rdf:Description>

OWL, unlike typical database systems, does not adopt a unique-names assumption; therefore, two instances that have different names are not required to be two different individuals. To ensure that different individuals are recognized as such by automated reasoners, inequality must be explicitly asserted.

<lecturer rdf:ID="91145">
  <owl:differentFrom rdf:resource="#98760"/>
</lecturer>

Because such inequality statements occur frequently, and the number of required statements would explode when stating the inequality of a large number of individuals, OWL provides a shorthand notation to assert pairwise inequality for all the individuals in a list: owl:AllDifferent.

<owl:AllDifferent>
  <owl:distinctMembers rdf:parseType="Collection">
    <lecturer rdf:about="#91345"/>
    <lecturer rdf:about="#91247"/>
    <lecturer rdf:about="#95647"/>
    <lecturer rdf:about="#98920"/>
  </owl:distinctMembers>
</owl:AllDifferent>

Note that owl:distinctMembers can only be used in combination with the owl:AllDifferent element.
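
To see the explosion that owl:AllDifferent avoids: asserting pairwise inequality explicitly requires n(n−1)/2 owl:differentFrom statements for n individuals. A short illustrative computation (the lecturer IDs are those of the example above):

from itertools import combinations

lecturers = ["91345", "91247", "95647", "98920"]
pairs = list(combinations(lecturers, 2))
print(len(pairs))  # 6 statements needed for just 4 individuals
for a, b in pairs:
    print(f"lecturer {a} owl:differentFrom lecturer {b}")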


Chapter 3

Web applications for Information Management

This chapter presents an overview of currently available web applications, with a particular focus on systems for information management such as content management systems, indexing and retrieval systems, and e-Learning systems. For every category of applications, the points in which semantics can give substantial improvements, either in effectiveness (performance) or in user experience, are highlighted.

Many businesses and activities, either on the web or not, are human- and knowledge-intensive. Examples include consulting, advertising, media, high-tech, pharmaceuticals, law, software development, etc. Knowledge-intensive organizations have already found that a large number of problems can be attributed to uncaptured and unshared product and process knowledge, to the lack of “who knows what” information, to the need to capture lessons learned and best practices, and to the need for more effective distance collaboration.

These realizations are leading to a growing call for knowledge management. Knowledge capture and learning can happen ad hoc (e.g., knowledge sharing around the coffee maker or problem discussions around the water cooler). Sharing, however, is more efficient when organized; moreover, knowledge must be captured, stored and organized according to the context of each company, or organization, in order to be useful and efficiently disseminated.

The knowledge items that an organization usually needs to manage can have different forms and contents. They include manuals, correspondence with vendors and customers, news, competitor intelligence, and knowledge derived from work processes (e.g., documentation, proposals, project plans, etc.), possibly in different formats (text, pictures, video). The amount of information and knowledge that a modern organization must capture, store and share, the geographic distribution of sources and consumers, and the dynamic evolution of information make the use of technology nearly mandatory.

3.1 The knowledge sharing and life cycle model

One of the most widely adopted knowledge sharing models was developed by Nonaka and Takeuchi in 1995 [8] and is called the “tacit-explicit model” (see Figure 3.1). In this model, tacit knowledge is knowledge that rests with the employee, and explicit knowledge is knowledge that resides in the knowledge base.

Figure 3.1. The Tacit-Explicit Model.

Conversion of knowledge from one form to another often leads to the creation of new knowledge; such conversion may follow four different patterns:

• Explicit-to-explicit knowledge conversion, or “Combination”, is the reconfiguration of explicit knowledge through sorting, adding, combining and categorizing. Often this process leads to new knowledge discovery.

• Explicit-to-tacit knowledge conversion, or “Internalization”, takes place when one assimilates knowledge acquired from knowledge items. This internalization contributes to the user's tacit knowledge and helps him/her in making future decisions.


• Tacit-to-explicit knowledge conversion, or “Externalization”, involves transforming context-based facts into context-free knowledge, with the help of analogies. Tacit knowledge is usually personal and depends on the person's experiences in various conditions. As a consequence, it is a strongly contextualized resource. Once explicit, it will not retain much value unless context information is somehow preserved. Externalization can take two forms: recorded or unrecorded knowledge.

• Tacit-to-tacit knowledge conversion, or “Socialization”, occurs by sharing experiences, by working together in a team and, more generally, by direct exchange of knowledge. Knowledge exchange at places where people socialize, such as around the coffeemaker or the water cooler, leads to tacit-to-tacit conversion.

The “knowledge life cycle” takes the path of knowledge creation/acquisition, of knowledge organization and storage, of distribution of knowledge, of application and reuse, and finally ends up in creation/acquisition again, in a sort of spiral-shaped process (Figure 3.1).

Tacit knowledge has to be made explicit in order to be captured and made available to all the actors of a given organization. This is accomplished with the aid of knowledge acquisition or knowledge creation tools. Knowledge acquisition builds and evolves the knowledge bases of organizations. Knowledge organization/storage takes place through activities by which knowledge is organized, classified and stored in repositories. Explicit knowledge needs to be organized and indexed for easy browsing and retrieval. It must be stored efficiently to minimize the required storage space.

Knowledge can be distributed through various channels such as training programs, automatic knowledge distribution systems and knowledge-based expert systems. Regardless of the way knowledge is distributed, making the knowledge base of a given organization available to its users, i.e., distributing the right information at the right place and time, is one of the most critical assets for today's companies.

3.2 Software tools for knowledge management

Knowledge management (KM) shall be supported by a collection of technologies for authoring, indexing, classifying, storing, contextualizing and retrieving information, as well as for collaboration and application of knowledge. A friendly front-end and a robust back-end are the basis for KM software tools. Involved elements include, among others, content management systems for efficiently publishing and sharing knowledge and data sources, indexing, classification and retrieval systems to ease access to information stored in knowledge bases, and e-Learning systems for allowing users to perform “Internalization”, possibly in a personalized way.

These technologies are still far from being completely semantic, i.e., based on context- and domain-aware elements able to fully support knowledge operations at a higher level, similar to what is usually done by humans while, at the same time, remaining accessible to machines.

The following subsections take as a reference three widely adopted technologies: content management systems (CMS), search engines and e-Learning systems. For each of them, the basic principles are discussed, together with “how and where” semantics can improve their performance. Performance is evaluated both from the functional point of view and from the knowledge transfer point of view.

For each technology, a separate subsection is therefore provided, with a brief introduction, a discussion of currently available solutions and some considerations about the integration of semantic functionalities. Finally, the shortcomings and solutions identified will define the requirements for a general-purpose platform able to provide semantics integration on the web with minimal effort.

3.2.1 Content Management Systems (CMS)

In terms of knowledge management, the documents that an organization produces and publishes represent its explicit knowledge. New knowledge can be created by efficiently managing document production and classification: for example, de-facto experts can be identified based on the authorship of documents. Document and, more generally, content management systems (CMS) enable explicit-to-explicit knowledge conversion.

A CMS is, in simple terms, a software system designed to manage a complete web site and the related information. It keeps track of changes, by recording who changed what, and when, and can allow notes to be added to each managed resource. A writer can create a new page (with standard navigation bars and images on each page) and submit it, without using HTML-specific software. In the same way, an editor can receive all the proposed changes and approve them, or he/she can send them back to the writer to be corrected. Many other functionalities are supported, such as sending e-mails, contacting other users of the CMS, setting up forums or chats, etc. In a sense, a CMS is the central point for “presentation independent” exchange of information on the web and/or on organizations' intranets.

The separation between presentation and creation of published resources is, indeed, one of the added values of CMS adoption. A CMS allows editors to concentrate their efforts on content production, while the graphical presentation issues are carried out by the system itself, so that the resulting web site is both coherent and homogeneously organized. The same separation allows one to better organize the web site navigation and to define the cognitive path that users are expected to follow when browsing the available pages.

CMS common features

Almost all CMSs help organizations to achieve the following goals:

• Streamline and automate content administration. Historically, Web content has consisted of static pages/files of HTML, requiring HTML programming experience and manual updating of content and design: clearly a time-consuming and labor-intensive process. In contrast, CMSs significantly reduce this overhead by hiding the complexities of HTML and by automating the management of content.

• Implement Web-forms-based content administration. In an ideal CMS, all content administration is performed through Web forms using a Web browser. Proprietary software and specialized expertise (such as HTML) are not required for content managers. Users simply copy and paste existing content or fill in the blanks on a form.

• Distribute content management and control. The Web manager has often been a critical bottleneck in the timely publication and ongoing maintenance of Web content. CMSs remove that bottleneck by distributing content management responsibilities to individuals throughout the organization. Those individuals who are responsible for content now have the authority and tools to maintain the content themselves, without any knowledge of HTML, graphic design, or Web publishing.

• Separate content from layout and design. In a CMS, content is stored separately from its publication format. Content managers enter the content only once, but it can appear in many different places, formatted using very different layouts and graphic designs. All the pages immediately reflect approved content changes.

• Create reusable content repositories. CMSs allow for reuse of content. Objects such as templates, graphics, images, and content are created and entered once and then reused as needed throughout the Web site.

• Implement central graphic design management. Graphic design in a CMS becomes template-driven and centrally managed. Templates are the structures that format and display content following a request from a user for a particular Web page. Templates ensure a consistent, professional look and feel for all content on the site. They also allow for (relatively) easy and simultaneous modification of an entire site graphic design.


• Automate workflow management. Good CMSs enable good workflow processes. In the most complex workflow systems, at least three different individuals create, approve, and publish a piece of content, working separately and independently. A good workflow system expedites the timely publication of content by alerting the next person in the chain when an action is required. It also ensures that content is adequately reviewed and approved before publication.

• Build sophisticated content access and security. Good CMSs allow for sophisticated control of content access, both for content managers who create and maintain content and for users who view and use it. Web managers should be able to define who has access to different types of information and what type of access each person has.

• Make content administration database-driven. In a CMS, static, flat HTML pages no longer exist. Instead, the system places most content in a relational database capable of storing a variety of text and binary materials. The database then becomes the central repository for content, templates, graphics, users and metadata.

• Include structures to collect and store metadata. Because data is stored separately from both layout and design, the database also stores metadata describing and defining the data, usually including author, creation date, publication and expiration dates, content descriptions and indexing information, category information, revision history, security and access information, and a range of other content-related data.

• Allow for customization and integration with legacy systems. CMSs allow customization of the site functionality through programming. They can expose their functionalities through an application programming interface (API) and they can coexist with, and integrate, already deployed legacy systems.

• Allow for archiving and version control. High-end CMSs usually provide mechanisms for storing and managing revisions to content. As changes are made, the system stores archives of the content and allows reversion of any page to an earlier version. The system also provides means for pruning the archived content periodically, preferably on the basis of criteria including age, location, number of versions, etc.

Logical architecture

A typical CMS is organized as in Figure 3.2.


Figure 3.2. A typical CMS architecture.

It is composed of five macro-components: the Editing front-end, the Site organization module, the Review system, the Theme management module and the Publication system.

The Editing front-end is the starting point of the publishing chain implemented by the CMS. This component usually allows “journalists”, i.e., persons who produce new content, to submit their writings. After submission, the new content is adapted for efficient storage and, if a classification module exists, it is indexed (either automatically or manually) in order to later allow effective retrieval of stored resources.

The Review system constitutes the other side of the submission process, and it is designed to support the workflow of successive reviews that occur between the first submission of a document and the final publication. Users allowed to interact with this component are usually more expert users, who have the ability/authority to review the content submitted by journalists. They can approve the submitted data or they can send it back to the journalists for further modifications. The approval process is deployed differently depending both on the CMS class (high-end, middle-level or entry-level) and on the redaction paradigm adopted: either based on a single review or on multiple reviews.

The Site organization module provides the interface for organizing the published information into a coherent and possibly usable web site. This module is specifically targeted at the definition of the navigation patterns to be proposed to the final users, i.e., at the definition of the site map. Depending on the class of the CMS system, the site organization is either designed by journalists, or it is proposed by journalists and subsequently reviewed by editors, possibly in a complete iterative cycle. Since many conflicting site maps may arise, even with few journalists, today's content management systems usually adopt the latter publication paradigm.

The Theme management component is in charge of the complete definition of the site appearance. Graphic designers interact with this module by loading pictures and graphical decorations, by defining publication styles, by creating dynamic and interactive elements (menus, for example), etc. The graphical aspect of a site is a critical asset both for the site's success (for sites published on the Web) and for the site's effectiveness in terms of user experience (usability and accessibility issues); therefore, care must be paid when developing themes and decorations. Usually the theme creation is not moderated by the CMS. In addition, many of the currently available systems do not allow on-line editing of graphical presentations. Instead, each designer shall develop his/her own theme (or part of a theme) and shall upload such a theme to the CMS as a whole element. The final publication is subject to the editor's approval; however, there are usually no means, for an editor, to set up an iterative refinement cycle similar to the document publication process.

The Publication system is both the most customizable and the most visible component of a CMS. Its main duty is to pick up and publish resources from the CMS document base. Only approved documents can be published, while the resources under review should remain invisible to the users (except to journalists and editors). The publishing system adopts the site map defined by means of the organization module, and stored in the CMS database, to organize the information to be published. Links between resources can either be defined at publication time or be automatically defined at runtime according to some (complex) rules. The graphical presentation of pages depends on the theme defined by graphic designers and approved by the editors. Depending on company needs and mission, pages can be completely accessible, even to people with disabilities (this is mandatory for web sites providing services to people, and is desirable for every site on the Web), or completely inaccessible.

In addition to the publication of documents edited by the journalists, the system can offer final viewers many more services, depending on the supported/installed submodules. So, for example, a typical CMS is able to offer multiple topic-centered forums, mailing lists, instant messaging, whiteboards between on-line users, cooperation systems such as Wikis or Blogs, etc.

Semantics integration

As shown in the preceding paragraphs, CMSs are, at the same time, effective and critical components for knowledge exploitation, especially for explicit-to-explicit conversion (Combination) and for explicit-to-tacit conversion (Internalization). They often offer some kind of metadata-management functions, allowing one to keep track of the authors of published data, of the creation, publication and expiration dates of documents, and of information for indexing and categorizing the document base. This information is only semantically consistent with the internal CMS database, i.e., it roughly corresponds to the fields of the CMS database. As shown by many years of research in database systems, this is actually a form of semantics; however, it is neither related to external resources nor to explicitly available models. Stored meta-information, although meaningful inside the CMS-related applications, will thus not be understandable by external applications, making the whole system less interoperable.

A Semantic Web system, as well as a future Wisdom Web system, relates, instead, its internal knowledge to well-known models, where possible, such as the Dublin Core for the authorship of documents. Even when sufficiently detailed models are not available, and must be developed from scratch, the way metadata is formalized follows a well-defined standard and is understood by all “semantic-aware” software and architectures. Given that all Semantic Web applications shall manage at least RDF/S and the related semantics, interoperability is automatically granted.

Indexing and searching the CMS base can also take advantage of semantic information. As for metadata (which is in turn strongly related to indexing and retrieval), current systems already provide several advanced facilities for allowing users to store and retrieve their data in an effective way. Unfortunately, the adopted technologies are essentially keyword-based.

A keyword-based information retrieval subsystem is based on the occurrence of specific words (called keywords) inside the indexed documents. The most general words, such as articles, conjunctions and the like, are usually discarded, as they are not useful for distinguishing documents, while the remaining terms are collected and inversely related to each resource in the CMS base. In the end, for each word a list of documents in which the word occurs is compiled.

Whenever a user performs a query, the query terms are matched against the term list stored by the retrieval subsystem, and the corresponding resources are retrieved according to properly defined ranking mechanisms. Regardless of the accuracy and efficiency of the different systems available today, they all share a common and still problematic issue: they are vocabulary dependent. Being based on keywords found in the managed document base, these engines are not able to retrieve anything if the same keywords do not occur both in the user query and in the stored documents.
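
A bare-bones Python sketch of such a subsystem (the stop-word list and the two documents are invented placeholders), showing both the inverted index and the vocabulary dependence just described:

from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in"}

docs = {
    "doc1": "the lion eats the gazelle",
    "doc2": "the gazelle runs in the savanna",
}

# Inverted index: word -> set of documents in which the word occurs
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index[word].add(doc_id)

def search(query):
    # Return the documents containing every query keyword
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    results = [index.get(w, set()) for w in keywords]
    return set.intersection(*results) if results else set()

print(search("gazelle"))   # {'doc1', 'doc2'}
print(search("stallion"))  # set(): no shared keyword, no result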

A more extensive discussion of these topics is available in the Information Retrieval systems subsection; however, it is easy to notice that semantics integration can alleviate, if not completely address, these issues by abstracting both queries and document descriptions to “semantic models”. The matching operation is, in this case, performed at the level of conceptual models, and if the conversion between either the query or the document content and the corresponding conceptual model has been performed effectively, matches can be found independently of vocabularies. This interesting capability of finding resources independently of vocabularies can be leveraged by “Semantic Web” CMSs to offer functionalities much more advanced than today's. Language independence, for example, can be easily achieved: a user can write a query in a given language, which will likely be his/her mother language, and can ask the system to retrieve data both in the query language and in other languages that he/she can understand. Matching queries and documents at the conceptual level makes this process fairly easy. The currently available systems, instead, can usually provide results only in the language of the query (in the rather fortunate case in which the query language corresponds to one of the languages adopted for keyword-based indexing).

Other retrieval-related functionalities include the contextual retrieval of “semantically related” pages during user navigation. When a user requests a given page of the site managed by a semantic CMS, the page's conceptual description is picked up and used to retrieve links to pages of the site that have conceptual descriptions similar to that of the requested page. This allows, for example, browsing the published site by similarity of pages rather than by following the predefined interaction scenario fixed by the site map.

The impact of semantics on CMS technology is not limited to the storage, classification and retrieval of resources. Semantics can also be extremely useful in defining the organization of published sites and in defining the expected workflow for resources produced and reviewed within the CMS environment. There are several attempts to provide the first CMS implementations in which content is automatically organized and published according to a given ontology. In the same way, ontologies are used for defining the complex interactions that characterize the document review process and the expected steps required for a document to be published by the CMS.

In conclusion, the introduction of semantics handling in content management systems can provide several advantages, both for document storage and retrieval and for site navigation and the site publication workflow.

3.2.2 Information Retrieval systems

Information Retrieval has been one of the most active research streams during the past decade. It still permeates almost all web-related applications, providing means, methodologies and techniques for easily accessing resources, be they human-understandable resources, database records or whatever. Information retrieval deals with the problem of storing, classifying and effectively retrieving resources, i.e., information, in computer systems. The find utility in Unix or Microsoft's small search dog are very simple examples of information retrieval systems. More sophisticated and probably more widespread examples are also available, Google [9] above all.

A simple information retrieval system works on the concepts of document indexing, classification and retrieval. These three processes are at the basis of every computer-based search system. For each of the three, several techniques have been studied, from the early heuristic-based approaches up to today's statistical and probabilistic methods. The logical architecture of a typical information retrieval system is shown in Figure 3.3.

Figure 3.3. The Logical Architecture of a typical Information Retrieval system.

Several blocks can be identified. The Text Operations block performs all the operations required for adapting the text of documents to the indexing process; as an example, in this block stop words are removed and the remaining words are stemmed. The Indexing block basically constructs an inverted index of word-to-document pointers. The Searching block retrieves from the inverted index all the documents that contain a given query token. The Ranking block, instead, ranks all the retrieved documents according to a similarity measure which evaluates how similar documents are to queries. The User Interface allows users to perform queries and to view results; sometimes it also supports relevance feedback, which allows users to improve the search performance of the IR system by explicitly stating which results are relevant and which are not. Finally, the Query Operations block transforms the user query to improve the IR system's performance. For example, a standard thesaurus can be used for expanding the user query by adding new relevant terms, or the query can be transformed by taking into account users' suggestions coming from relevance feedback.

Describing in detail a significant part of all the available approaches to information retrieval and their variants would require more than one thesis alone. In addition, the scope of this section is not to be exhaustive with respect to available technologies, solutions, etc. Instead, the main goal is to provide a rough description of how an information retrieval system works and a glimpse of the advantages implied by semantics adoption in Information Retrieval. For readers more interested in this topic, the bibliography section reports several interesting works that can constitute a good starting point for investigation. Of course, the web is the most viable means for gathering other resources.

For the sake of simplicity, this section proceeds by adopting the tf·idf weighting scheme and the vector space model as its guiding methodology, and tries to generalize the provided considerations whenever possible.

Indexing

In the indexing process, each searchable resource is analyzed to extract a suitable description. This description will be, in turn, used by the classification process and by the retrieval process. For now, we restrict the description of indexing to text-based documents, i.e., to documents which mainly contain human-understandable terms. In this case, indexing intuitively means taking into account, in some way, the information conveyed by the words contained in the document to be indexed. As humans can understand textual documents, information is indeed contained in them, in a somewhat encoded form. The indexing goal is to extract this information and to store it in a machine-processable form. In performing this extraction, two main approaches are usually adopted: the first tries to mimic what humans do and leads to the wide and complex study of Natural Language Processing. The second, instead, uses information which is much easier for machines to understand, such as the statistical correlation between occurring words, term frequency and so on. This latter solution is actually the one adopted by today's retrieval systems, while the former finds its application only in more restricted search fields where specific and rather well-defined sub-languages can be found. The tf·idf indexing scheme is a typical example of “machine level” resource indexing.

The base assumption of tf·idf, and of other more sophisticated methods, is that the information in textual resources is encoded in the adopted terms. The more specific a term is, the more easily the topic of a document can be inferred. So the main indexing operations deal with the words occurring in the resources being analyzed, trying to extract only the relevant information and to discard all the redundancies typical of a written language.

In the tf·idf case, the approach works by inspecting the document terms. First, all the words that usually convey little or no information, such as conjunctions, articles, adverbs, etc., are removed. These are the so-called stop words, and they typically depend on the language in which the given document is written. Removing the stop words allows frequency-based methods to be adopted without the data being polluted by non-significant information uniformly occurring in all the documents.

Once documents have been purged of stop words, the tf·idf method evaluates the frequency of each term occurring in the document, i.e., the number of times the word occurs inside the document, with respect to the most frequent term. In the simplest implementation of tf·idf, a vocabulary L defines the words for which this operation has to be performed.

Let t_i be the i-th term of the vocabulary L. The term frequency tf_i of the term t_i in a document d is defined as:

$$ tf_i(d) = \frac{\#(t_i \in d)}{\max_j \#(t_j \in d)} $$

where #(t ∈ d) denotes the number of occurrences of the term t in d.

The term frequency alone is clearly too simplistic a feature for characterizing a textual resource. Term frequency, in fact, is only a relative measure of how important a word is (statistically speaking) in a document; no information is provided on the ability of the given word to discriminate the analyzed document from the others. Therefore a weighting scheme shall be adopted which takes into account the frequency with which the same term occurs in the document base. This weighting scheme is materialized by the inverse document frequency term idf, which takes into account the relative frequency of the term t_i with respect to the documents already indexed. Formally:

$$ idf_{t_i} = \log\left(\frac{\#(d_k \in D)}{\sum_{d_k \in D} tf_i(d_k)}\right) $$

where D is the set of indexed documents.

The two values, i.e., tf and idf, are combined into a single value called tf·idf, which describes the ability of the term t_i to discriminate the document d from the others:

$$ tf \cdot idf_{t_i}(d) = tf_i(d) \cdot idf_{t_i} $$

The set of tf·idf values, one for each term of the vocabulary L, defines the representation of a document d inside the Information Retrieval system.
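
A direct transcription of these formulas into Python (a sketch only: the two-document corpus and the vocabulary are invented, and no attention is paid to efficiency):

import math

corpus = {
    "d1": ["horse", "horse", "saddle"],
    "d2": ["saddle", "leather"],
}
vocabulary = ["horse", "saddle", "leather"]

def tf(term, doc):
    # Frequency of `term` relative to the most frequent term in `doc`
    return doc.count(term) / max(doc.count(t) for t in set(doc))

def idf(term):
    # log of (number of documents / summed term frequency over the corpus)
    total_tf = sum(tf(term, d) for d in corpus.values())
    return math.log(len(corpus) / total_tf) if total_tf else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# The tf-idf feature vector of d1 over the vocabulary L:
print([round(tf_idf(t, corpus["d1"]), 3) for t in vocabulary])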

It must be noted that this indexing process is strongly vocabulary dependent: words not occurring in L are not recognized, and if they are used in queries they do not lead to results. The same holds for more complex methods where L is built from the indexed documents or from a training set of documents: words not occurring in the set of analyzed documents are not taken into account. So, for example, if in a set of textual resources only the word horse occurs, a query for stallion will not provide results, even if horse and stallion can be used as synonyms.

NLP is expected to solve these problems; however, its adoption in information retrieval systems still appears immature.


Classification and Retrieval

In this section, classification and retrieval are described in parallel. Although they are quite different processes, the latter can be seen as a particular case of the former, where the category definition is given at runtime by the user query. It shall be stressed that this section, like the preceding one, does not aim at being complete and exhaustive; instead, it aims at making clear some shortcomings of the retrieval process which are shared with classification and which can be improved by semantics adoption. As in the previous subsection, the tf·idf method and the Vector Space model [10] are adopted as reference implementations.

After the indexing process, each document in the knowledge base managed by an IR system has an associated set of features describing its content. In classification, these features are compared to a class definition (either predefined or learned through clustering) to evaluate whether or not documents belong to the class. In retrieval, instead, the same set is compared against a set of features specified by a user in the form of a query.

The retrieval (classification) process defines how the comparison shall be performed. In doing so, a similarity measure shall be defined, allowing the distance between document descriptions and user queries or category definitions to be measured quantitatively.

The similarity Sim(d_i, d_j) defines the distance, in terms of features, between resources in a given representation space. Such a measure is usually normalized: resources having the same description in terms of modeled features get a similarity score of 1, while completely dissimilar resources receive a similarity score of 0. Please note that a similarity measure of 1 does not mean that the compared resources are exactly equal to each other. The similarity measure, in fact, works only on the resources' features, and two resources can have the same features without being equal. However, the underlying assumption is that, although diverse, resources with similar features are “about” the same theme, from a human point of view. Therefore, the higher the similarity between two resources, the higher the probability that they have something in common.

The Vector Space model is one of the most widely adopted retrieval and classification models. It works on a vector space defined by the document features extracted during the indexing process. In the Vector Space model, the words belonging to the vocabulary L are considered as the basis of the vector space of documents d and queries q. Documents and queries are in fact expressed in terms of words t_i ∈ L and can therefore be represented in the same space (Figure 3.4).

Figure 3.4. The Vector Space Model.

Representing documents and queries (or class definitions) in the same vector space allows the similarity between these resources to be evaluated in a quite straightforward manner, since the classical cosine similarity measure can be adopted. In the Vector Space model, in other words, the similarity is evaluated as the cosine of the hyper-angle between the vector of features representing a given document and the vector representing a given query. Similarity between documents and classes can be evaluated in the same way. Formally, the cosine similarity is defined as:

$$ Sim(d_i, d_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\lvert \vec{d}_i \rvert \cdot \lvert \vec{d}_j \rvert} $$

where \vec{d}_i and \vec{d}_j are the feature vectors of d_i and d_j.
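
In code, the measure takes a few lines. A plain-Python sketch (the two feature vectors are made-up tf·idf weights, not values computed from a real corpus):

import math

def cosine_similarity(u, v):
    # Cosine of the angle between feature vectors u and v
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d1 = [0.69, 0.14, 0.0]  # hypothetical weights for (horse, saddle, leather)
d2 = [0.0, 0.29, 0.69]
print(round(cosine_similarity(d1, d2), 3))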

As demonstrated by the successful application of this model to many real-world case studies, the Vector Space model is quite an effective solution to the problem of classifying and retrieving resources. However, at least two shortcomings can be identified. First, the model assumes that the terms in L compose an orthogonal basis for the space of documents and queries; this is clearly not true, since words usually appear in groups, depending on the document (query) type and domain. Second, the approach is strongly influenced by the features extracted from documents, and since these are in most cases simple, vocabulary-dependent, syntactic features, the approach itself becomes syntactic and vocabulary dependent. As an example, suppose that a wrong vocabulary L_w contains the two words horse and stallion, and suppose that they are not identified as synonyms (although they actually are, in the analyzed knowledge domain). If one document consists of the single term horse and another of the single term stallion, the two are recognized as completely different. If a user specifies horse as the query keyword, only the first document is retrieved, while the second is completely missed, and vice-versa.

Semantics Integration

Semantics can play a great role in information retrieval systems, on the one hand addressing all the issues related to vocabularies and different terminologies, and on the other hand enabling users to perform “conceptual” queries. Conceptual queries specify the user's information need with high-level “concept descriptions” and do not use mere keywords, which can be imprecise and sometimes misleading (try searching for “jaguar” on Google: will you find cars or animals?).

With respect to the topics described in the previous sections, semantics can be integrated in the indexing process as well as in the retrieval process. In the indexing process, semantics can be adopted by mapping, i.e., by classifying, resources with respect to a formal ontology.

Clearly the problem of term dependence still remains: resources are indexed as before and, in addition, a classification task is performed. However, this dependence is somewhat mitigated, because the ontology acts as a bridge and as a merging point for the different vocabularies adopted by the IR system. Whatever language is used and whatever domain-specific vocabulary is adopted, the features resulting from indexing are always concepts, which are in turn language and vocabulary independent. The same holds for keywords in user queries: using an ontology as a semantic backbone, synonyms can easily be taken into account, for example by associating many keywords with each ontology concept (Lexicon).
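
At its simplest, such a Lexicon reduces to a keyword-to-concept dictionary. A minimal sketch (the concept identifiers and keyword lists are invented for illustration):

lexicon = {
    "horse":    "urn:onto:Horse",
    "stallion": "urn:onto:Horse",  # synonym, same concept
    "cavallo":  "urn:onto:Horse",  # Italian, same concept
    "saddle":   "urn:onto:Saddle",
}

def concepts_for(query):
    # Map query keywords to ontology concepts, ignoring unknown words
    return {lexicon[w] for w in query.lower().split() if w in lexicon}

# Both queries resolve to the same concept, and thus to the same documents:
print(concepts_for("horse"))     # {'urn:onto:Horse'}
print(concepts_for("stallion"))  # {'urn:onto:Horse'}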

In the retrieval process, semantics can bypass the synonym-related issues and can also provide new kinds of searches, which are in principle similar to the well-known category search. The user, in fact, can directly select the concepts modeled in the IR ontology as a query. Such a query is then easily converted into retrieved resources, without vocabulary and, most importantly, without domain-related problems. So, if a user is browsing a naturalistic web site and performs a search selecting the concept “jaguar”, only references to animals will be retrieved, the context being fixed by the site ontology, which is about nature.

Finally, semantics adoption can easily support queries for related documents and query refinement processes. Queries for related documents start from a sample page, possibly retrieved by the IR system in a previous query, and ask for similar pages stored in the IR system knowledge base. The ontology is again the cornerstone of the process, allowing resources to be found whose conceptual descriptions are similar to that of the sample page. The process is completely vocabulary independent and transparently supports cross-lingual operations.

In query refinement, instead, the user specifies a query that provides bad or not relevant enough results. The user can then refine his/her query by selecting new terms, more specific or more general. Ontologies can be of aid in this case too: the semantic relationships occurring between ontology concepts offer, in fact, a built-in method for query refinement. A semantic IR system can therefore easily allow users to widen or restrict their queries to find more relevant results. It can even run queries proactively, in order to respond to user demands in a timely fashion.

3.2.3 e-Learning systems

By its name, e-learning can best be understood as any type of learning delivered electronically. Defined broadly, this can encompass learning products delivered by computer, intranet, internet, satellite, or other remote technologies. Brandon Hall, a noted e-learning researcher, defines e-learning as “instruction delivered electronically wholly by a web browser, through the Internet or an intranet, or through CD-ROM or DVD multimedia platforms”.

E-learning is sometimes classified as synchronous or asynchronous. Both terms refer to “the extent to which a course is bound by place and/or time”, according to The Distance Learner's Guide (Prentice Hall: 1998). Synchronous simply means that two or more events occur at the same time, while asynchronous means that two or more events occur “not at the same time”. For example, when someone attends live training, as in a class or workshop, the event is synchronous, because the event and the learning occur simultaneously. Asynchronous learning occurs when somebody takes an on-line course in which he/she completes events at different times, and when communication occurs via time-delayed email or discussion list postings.

Today, many applications of e-learning principles and systems are emerging and are slowly pervading all aspects of organizations' lives, be they big industries, small enterprises, or even educational institutes. E-learning is currently adopted in a variety of different situations. For example, it is used to:

• Deliver introductory training to employees, customers, or other personnel

• Offer refresher or remedial training

• Offer training for credentialing, certification, licensing, or advancement

• Offer academic or educational credit via college and university on-line learning

• Promote and inform an audience about products, policies, and services

• Support organizational initiatives by increasing motivation through easily accessible learning

• Offer orientation to geographically dispersed personnel


• Create a variety of essential and nonessential learning opportunities for personnel

• Provide coaching and mentoring through on-line instruction and collaboration

• Build communities of practice using distributed on-line training and communication

• Standardize common training through fixed content accessible to all users

As with any learning medium, the use of e-learning offers benefits as yet not realized in traditional training, while also presenting new risks to both producers and users. On the positive side, e-learning products:

• Energize content with illustrations, animations, and other media effects

• Offer increased fidelity to real-world application through scenarios and simulations

• Enable just-in-time, personalized, adaptive, user-centric learning

• Offer flexibility and accessibility

• Engage inexpensive distribution capabilities to reach a potentially worldwide audience

• Create stability and consistency of content due to the ease with which revisions can be made

• Standardize content by centralizing knowledge and information in one format

• Cross multiple platforms of web browsing software

• Are less expensive to produce and distribute on a large scale than traditional training

• Eliminate travel and lodging expenses required for traditional, in-person training

• Encourage self-paced instruction by users

• Support increased retention and improved comprehension of content

• Lend themselves to streamlined, easily scalable management and administration of courses and users

Nevertheless, like any other training format, it also has disadvantages and risks associated with its production and use. Before committing to e-learning, one must consider the following:

• Access sometimes varies based on user capabilities

• Internet bandwidth limitations and slow connection speeds sometimes hamper performance

• User reaction and participation often depends on the level of individual computer literacy

• Development costs can exceed initial estimates unless clear production goals are established

• Not all content is suitable for delivery via e-learning

• The loss of human instructor contact may be disconcerting to users

• Industry standards for development and delivery are still emerging

• Implementation is challenging if not well-planned in advance of development

Careful thought and planning must go into a decision to purchase, implement, and utilize e-Learning products, whether bought off-the-shelf or customized for specific purposes. This is often referred to as the “build or buy” decision. In either case, organizations considering e-Learning should conduct a comprehensive analysis of their needs, goals, education or training plans, and their current infrastructure to determine if e-Learning is a suitable pursuit.

e-Learning standards

All the major features of e-Learning, i.e., the ability to customize courses, to track progress, and to offer “just-in-time” learning opportunities, are only feasible if the basic infrastructure of an e-Learning component is designed to be interoperable and communicates with components from a variety of sources. These elemental units are usually called “learning objects” and are the basis for the standardization movement.

There are numerous definitions of a learning object, but it is basically a small “chunk” of learning content that focuses on a specific learning objective. These learning objects can contain one or many components, or “information objects”, including text, images, video, or the like. Reusability shall be supported both at the learning object and at the information object levels; by standardizing the way in which these objects are built and indexed, both learning objects and information objects become easy to find and use.

Standardization, currently occurring within the IEEE LTSC working groups, is acting on 5 areas: data and metadata, content-related issues, learning management systems and applications, learner-related issues and technical standards.

Data and Metadata: metadata is defined as “information about information”. In e-Learning, metadata describes learning objects including attributes such as author, subject, date, etc. Metadata is expected to enable learning objects to be more easily indexed and stored in content repositories, and subsequently found and retrieved. The standards in this area (a schematic metadata example follows the list):

• Specify the syntax and semantics of learning object metadata, so that the objects can be described in a standardized format that is independent of the content itself. The standards also specify the fields to be used when describing learning objects;

• Facilitate the translation of human languages (to re-purpose the content for use in different cultures);

• Define a semantic framework that allows the integration of legacy systems and the development of data exchange formats;

• Provide a common lightweight protocol for exchanging data among clients, servers and peers.
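As an illustration, a metadata record in the spirit of the IEEE Learning Object Metadata (LOM) work might look like the sketch below; the element names and values are simplified, illustrative assumptions, not the normative binding:

<lom>
  <general>
    <title>Introduction to Taylor expansions</title>
    <language>it</language>
    <description>A first module on Taylor expansions for a calculus course.</description>
  </general>
  <lifeCycle>
    <contribute role="author">Mario Rossi</contribute>
    <date>2005-10-12</date>
  </lifeCycle>
  <classification>
    <keyword>calculus</keyword>
  </classification>
</lom>

A repository can index such a record independently of the content it describes, which is exactly what makes learning objects easy to store, find and retrieve across systems.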

Content-related issues: standards related to content used in learning exist basically to inform users of what they are getting, how they are getting it, and the best way to use it. It is possible to think of these standards as the “How-to” manuals for learning content. The specific standards focus on:

• The language used to describe and reference the various media components (e.g., audio, video, animations). This will be useful in establishing the means for the portability of the components from one system, or tool, to another;

• A mechanism for managing and adapting the presentation of lessons according to the needs of the learner in order to dynamically create customized instructional experiences;

• The packaging of learning content to allow simple transmission and activation of learning objects.

Learning management systems and applications: learning management systems play an important role in the facilitation of a learning object strategy. The management system serves as a type of gateway where content enters, is assembled into meaningful lessons based on the learner’s profile, and is presented to the learner, whose progress is then tracked by the management system. It is crucial, therefore, that the system is able to operate with content and tools from multiple sources. The standards developed in this category:

• Allow lessons and courses to move from one computer managed instruction (CMI) system to another, while maintaining their ease of use and functionality;

• Make it easier for learning technologies to be implemented on various types of browsers and operating systems;

• Establish a protocol that aids in the communication between the software tools that a learner is using (e.g. text editors and spreadsheets) and the instructional software agents that provide guidance to the learner.

Learner-related issues: the main purpose of developing standards is to create a more effective and efficient way for people to learn using technology. The learner-related standards focus on creating a connection between the learner and the technology (and its developers). Groups involved in this effort are working to create standards that aid in characterizing, identifying, and tracking learners, and in profiling their competencies. The availability of this information enables the development of more appropriate instruction for the learner. More specifically, the standards deal with:

• The language used in a “Learner Model” that will maintain a characterization of the learner, including such attributes as knowledge, skills, abilities, learning styles, records, and personal information. This creates a sort of electronic learning portfolio that can be used by the learner throughout his lifetime to enhance learning experiences;

• A means for identifying learners for sign-on and record-keeping purposes;

• The components of a user-centered system that aid in the process of managing life-long learning. Such a system will help with goal-setting, planning, execution, tracking, and documentation in order to provide learners with guidance that helps them achieve independence in reaching goals, as well as provide documentation of achieved competences.

Technical standards: the standard currently adopted for the exchange of e-Learning information over the Internet is XML, whereas HTML is the preferred language for telling a system how to present (format) content on a page. Many innovations are currently under study, in particular concerning learning object characterization and transmission, and they are somewhat related to the standardization efforts just cited.

Semantics Integration

In order to understand what contribution semantics can provide to e-Learning applications, a “near future” scenario can be analyzed, extracting the important points in which the availability of explicit semantics makes the difference with respect to nowadays’ solutions.

Imagine that you are studying Taylor expansions in mathematics. Your teacher has not yet provided the relevant links to the involved concept in your semantics-enabled learning framework, so you first enter “Taylor expansions” in the classical search form provided by the system. The result list shows that Taylor expansions occur in several contexts of mathematics, and you decide to have a look at Taylor expansions in an approximation context, which seems most appropriate for your current studies.

After having dwelled a while on the different kinds of approximations, you decide you want to see if there are any appropriate learning resources. Simply listing the associated resources turns out to return too many results, so you quickly draw up a query for “mathematical resources in Italian related to Taylor expansions that are on the university level and part of a course in calculus at an Italian university”.

Finding too many resources again, you add the requirement that a professor at your university must have given a good review of the resource. You then find some interesting animations provided as part of a similar course at a different university, where they have been annotated in the personal portfolio of a professor at your university, and start out with a great QuickTime animation of Taylor expansions in three dimensions.

The movie player notes that you have a red-green color blindness and adjusts the animation according to a specification of the color properties of the movie, which was found together with the other descriptions of the movie. After a while you are getting curious. What, more precisely, are the mechanisms underlying these curves and surfaces? You decide you need to manipulate the expansions more interactively. So you take your animation, and drag it to your graphing calculator program, which retrieves the relevant semantic information from the learning object via the application framework, and goes onto the Web looking for mathematical descriptions of the animation. The university, it turns out, never provided the MathML formulas describing the animations, but the program finds formulas describing a related Taylor expansion at the MIT OKI site. So it retrieves the formulas, opens an interactive manipulation window, and lets you experiment.

Your questions concerning Taylor expansions multiply. You badly feel the need for some deeper answers. Asking the learning system for knowledge sources at your own university that have announced interest in helping out with advanced Calculus matters, you find a fellow student and a few math teachers. Deciding that you want some input from the student before talking to the teachers, you send him/her some questions and order your calendaring agent to make an appointment with one of the teachers in a few days.

A week later you feel confident enough to change the learning objective status for Taylor expansions in your portfolio from ’active, questions pending’ to ’resting, but not fully explored’. You also mark your exploration sequence, the conceptual overviews you produced in discussions with the student and some annotations, as public in the portfolio. You conclude by registering yourself as a resource on the level ’beginner’ with a scope restricting the visibility to students at your university only.

In this scenario some points can be identified where semantics integration plays a crucial role:

• Distributed material and distributed searches. Here semantics helps by eliminating vocabulary-related issues as well as interoperability issues. Describing learning objects through well known standard languages such as XML, RDF/S and OWL allows rapid exchange of information between different applications (see the RDF sketch after this list). Moreover, the ability to share a common model of the domain knowledge, or the ability to link each application model (ontology) to a well known and shared ontology, enables easy interoperation between learning systems and frameworks.

• Combination of metadata schemas, for example, personal information and content descriptions. In a fully semantic learning framework the information about the user’s profile and preferences can be integrated as semantic filters for the retrieval process, restricting the search results to the subset which satisfies both the user information need and the user fruition model.

• Machine-understandable semantics of metadata: so that a machine is able to correctly interpret constraints, calendaring info, for example, and to find correct resources among the available learning objects and learning information.

• Human-understandable classification of metadata. The user is able to directly specify the context of searches (by clicking the involved concepts) and the persons with which he/she wants to interact, and is able to understand the classification of available resources, thus being able to select the best suited results.

• Interoperability between tools. As the basic semantics of RDF/S and of OWL must be understood by every application which is able to manipulate these representations, interoperability is automatically granted.

• Distributed annotation of any resource by anyone, in this case using digital portfolios. Every resource in RDF/S and OWL has a unique identifier, therefore annotations can be about whatever is needed. Attributes can define the annotation author, the annotator trust level, experience, etc., offering the basic infrastructure for building a potentially world wide collaboration framework.

• Personalization of tools, queries and interfaces, affecting the experience in several ways. Semantic metadata is not only focused on describing the content of learning materials, it can also be used to describe the physical features of the same data. So, for example, a video can be recognized as problematic for red-green blindness, and its color parameters can be adjusted according to a user profile in order to avoid vision problems.

• Competency declaration and discovery for personal contacts. A complete semantic characterization of e-Learning environments also includes information about levels of competence and trust mechanisms. Users are therefore able to design their own interactions by selecting the profiles of other users/teachers on the basis of human-kind values such as competence, trust, sympathy, kindness, etc.
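To make the first point concrete, the sketch below describes a learning object in RDF; the resource URI, the ontology URI and the choice of Dublin Core properties are illustrative assumptions:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/lo/taylor-3d-animation">
    <dc:title>Taylor expansions in three dimensions</dc:title>
    <dc:language>it</dc:language>
    <dc:format>video/quicktime</dc:format>
    <dc:subject rdf:resource="http://example.org/onto/math#TaylorExpansion"/>
  </rdf:Description>
</rdf:RDF>

Because dc:subject points to a shared ontology concept rather than to a free-text keyword, two learning systems that agree on the ontology can exchange and combine such descriptions without vocabulary mismatches.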

Chapter 4

Requirements for Semantic Web Applications

The integration of semantic functionalities in real world web applications requires a careful analysis of the requirements that such functionalities should satisfy, finding the best trade-off between user requirements, site publisher requirements and developer requirements. Requirements are usually categorized as functional or non-functional. The former express actions that a system should perform, defining both the stimulus and the expected response (input and output); in other words, they identify things that systems shall do. Non-functional requirements, instead, address the different facets of system deployment such as performance, usability, robustness, security, hardware and so on.

In order to design a successful integration of semantic functionalities into web applications, requirements must be gathered by involving several groups of people: developers, end users, site publishers, content editors, etc. Requirements are, actually, a contract between these diverse groups of people, and a proper representation of the involved parties is important.

4.1 Functional requirements

Requirements for semantic-aware web sites and applications have been gathered, in this thesis, by interviewing the different actors involved in the site business logic: publishers, authors, developers. These requirements have then been prioritized according to users’ and publishers’ needs. A separation has been derived between functionalities actually needed and extensions or innovations desired/expected from the availability of semantic processing tools in the standard site development and publication work flow. The output of the entire process was as follows:

1. The system shall evaluate the relatedness of two resources on the basis of their content, in terms of ontology concepts. A page about wheel-chairs and a page about physical barriers shall be related, since the presence of physical barriers prevents the access to a given facility for people using wheel-chairs.

2. The system shall serve different users, of different nationalities: all required services must support the use of different languages.

3. The system shall enable cross-lingual queries, i.e. queries written in a given language that provide results in different language(s). As an example, a user might specify a query in English and require results in English, Italian and French.

4. The system shall provide a directory-like view of the ontology, for the knowledge domain in which it works.

5. When a user selects a category label, a semantic web application shall provide the resources classified as “pertaining to” the category (ontology concept).

6. A semantic site shall provide semantic what’s related functionality: whenever a user browses a page, the publication system shall provide a selection of pages (no more than 10) that are related to the viewed page.

7. The system shall provide a classical (textual) search functionality based on resource contents. That is to say, it shall provide a search engine able to work on synonyms and to provide results even if the words in the query do not occur in the indexed resources.

8. The system shall allow for manual classification of resources with respect to a given conceptual domain. The content editor (journalist) must therefore be enabled to easily specify the concepts for which a resource is relevant.

9. For each selectable concept, the system shall provide a label and an extended description, localized in the user language. This facilitates the user’s comprehension of the domain model, thus reducing misclassified resources.

10. The system shall facilitate manual classification of resources by providing the available concepts for classification (the old keywords, in a sense) and by only allowing selection of concepts actually present in the ontology.

11. The system shall support semi-automatic classification of resources, by providing suggestions for the possible classification of a given resource.

12. The system should support automatic classification of resources.

13. The system shall be able to classify both owned and non-owned resources. Therefore a site using the system should be able to offer conceptual searches both on its internal resources and on resources of other related sites.

14. In response to a specific user setting, the system should provide semantic and transparent search functionalities. In other words, the system should perform searches in the background, while the user is surfing the site, taking advantage of the information coming from the user navigation to generate meaningful queries.

15. In the transparent search mode, the system should provide additional information, i.e. retrieved pages, if and only if the relevance of results, as perceived by the system, is over a reasonable threshold. Such a threshold should be set by the user and can be modified by the user during the site navigation (in a way like the “Google Personalized” system).

16. The system can provide a relevance feedback plug-in for the most diffused web browsers (Mozilla and Internet Explorer at least) where the user can view the classifications of visited pages as deduced by the system and can correct them.

17. The system should guide the logical organization of a site, by proposing a suitable location for a new page in the site map, depending on the conceptual classification of that resource. For example, a new page about municipality aids for people with disabilities should be easily accessible from the pages about the municipality services and from the pages about disability aids.

The highest priority needs resulting from the gathering phase concern the search-related part of web sites, mainly because of the poor performance of today’s syntactic search engines. Performance is perceived as poor despite the high precision and recall values of such search engines, owing to the syntactic nature of this technology. In fact syntactic engines are not able to retrieve resources on the basis of their conceptual content; they can only address retrieval by using the occurrence of a finite set of keywords that may not appear in a user query.

The first requirement states that the similarity between two resources, and the similarity between queries and resources, must be evaluated on the basis of resource content in a semantics-rich way. The immediately following requirement is, surprisingly, not directly concerned with the search task but deals with multilingualism. In the era of global services the ability to handle service requests and responses in different languages is perceived as a critical factor. Moreover, the capability to perform cross-language operations, i.e. operations whose triggering is in one language and whose result is in another language, is an added value that many web operators would like to offer to their users. Multilingualism is, in a sense, completely independent of the adoption of semantics in web applications; however, it is much simpler to obtain when the main business of searching, classifying and matching is performed at a conceptual level, which is language independent.

Requirements from 4 to 7 are again related to search functionalities: in a few words, they state that the limits of syntactic search engines can be surmounted by adding semantics to the resource classification. These new semantic-based search engines must adopt new, powerful interfaces able to exploit their full potential. Such interfaces must not be very dissimilar from the traditional ones, in order to maximize the user interaction. The envisioned interaction models therefore include: the classical keyword-based query interface, the directory search, and only one “new” interaction paradigm, called semantic what’s related, in which, for each page requested by a given user, a set of links to related pages is provided, where the relation between pages is defined on the basis of their conceptual descriptions.

The other side of the medal, i.e. the classification task, is addressed in the requirements labeled from 8 to 13 and refers to three different degrees of automation in the semantic classification of resources. The first requirement assumes that, to provide results relevant to a human user, the classification must be performed by a human. He/she manually annotates the published resources by creating associations between such resources and a model of the conceptual domain into which the web application is deployed (ontology).

The following requirements state that human classification is actually the only way to ensure the provision of “meaningful” results, as said by the 8th one. However this task could sometimes be overwhelming, especially if the conceptual domain is vast and complex. In such cases, intelligent systems must be of aid by doing the hard work and by only involving humans for approval, modification or rejection of automatically extracted classifications.

Requirements 12 and 13 tackle those situations in which the content publisher cannot carry out the classification task, for example because he is not the author of the resources. In such a case, the system is entrusted to automatically categorize the resources by providing classification results similar to the ones that a human would provide for the same resources.

Eventually, requirements from 14 to 17 refer to those functionalities that would constitute an added value for semantic web applications but that are not critical, i.e. that are not compulsory for the successful integration of semantic functionalities into web sites. Such requirements have not been addressed in the work presented in this document; however, they are the object of the future work trends reported in the ending section of the thesis.

4.2 Use Cases for Functional Requirements

In this section three interaction scenarios between a semantic web application and a user (in a wide sense) are reported, addressing in more detail the “semantic what’s related”, the “directory search” and the “semi-automatic classification” tasks. The widely adopted formalism of use cases is used.

4.2.1 The “Semantic what’s related”

The use case (Figure 4.1) is deployed as follows: a user surfs a given page on the web site under examination; the publication system detects the required page and then extracts the conceptual description (index) associated to the page.

Before providing the page to the user browser, the system queries its internal knowledge base for resources tagged with conceptual descriptions similar to the one of the page being requested. It then evaluates the relevance of retrieved results and filters out the ones below a given confidence threshold. In the end, it ranks the remaining pages using a semantic similarity measure and appends to the page required by the user a set of links to conceptually similar resources.
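As a hedged illustration of this ranking step (the similarity measure actually adopted by the platform is introduced later in the thesis; the cosine form below is just one plausible instantiation), let v(d) denote the vector of annotation weights of page d over the ontology concepts. Two pages can then be compared as:

sim(d1, d2) = ( v(d1) · v(d2) ) / ( ‖v(d1)‖ · ‖v(d2)‖ )

Pages whose similarity with the requested page falls below the confidence threshold are filtered out, and the surviving ones are ranked by decreasing sim.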

The amount of provided links shall be designed so as not to impact the page fruition process too much. Provided information, in fact, must not divert the user’s attention from the page content. Moreover, the number of provided links must be small enough to be managed by the user, say no more than ten, according to standard usability rules.

Figure 4.1. The “What’s related” use case.

4.2.2 The “Directory search”

The directory search is an almost standard and well known way of performing searches on the web: the majority of web search engines such as Google [9], Yahoo [11] and Altavista [12] adopt this interaction paradigm for providing thematic access to indexed resources.

A semantic directory is similar to a classical directory: from the user’s point of view the changes are negligible; however, in this case, resources belong to categories in a more dynamic way. The directory is, in fact, a tree representation of an ontology, obtained by taking into account only hierarchical relationships. As the association between resources and ontology concepts is “fuzzy”, i.e. resources could refer to different ontology concepts, possibly with different association strengths, the same resource could belong to different directory branches depending on its conceptual description, taking also into account the non-hierarchical relationships defined in the ontology.
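As a hedged sketch of how such a tree can be derived (class names, the property and the namespace handling are illustrative assumptions; declarations are omitted for brevity), the directory view keeps only the rdfs:subClassOf links, while non-hierarchical relationships are used for the fuzzy placement of resources rather than for shaping the tree:

<rdfs:Class rdf:about="#Disability"/>
<rdfs:Class rdf:about="#MotorDisability">
  <rdfs:subClassOf rdf:resource="#Disability"/>
</rdfs:Class>
<!-- non-hierarchical relationship: ignored when building the tree view -->
<rdf:Property rdf:about="#preventsAccessFor"/>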

The use case (Figure 4.2) is deployed as follows: a user searches the semantics-aware web site for pages about “disability”. In order to perform this task, he/she selects the directory service of the site, where resources are subdivided into homogeneous thematic sets. He/she searches for “disability” and subsequently selects the corresponding directory entry for retrieving resources. The system receives the request, maps the directory entry to the corresponding ontology concepts and extracts the most relevant resources with respect to that query. In other words, the system identifies the involved ontology concepts and searches its knowledge base for resources annotated as “about” those concepts. Then it ranks the available data according to the semantic similarity with the user query (the selected concepts) and finally it provides the results as a set of hyperlinks.

Figure 4.2. The “Category search” use case.

4.2.3 The “Semi-automatic classification”

In this use case the actor is the content redactor of a semantic web site. The redactor accomplishes the daily tasks of writing new content, of reviewing pages that are in the wait list for publication, and of establishing the next steps in the site editorial activity.

This use case (Figure 4.3) is focused on the process of creating new content and is specifically devoted to clarifying the content classification phase.

When a redactor has completed the editing of a new resource (article), he needs to classify the new resource, taking into account the site knowledge base, in order to allow users to retrieve and navigate the site pages according to their conceptual content (see the previous use cases).

The classification is a manual process in this use case: the redactor has access to the ontology and is required to select the concepts that are relevant with respect to the newly created content, possibly specifying a measure of “relatedness” in the range between 0% and 100%. To perform such a task, he can navigate a tree-like ontology representation, and select the relevant concepts.

For each concept a simple interface allows him to specify the degree of correlation between the article being classified and the concept, providing four choices for the relation strength: low (25%), middle (50%), high (75%) and very high (100%).

As the ontology could contain hundreds or thousands of different and complexly related concepts, the classification task easily becomes infeasible both in terms of time consumption and of cognitive overload on the redactor. The system must therefore provide some aids to the classifier, while retaining as much as possible the quality of the classification.

This can be done by quickly scanning the resource under examination and by providing suggestions to the redactor. The system does not need to perform a full and accurate classification of the resource being edited, but needs only to extract the most relevant conceptual areas that seem to be related to that resource. It also has the ability to modify the ontology navigation interface by highlighting the suggested concepts, thus allowing its human counterpart to refine the proposed annotations or even ignore them and proceed with the usual, manual, classification.

Once all the relevant ontology concepts have been selected, the redactor submits the new article to the system, which stores both the content and the classification, allowing users to semantically query the site for relevant resources, including the one just loaded.

Figure 4.3. The “Semi-automatic classification” use case.

4.3 Non-functional requirements

Non-functional requirements are a fundamental part of the design process, in whatever application, since they guide the design by defining the physical constraints that fix the boundaries of the application for what concerns its deployment in the final work scenario. They are usually subdivided into different homogeneous areas such as usability, performance, security, robustness, etc.

In this thesis, the main focus is on the usability and performance requirements, because one of the goals of this work is to provide a system for easily adding semantics to already deployed web sites. Clearly, security issues and system robustness are important as well; however, the scope of this work is much more focused on providing ready-to-use technology rather than absolutely “safe” solutions. The design priority, in other words, is to provide a system for immediate use, trying to overcome, although in a small scenario, the “semantic exploitation” problem.

Gathered requirements reflect exactly this choice and are subdivided into “usability” requirements and “performance” requirements. The former are:

1. The system deployment should virtually not require a downtime of the publication system that is being integrated, or at least should constrain the downtime, in normal operational conditions, to a few hours.

2. The system should allow the integration of semantic functionalities, reducing as much as possible the additional effort required for specifying semantic descriptions and, more in general, reducing the changes in the publication work flow from the user/redactor point of view.

3. The system should be integrable into different site publication systems, independently of the server side technology adopted for content publication.

4. The system should allow the reuse of already existing technologies: databases, servers, etc.

5. The system should be platform independent.

6. The system should allow the manipulation of different data formats, at least HTML, XHTML and plain text.

7. The system should be extensible, so as to incorporate new functionalities and to allow the handling of different resource types such as multimedia resources.

The latter, instead, include the following:

1. The system should scale up from sites with few pages (around one hundred) to really big sites with thousands of pages.

2. The system should be time effective, i.e., the system should provide results in reasonable time, even when it is overloaded by several concurrent requests. The response time is effective if contained within the user attention time-frame, which for this kind of application is around ten seconds.

The above reported requisites for semantics integration in today’s web applications can essentially be summarized by the words “easy integration”. The highest priority requirements state, in fact: a semantics-aware system shall be deployable without requiring extra downtime of the site publication system that it integrates. A semantic system should be transparent, i.e. the system presence should be unnoticeable for both editors/publishers and end users. A semantic system shall not conflict with already deployed technologies and shall be accessible from whatever publication framework.

These requirements fall in the “usability” requirements class; in the same class there are also some low priority requirements that represent further evolutions or specifications of what is intended in the high priority requirements.

From these last, in fact, emerges the necessity to reuse as much as possible existing technologies such as databases, web servers, etc. (see requirement 4) and the necessity to design platform independent systems (requirement 5). Finally, some requirements tackle the ability of a semantic system to handle many media. Web people (content publishers, editors, redactors, users) are in fact more and more aware of the impact that new media have on clients and end users. As a consequence, while traditional technologies such as HTML pages and their evolutions (DHTML, XHTML, ...) are naturally included in semantic elaboration, new information media such as audio streams, videos, DVDs and multimedia in general shall also be taken into account and possibly supported.

Performance issues deserve at least some attention in the design process: in fact, once the other requirements are fixed, performance can consistently affect the effectiveness of a semantic system and can strongly influence its adoption in real world applications. Scalability and timely responses (requirements 1 and 2 under performance) are therefore high priority requirements that must be satisfied to fill the gap that still persists between academic applications and real world solutions.

Chapter 5

The H-DOSE platform: logical architecture

This chapter introduces the H-DOSE logical architecture, and uses such architecture as a guide for discussing the basic principles and assumptions on which the platform is built. For every innovative principle the strong points are evidenced, together with the weaknesses that emerged either during the presentation of such elements in international conferences and workshops or during the H-DOSE design and development process.

The requirements analyzed in the previous chapter are at the basis of the design and implementation of a semantic web platform specifically targeted at offering low-cost semantics integration for already deployed web sites and applications. Such an integration is specifically oriented to information-handling applications such as CMSs, Information Retrieval systems and e-Learning systems.

Designing a complete semantic framework involves roughly two levels of abstraction, the first one being concerned with the so-called logical design while the second is more focused on practical implementation issues and is called deployment design. In this chapter the logical design of the H-DOSE¹ platform is addressed, while the deployment design is tackled in Chapter 6. H-DOSE stands for Holistic Distributed Semantic Elaboration Platform. The reasons for calling it holistic will become clear in the following sections.

¹The name H-DOSE comes from “Holistic DOSE”, since it integrates and reconciles different points of view (semantic web, web services, multi agent systems, etc.). It is commonly pronounced as “High-Dose”.

5.1 The basic components of the H-DOSE semantic platform

Describing a semantic platform in a sound, complete and unique way is nearly impossible, since there are many opinions and ideas on what services a semantic platform should provide. In this thesis, a semantic platform is not considered as a general purpose framework but as a solution strongly oriented to information management. In particular, it is assumed that the main scope of the described platform is to provide support for indexing, classification and retrieval of web pages, written either in HTML, XHTML or in plain text.

Under such quite restrictive conditions a semantic platform can be depicted as composed of three main elements: one or more ontologies, a set of semantic descriptions (or annotations) and a set of textual resources (Figure 5.1).

Figure 5.1. The logical architecture of the H-DOSE semantic platform.

Recalling the definition given in Chapter 2, an ontology can be defined as “an explicit and formal specification of a conceptualization”. It is composed of concepts (or classes), from a given knowledge domain, and of relationships which relate the ontology classes to each other. The whole combination of concepts and relationships models a knowledge domain and defines the set of topics that a semantic platform can manipulate.

The first difference between a syntactic information management application and a semantic one is the scope of managed resources. While the former can virtually manage whatever resource, given that appropriate keywords exist, the latter cannot. In fact, the ontology that provides the means for abstracting information processes from syntax, achieving terminology independence, also fixes the limits within which such abstraction can work. Ontology-based applications are then domain specific, while syntactic applications can be all-encompassing. However, in the scope of this thesis, this is not a limitation, since CMSs, e-Learning systems and the related Information Retrieval systems are naturally domain specific. After all, the on-line learning facilities of a University, for example, provide data corresponding to the courses offered by the same Institution, which are always well known and limited to a single, even if broad, knowledge domain.

Resources are, in the most general sense, “all things about which someone wants to tell something”. In a semantic platform this general definition is restricted to “all the resources, in a given knowledge domain, about which something can be said by someone”. So the first restriction is again on the domain, which must be specific, as defined by the platform ontology. In addition, in this thesis a more restrictive assumption is made: resources are “textual resources, either written in HTML, XHTML or in plain text, that shall be published on a web site or that shall be used by a web application”. For the rest of this document, then, resources will be texts, unless otherwise specified.

Finally, descriptions, or annotations, are the “semantic bridges” between syntactic resources, i.e. texts, and semantic entities, i.e. ontology concepts. Annotations can be expressed as simple triples in the form “the text A is about the concept C”, or they can include more complex information such as, for example, the strength of the “about” relationship. Annotations, in the working scenario of this dissertation, are defined either by journalists/redactors or by the semantic platform. They define the mutual relations and similarities between resources, and between resources and queries. Final users of the semantic web application are, in general, not allowed to define annotations. This assumption makes it possible, in the subsequent sections, to leave aside trust-related issues and the like.

5.1.1 Ontology

The H-DOSE semantic platform considers multilingualism a critical issue that must be addressed, as stated by functional requirements 2 and 3 reported in Chapter 4, Section 4.1.

Multilingualism issues in ontology-based applications are still active research topics in the Semantic Web community. For solving them, two main approaches are currently under investigation: the first is based on the integration of language-specific ontologies via ontology merging techniques, while the second assumes that a common set of concepts exists that could be shared by different languages. According to the first approach, people speaking different languages will model a given domain area by defining different ontologies that will be merged to provide a multilingual semantic environment. The resource requirements for language management will be comparable, in terms of human and hardware resources, to those of ontology merging, actually being the same task.

Moreover, as many aspects of the predefined knowledge area will be common to all languages, many redundant “synonymous” relationships between language specific ontologies will be defined, increasing resource waste and complexity. Developing a multilingual semantic framework using the above exposed methodology therefore involves the risk of getting an unmanageable entity as an outcome, in which great care is required to define relationships between “equivalent ontologies” and to track changes and coherently update those relations.

The second methodology addresses multilingualism issues by using a holistic approach: when a common knowledge domain is modeled, it is likely that most domain experts, even working in different languages, can identify a common “core” set of concepts. A single, language independent ontology could therefore be used to model such an area, sharing concepts between different languages.

In the initial phase, this new stream of research was affected by a common misunderstanding, strictly related to concept naming. Many times, in fact, the concept name and/or its definition is considered as the concept itself, while the latter is actually an abstract entity to which people refer by using words.

A good definition of concept can be: “a well defined sequence of mental processes”. In fact, words are only used as a trigger for the mental association that lets us identify and instantiate concepts. In other words, a generic concept, a dog for example, can be correctly identified using any alphanumeric string without any loss of generality. On the other hand, ontology designers usually define concepts using words, or sequences of words, in order to easily identify the meaning of the described entity. Naming concepts with language specific descriptions is just a usability trick rather than a strict requirement, and should have no semantic implications.

An ontology is by definition language independent, while its instantiation in a specific idiom is effectively achieved by the adoption of a proper set of textual descriptions for each concept, in each supported language. Such an approach is inherently scalable, as new languages can be easily supported by integrating new lexical entities and definitions, and possibly by slightly restructuring a small number of ontology nodes. Moreover, special purpose relationships could be defined (e.g., links to Wordnet [13] entities), providing ground for sophisticated functionalities.

The H-DOSE approach uses a language independent ontology (an approach first presented at SAC 2004, ACM Symposium on Applied Computing, Cyprus) where concepts are defined as high-level entities for which language dependent definitions are specified. Such semantic entities are linked to a set of different definitions, one for each supported language, and to a set of words called “synset”. Operationally they can be defined as:

concept :== concept ID, lex

lex :== (lang ID, description, synset)+

synset :== (word)+

A concept definition is a short, human readable text that identifies the concept meaning as clearly as possible, and that is expressed in a specific language. A synset, instead, is composed of a set of near-synonymous words that humans usually adopt to identify the concept.
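As a hedged sketch of this structure (the element names below are hypothetical and only mirror the grammar above; the platform’s actual serialization may differ, and namespace declarations are omitted), a language independent concept with English and Italian lexical layers could be serialized as:

<hdose:concept rdf:ID="c42">
  <hdose:lex>
    <hdose:lang>en</hdose:lang>
    <hdose:description>A domesticated carnivorous mammal.</hdose:description>
    <hdose:synset>dog hound</hdose:synset>
  </hdose:lex>
  <hdose:lex>
    <hdose:lang>it</hdose:lang>
    <hdose:description>Mammifero carnivoro addomesticato.</hdose:description>
    <hdose:synset>cane</hdose:synset>
  </hdose:lex>
</hdose:concept>

Note that the concept identifier c42 carries no linguistic content: supporting a new language only requires attaching a new lex entry, leaving the ontology structure untouched.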

The process of linking ontology concepts and lexical entities can be deployed according to three different approaches: integration, annotation and hybrid. In the integration approach, lexical entities are included in the ontology as new semantic entities, a sort of special instances defined for each ontology concept. This approach makes automated reasoning on lexical entities much easier. The disadvantage is that, whenever a new lexical entity must be added or an already existing term must be modified, the entire ontology is involved.

The annotation approach solves this shortcoming by keeping the ontology and the lexical entities separate. Lexical entities, which can be, for example, the senses of a Wordnet-like lexical network, are then tagged as “related” to a given ontology concept. Tagging makes it possible to modify, update or even delete lexical entities without modifying the ontology.

The hybrid approach, finally, mixes the first two by keeping lexical entities and ontology concepts separate, and by restructuring the networks between lexical entities (Wordnet) in order to better reflect the ontology view of the given knowledge domain. This approach is a little more flexible than the integration approach, since modifications in lexical entities do not necessarily impact the ontology definition. However, the allowed changes can only be relatively small, otherwise the structure of lexical entities would likely require a revision in order to keep reflecting the domain view imposed by the ontology.

H-DOSE adopts the annotation approach by keeping the ontology physically distinct from definitions and synsets. This allows a separate management of concepts and language-specific information, and a complete isolation of the semantic and the textual layers. Language specific semantic gaps are supported by including in some concepts the definition and synsets in the relevant languages only. This assumption guarantees sufficient expressive power to model conceptual entities typical of each language and, at the same time, reduces redundancy by collapsing all common concepts into a single multilingual entity. The final resource occupation is, to a great extent, comparable to that of a monolingual framework, thanks to redundancy elimination, while expressive power is as effective as needed.

Synsets and textual definitions should be created by human experts through an iterative refinement process. A multilingual team works on concept definitions by comparing ideas and intentions, aided by domain experts with linguistic skills in at least two different languages, and formalizes topics in a mutual learning cycle. This interaction cycle produces, at the end, two sets of concepts: general concepts and language-specific concepts (Figure 5.2).

Figure 5.2. The H-DOSE approach to multilingualism.

The two sets are modeled in the same way inside the ontology. However, concepts belonging to the first category will be linked to definitions and synsets expressed in each supported language, while those belonging to the second set will be linked to smaller subsets of languages.

The complex interaction between ontology designers, users and domain experts required by this approach at design time must build upon the availability of an international network in which people cooperate to model a defined knowledge domain. Such kinds of networks have already been proposed, for example in the EU Socrates Minerva CABLE project [3]. CABLE involves a group of partners with proven skills in learning and education and promotes cooperation to define learning materials and case studies for the continuous education of social workers. In CABLE, teams of experts in social sciences and education cooperate, with the support of “multilingual” domain experts, in defining case studies and teaching procedures and in identifying the related semantics in a so-called “virtuous cycle”. The CABLE project can effectively apply the H-DOSE approach to multilingualism, leveraging its “virtuous cycles” and defining a multilingual ontology for education in social sciences. The resulting ontology, definitions and synsets can constitute an effective core for the implementation of multilingual, semantic, e-Learning environments.

Formal definitions

An H-DOSE ontology is formally defined as a set O of concepts c ∈ C and relations r ∈ R, together with a set of labels L, descriptions E and words S associated to the concepts:

O : {C, R, χ, L, E, ψ}

where

C : concepts, c ∈ C
R : relations, r ∈ R, with R ⊆ C × C
χ : R → [0,1]
W_lang : words of a language ⟨lang⟩, w ∈ W_lang
S : sets of near-synonymous words, s ∈ S, with s ⊆ C × W_lang
ψ : S → [0,1]
L : labels of concepts, l ∈ L, with l ⊆ C × W_lang
E : descriptions of concepts, e ∈ E, with e ⊆ C × W_lang

C is the set of concepts c in the ontology O, while R is the set of relations r relating ontology concepts to each other with a given strength χ in a range between 0 and 1. W_lang is, instead, the set of all possible words w in a given language; S is the set of near-synonymous words associated to each ontology concept c, with an association strength evaluated by ψ in the range between 0 and 1. L is the set of labels l associated to a given ontology concept c in a given language, and E is the set of descriptions e associated to such concepts, in the same given language.

Ontologies, in H-DOSE, shall conform to this approach; however, they are allowed to refer or link to other already existing ontologies. In such a case, multilingualism is supported only for the internal ontology or, better, only for the ontologies adopting the above format. Multilingualism support for external ontologies depends on how such a concern has been addressed by the designers of the external models.

5.1.2 Resources

Resources in the H-DOSE platform are considered to be texts, written either in HTML, XHTML or plain text. The main motivation for such an assumption is that the platform has been designed for supplying semantic services to today’s web applications, and texts are currently the most widespread resources on the Web. However, in order to take into account the 7th non-functional requirement for a semantic platform, resource support is designed to be easily extensible to the multimedia case. In H-DOSE, therefore, a resource is “something, in the platform knowledge domain, about which some information can be provided”, where the term “something” is assumed to be a document, either textual or, in a future extension, multimedia.

Documents can be entire web pages, but they can also be simple, homogeneous chunks of text (or video, in the near future). In this last case, support for defining relationships between fragments is provided, making it possible to specify which fragment belongs to which page.

Semantic annotation and retrieval of fragments is one of the innovative points of the H-DOSE platform. Fragments, in fact, allow different levels of granularity to be taken into account in the classification and retrieval of resources. This makes it possible, for example, to extract and retrieve only those document pieces that are similar to a given user query, eliminating all the disturbing content that usually surrounds relevant information, such as banners, links, navigation menus, etc. Moreover, the ability to specify the mutual relationships between fragments, and between fragments and pages, allows redundancy to be reduced as much as possible, while maintaining the relevance of provided results.

So, if many fragments of a given web page are relevant with respect to a user query, the entire page is retrieved rather than each of its components. In other cases, if a document about the jaguar’s life is well articulated in sections, including, for example, an introduction, some more detailed sections and a conclusion, the retrieved result may vary depending on the level of granularity of the user query. Therefore, if a user requires “documents about the life of the jaguar”, only the introduction may be retrieved, while if the query is more specific, “documents about jaguar nutrition in the Bangladesh jungle”, or if the user chooses to deepen the previous query, the H-DOSE platform can retrieve the internal sections of the document, better adapting results to the user query.

Besides the granularity with which resources can be manipulated by H-DOSE, the storage policy also deserves some attention. In H-DOSE, documents are not managed directly by the platform; that is to say, H-DOSE does not store indexed documents in its internal database. This design choice allows, on the one hand, leveraging the already existing and probably more efficient storage facilities of CMSs, learning systems and IR systems. On the other hand, it makes it possible to semantically describe resources that do not belong to the site in which the platform is deployed. H-DOSE does not make any assumption on the ownership of annotated resources: they can be owned by whoever deploys the platform as well as by other actors. In any case they can be indexed, classified and retrieved by the platform through the adoption of an external annotation scheme where resources are identified by means of URLs and XPointers.

With respect to the ontology formalization of the previous section, resources in H-DOSE are formally described as follows:

D : documents d ∈ D

be they entire pages or fragments.

5.1.3 Annotations

Annotations are the most important component of a semantic platform, since they describe the resources from a conceptual point of view and since they constitute the means for providing concept-based functionalities such as semantic search.

H-DOSE, according to functional requirement number 13, adopts an approach that keeps annotations and resources well separated. With this approach several critical issues can be addressed: first, one can annotate non-owned resources, i.e. resources for which the classifier has no editing rights; second, annotations are only loosely coupled with the resource they annotate. This allows a page to change formatting or slightly change content while the corresponding annotation stays unchanged. On the other side, if a resource becomes obsolete and is retired, the system can address the issue by simply deleting the corresponding descriptions. Annotations in H-DOSE are formally defined as:

A : annotations a ∈ A

A ⊆ C ×D

ρ : A → [0,1]

Where D is the set of all resources d suitable for annotations, A is the set of semantic annotations relating the resources in D with the ontology concepts in O, and ρ is the association weight between resources and ontology concepts. The weight function ρ allows specifying different degrees of “relatedness” to ontology concepts, in a way similar to what the Vector Space model does for classical information retrieval systems, thus obtaining a flexible way of representing knowledge and of tackling resource ambiguity. In RDF notation, an H-DOSE annotation looks like:

<hdose:annotation rdf:ID="15643">

  <hdose:topic rdf:about="#jaguar"/>

  <hdose:document rdf:about="#doc123"/>

  <hdose:weight rdf:datatype="&xsd;double">0.233</hdose:weight>

  <dc:author>H-DOSE</dc:author>

  <hdose:type>auto</hdose:type>

</hdose:annotation>

Every resource semantically classified by the platform is pointed to by many annotations, each relating the document with a given concept in the platform ontology, with a certain weight. Clearly, the number of annotations per document can easily become very high and difficult to manage, as ontology concepts may be numerous and documents can span a great number of different, related topics. Therefore, in this work, a new knowledge representation is introduced that retains the ability to include information from the ontology structure, and from knowledge discovery processes such as logical inference, through the definition of an expansion operator. This representation3 allows collapsing all the annotations referred to a resource into a single “Conceptual Spectrum”.

A conceptual spectrum is formally a function mapping a given concept c to a positive real number σ(c) expressing the relevance weight of that concept with respect to a given resource:

σ : C → ℝ⁺

Together with the ability to merge the fuzziness of documents with the crispness of the ontology specification, spectra are also useful for performing visual inspection of knowledge bases. In fact, they can be visualized using the ontology concepts as the x-axis and the σ(c) values as the corresponding y-values (Figure 5.3).

Since concepts do not possess an implicit ordering relation, the x-axis can be defined in several ways, allowing the analysis of different aspects of a knowledge base. Depth-first navigation of the ontology using the “subclassOf” relationship, as an example, orders similar concepts, at the same granularity, into nearby positions, allowing good discrimination of the ontology sub-graphs involved in a document annotation. Breadth-first navigation, instead, allows the detection of the abstraction level of the indexed resources by putting together concepts lying at the same level in the ontology. In any case, the resulting graphs have exactly the same capabilities in terms of expressive power and matching properties, differing only in their visual interpretation.

3Firstly presented at ICTAI 2004, International IEEE Conference on Tools with Artificial Intelligence, Boca Raton, Florida.

Figure 5.3. A “raw” conceptual spectrum (as obtained by simply composing annotations).

A conceptual spectrum σd associated to a document d is a conceptual spectrum measuring how strong the association between ontology concepts and the document is, taking into account the contribution of the semantic relationships involved in the knowledge domain. Formally:

σd(c) = Σ_{(c′,d′) ∈ A ∧ c′=c ∧ d′=d} ρ(c′,d′)

i.e., for each ontology concept c, the document conceptual spectrum value is defined as the sum of the ρ contributions extracted from all the annotations associating the document d with the concept c.
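As a minimal sketch of this definition, a raw document spectrum can be built by accumulating the ρ weights of the annotations pointing to a document; the Annotation record and all names below are hypothetical, not part of the platform.

import java.util.*;

class Annotation {          // a ∈ A, hypothetical record
    String conceptId;       // c
    String documentId;      // d
    double weight;          // ρ(c, d)
}

class SpectrumBuilder {
    // Raw spectrum σd: for each concept c, sum the ρ weights of the
    // annotations relating document d to c.
    static Map<String, Double> rawSpectrum(String docId, List<Annotation> all) {
        Map<String, Double> sigma = new HashMap<>();
        for (Annotation a : all) {
            if (a.documentId.equals(docId)) {
                sigma.merge(a.conceptId, a.weight, Double::sum);
            }
        }
        return sigma;
    }
}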

5.2 Principles of Semantic Resource Retrieval

Several concept-based search services proposed in the Semantic Web community rely, to a certain degree, on logic inference to extract resources from a given domain in response to a user query. Basically, they use a reasoning engine to verify an input clause built from the user query, and they subsequently rank the retrieved results. However, some problems arise when dealing with document retrieval, where resources are annotated through a “speak about” relationship rather than being instances of specific ontology concepts.

5.2.1 Searching for instances

Conceptual search engines provide information to users by working on the concepts involved in the user query and occurring in indexed resources. Their action is organized in two phases: the first uses inference to extract conceptually relevant instances, while the second tries to discriminate the retrieved resources with respect to the user query, or to the kind of retrieval process, assigning a different relevance value to each of them. The resulting resource set is ranked according to the computed resource relevance and proposed to the user.

The first phase of a conceptual search includes logic reasoning or logic inference. The user query is represented as a clause, or a set of clauses, to be logically satisfied by facts and axioms defined in the domain ontology, and the knowledge base is subsequently navigated to find suitable instances. In other words, instances and ontology concepts are merged into a cumulative graph that is traversed by the reasoning engine to find a match between the modeled knowledge and the user query. If no match is found, nothing can be deduced except that the knowledge base does not model the given domain with enough information to answer the user query. Otherwise, the set of facts and axioms satisfying the query is provided, allowing the identification of relevant resources.

It is important to notice that all provided results are equally relevant with respect to the user query, since they are all able to satisfy the query's logical clauses. However, discrimination should be performed, according to some external measures, in order to provide a small, highly relevant set of results to the final user; this issue is addressed in the second phase.

In that phase an ordering function is defined over the set of resources obtained through logical inference. As an example, we might want to query our knowledge base for lovely cats living in our city. The inference process would provide all cat instances for which the properties “lives in (city)” and “is lovely” hold. However, there are no means to evaluate how lovely a cat is from the logical point of view. Some more information can therefore be taken into account for ranking results according to user needs: the amount of “loveliness”, for instance, expressed as the percentage of persons judging a given cat lovely.

5.2.2 Dealing with annotations

Although the inference process is effective for instance search, there are some issues that should be addressed when dealing with documents and semantic annotations. Annotations can be defined either by humans, by carefully analyzing the contents of resources with respect to the ontology, or by machines, using information extraction algorithms. In the first case annotations possess a great degree of trustworthiness, since they are defined by “experts”, but they are relatively few, since they are very expensive to create. On the other hand, machine-extracted annotations are typically available in great numbers, but they are likely to be less precise than the human-generated ones.

A relevance weight specified in each annotation predicate therefore allows taking into account these different degrees of trustworthiness and reliability, supporting the definition of different association strengths between documents and the corresponding concepts in the ontology. In other words, since documents generally span topics broader than a single concept, they will be pointed to by a considerable number of weighted annotations, each relating the document with a given ontology concept.

Annotations, in contrast to instances, deal with knowledge sources that retain a certain degree of fuzziness, which is addressed through the definition of annotation relevance; however, ontology constraints are still valid, and the domain model of concepts and relationships still needs to be taken into account in order to provide semantically rich services. In particular, inference is still useful to enable systems to discover previously unknown knowledge; the only further requirement is that the fuzziness of resources and the crispness of logic inference merge together in a common environment.

In H-DOSE the means for such a merging is provided by the “Expansion Operator” defined for conceptual spectra. Considering the spectrum definition provided in section 5.1.3, it is easy to notice that spectra are simply a way of collapsing many annotations into a single object. They do not take into account the ontology structure, i.e. the conceptual specification of domain semantics provided by concepts and relationships in the ontology. Moreover, when annotations are automatically extracted, they retain a considerable noise component that affects the resource spectrum, adding uncertainty to the correctness of the conceptual representation: there can be wrong annotations, missing annotations, etc.

In order to overcome this issue it is mandatory to exploit the ontology model, including all available information, with a particular focus on the semantic relationships between modeled concepts. In particular, sub-symbolic information, expressed as the strength of relationships, must be taken into account in order to correctly evaluate conceptual spectra, reducing the sensitivity to annotation noise. Relation weights, in fact, capture how much a concept is related to another concept in the ontology, adding quantitative information that is complementary to the logical constraints on ontology navigation defined by the relation semantics (transitivity, inheritance, ...). Such values are therefore critical for an information-related application, since they establish how much a relation correlates two different concepts and how much that relation should contribute to the definition of resource semantics (spectra). Relation relevance weights must be specified during the iterative ontology design process and must be validated by domain experts in order to assess the conceptual coherence of the resulting knowledge base.

To gain a better focus on this issue, we can think of conceptual spectrum components as concept clouds in the ontology. Strongly related concepts are grouped together by means of relationships, and annotations can be seen as the starting seeds of these groups. Even if the related concepts do not appear in the original annotation set, due to wrong or missing mappings, they should take part in the document conceptual specification, as they are conceptually related to the spectrum. Therefore, a new spectrum operator should be defined to discover the set of clouds relevant for each specific component of the conceptual spectrum associated to a given resource, taking into account both non-explicit and sub-symbolic knowledge, such as inferred relationships and relation strengths.

Such an operator is called the Spectrum Expansion operator X, formally defined as:

X : (C → ℝ⁺) → (C → ℝ⁺)

(Xσ)(c) = σ(c) + Σ_{(c,c′) ∈ R*} χ*(c,c′) · σ(c′)

The expansion operator X thus provides a new conceptual spectrum Xσ as output, whose value, for each ontology concept c, is defined as the sum of the original spectrum value σ(c) and of the overall contribution from the concepts c′ related to c. This last value, in particular, is computed by taking the original spectrum value of each related concept c′ and multiplying it by the strength χ* of the relationships between c and c′ in the transitive closure R* of the space of relationships R.

R* : transitive closure of R

P(c,c′) : path from c to c′ in R

∀(c,c′) ∈ R*,  χ*(c,c′) = max_{P(c,c′)} ∏_{(c′′,c′′′) ∈ P(c,c′)} χ(c′′,c′′′)

The transitive closure R* includes the definition of the χ* function as the maximum-strength path between c and c′, where the path strength is computed by multiplying all the χ values of the relationships included in the path.

In other words, the spectrum expansion process takes a raw conceptual spectrum σ as input, i.e., a spectrum as it comes from the manual or the automatic annotation process. Then, it processes the spectrum by analyzing the ontology and by propagating the relevance weights of each spectrum component through both hierarchical and non-hierarchical relationships. X consists, therefore, of a graph navigation on the ontology, where each relation has an associated weight, which assesses the conceptual distance between linked concepts, and an orientation, which identifies the directions in which navigation can be performed. The navigation result is an enhanced spectrum Xσ in which the original topics, together with their relevance weights, appear surrounded by clouds of topics extracted by means of the expansion operator.
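A minimal sketch of how X might be computed, assuming concepts indexed by integers and a dense weight matrix: χ* is obtained with a Floyd-Warshall-style max-product closure over the relation weights, after which the expansion is a single pass over the spectrum. This is an illustration, not the platform's actual code.

class SpectrumExpander {
    // chi[i][j] holds χ(ci, cj) for a direct relation, 0 otherwise.
    // Returns χ*: the maximum product of χ values along any path,
    // computed with a Floyd-Warshall-style max-product closure
    // (well defined here because all χ values lie in [0,1]).
    static double[][] closure(double[][] chi) {
        int n = chi.length;
        double[][] best = new double[n][n];
        for (int i = 0; i < n; i++) best[i] = chi[i].clone();
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    best[i][j] = Math.max(best[i][j], best[i][k] * best[k][j]);
        return best;
    }

    // (Xσ)(c) = σ(c) + Σ_{(c,c') ∈ R*} χ*(c,c') · σ(c')
    static double[] expand(double[] sigma, double[][] chiStar) {
        double[] expanded = sigma.clone();
        for (int c = 0; c < sigma.length; c++)
            for (int c2 = 0; c2 < sigma.length; c2++)
                if (c != c2)
                    expanded[c] += chiStar[c][c2] * sigma[c2];
        return expanded;
    }
}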

Expanded spectra such as the one in Figure 5.4 are much more useful for search tasks, since they cover every possible nuance of document semantics according to a given ontology. This makes it possible to retrieve both documents where the required concepts directly occur and other, related documents, even if the original query does not exactly match their raw (not expanded) semantic classification.


Figure 5.4. The spectrum of Figure 5.3 after the expansion.

5.2.3 Queries

One of the valuable properties of conceptual spectra is that they can be used to represent documents as well as queries. A query is formally defined as follows:

Q : queries q ∈ Q

Q ⊆ 2^W

Where 2^W is the set of all possible combinations of words w in W. Given that for each ontology concept c a synset s ∈ S is specified, a query conceptual spectrum can be extracted as reported below:

σq(c) = Σ_{(c′,w′) ∈ S ∧ c′=c ∧ w′ ∈ q} ψ(c′,w′)

For each concept c in the ontology, the query spectrum is defined as the sum of the contributions of all query terms w ∈ W, modulated by the ψ function associated to the relation S.
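Reusing the illustrative SynsetEntry record sketched earlier, the query spectrum extraction can be outlined as follows; all names remain hypothetical.

import java.util.*;

class QuerySpectrum {
    // σq(c): for each concept c, sum ψ(c, w) over the query words w
    // that appear in the synset associated to c.
    static Map<String, Double> fromQuery(Set<String> queryWords,
                                         List<SynsetEntry> synsets) {
        Map<String, Double> sigmaQ = new HashMap<>();
        for (SynsetEntry s : synsets) {
            if (queryWords.contains(s.word)) {
                sigmaQ.merge(s.concept.id, s.strength, Double::sum);
            }
        }
        return sigmaQ;
    }
}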

5.2.4 Searching by conceptual spectra

H-DOSE adopts a semantic search paradigm based on the notion of conceptual spectrum. The search process is organized as follows: firstly, the entire annotation base is expanded into expanded conceptual spectra that form the system knowledge base. This operation needs to be performed only once and can be carried out entirely off-line. Secondly, a series of operations is performed that, starting from a user query, provides a ranked set of URIs.


When a query is received, it is mapped into ontology concepts, obtaining the spectrum σq according to the query spectrum definition of section 5.2.3. The resulting spectrum is then expanded in order to get an expanded spectrum comparable to those stored in the operating knowledge base.

A spectrum matching algorithm subsequently extracts the most relevant associations from the knowledge base, identifying a set of documents relevant to the user query. The retrieved URIs are finally ranked according to a similarity measure that takes into account the difference in shape between the spectra of the query and of the retrieved documents, and the top URIs are provided as the result.

Query matching

Document spectra are expanded at indexing time and stored in the annotation base that will be used for query matching and relevant document retrieval. Conversely, queries are translated into expanded spectra at runtime, before searching the annotation base for a match. However, from a search engine point of view, both are spectra in a common, homogeneous space, and the retrieval task simply corresponds to resource matching in that space.

In order to perform spectra matching, a similarity function can be defined by extending the Vector Space model for information retrieval. This extension interprets the two spectra to be compared as two vectors in an n-dimensional space, where ontology concepts c represent dimensions and the corresponding relevance weights σ(c), computed by means of the expansion operator X, are the vector components.

Searching for a match in terms of shape is, in that space, equivalent to searching for vectors having the minimum angular distance between them, i.e. for vectors with similar directions. Therefore, the similarity Sim(q,d) can simply be computed by evaluating the cosine of the angle between the two spectra.

Sim(σq,σd) = cos(φq,d) = (σq · σd) / (|σq| · |σd|)
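Treating the spectra as sparse vectors indexed by concept identifiers, the similarity computation can be sketched as follows (an illustration, not the platform's code):

import java.util.Map;

class SpectrumMatcher {
    // Sim(σq, σd) = (σq · σd) / (|σq| · |σd|), over sparse spectra.
    static double similarity(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0, normQ = 0, normD = 0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
            normQ += e.getValue() * e.getValue();
        }
        for (double v : d.values()) normD += v * v;
        return (normQ == 0 || normD == 0)
                ? 0.0
                : dot / (Math.sqrt(normQ) * Math.sqrt(normD));
    }
}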

5.3 Bridging the gap between syntax and semantics

Until now, methods for representing conceptual descriptions of resources and queries, and methods for matching them in a common vector space, have been explained. However, nothing has been said on how to extract these descriptions from real-world data such as HTML or XHTML documents.

The ontology definition adopted by H-DOSE already provides the means for a simple “bag of words” classification paradigm. This paradigm works by searching documents for words occurring in synsets. For each occurrence found, a score is computed, for example by using a tf · idf weighting scheme. The contributions of all the words in a synset S associated to a concept c are combined into the value of a single spectrum component relative to the concept c, thus defining the final semantic characterization of a given resource.
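The text names tf · idf only as an example weighting scheme; the sketch below is one plausible realization of the per-concept score, with all data structures assumed for illustration.

import java.util.*;

class BagOfWordsScorer {
    // One plausible tf·idf scoring of a concept's synset words against a
    // document: tf is the word frequency in the document, idf is computed
    // over the indexed collection. The result becomes the spectrum
    // component associated to the concept.
    static double conceptScore(Set<String> synsetWords,
                               Map<String, Integer> docTermFreq,
                               Map<String, Integer> docFreq,
                               int totalDocs) {
        double score = 0;
        for (String w : synsetWords) {
            int tf = docTermFreq.getOrDefault(w, 0);
            int df = docFreq.getOrDefault(w, 0);
            if (tf > 0 && df > 0) {
                score += tf * Math.log((double) totalDocs / df);
            }
        }
        return score;
    }
}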

The “bag of words” is a quite widespread, although very simplistic, method for classifying resources into classes (in the H-DOSE case, concepts in an ontology). Other, more sophisticated methods can be applied; for example, the latest version of the H-DOSE platform uses SVM (Support Vector Machine) classifiers for associating documents to ontology concepts. Beyond discussing the effectiveness of classification methods, which have been extensively studied in the Information Retrieval research field, as the TREC conference series [14] demonstrates, the main focus here is on filling the gap between purely syntactic classification and semantics. In this respect, the “bag of words”, although very simple, is a quite powerful method, since the words in synsets are defined by experts and discriminate resources quite precisely, using tacit human knowledge, i.e. the skills of domain experts, to ensure the correctness of the mappings. In addition, synsets so defined can also be used as seminal training information for more complex classifiers such as SVMs. Unfortunately these expert-defined synsets, while being assumed error-free, are clearly not exhaustive, nor are they sufficient for performing effective enough classification. Nevertheless, they can represent seed information for automatically widening the lexical coverage of the semantic platform, thus allowing the context of indexed resources to be better recognized, avoiding errors related to the ambiguity of terms.

In H-DOSE, before applying any kind of classification method, the expert-defined synsets are therefore expanded to a more usable size (around 10-15 terms for each ontology concept). Two methods4 have been developed for performing this expansion, which can work in cooperation to maximize the expansion effect while minimizing the errors that the process inevitably introduces. Both methods leverage WordNet-like lexical networks for extracting relevant terms: the first recognizes the sense of terms by using the ontology structure, while the second uses statistical information on term co-occurrence.

5.3.1 Focus-based synset expansion

The simplest synset expansion is the retrieval of a group of words in a synonymy relationship with the existing ones. This step could easily be performed using lexical nets, starting from a minimal set of words defined by experts and applying a transitive closure of the synonymy relationship. However, this would neglect the sense parameter, and it does not guarantee that the expanded synset is self-consistent and wholly relevant with respect to the concept specification. For example, the concept “Business broker” could be represented by the expert-defined word “Agent”. Unfortunately, “Agent” in the computer science context is a software entity with no connection at all with financial markets. Therefore, simply retrieving synonyms from lexical nets would produce an inconsistently expanded synset in which both software and finance terms appear. With these premises, for every word to be used in the expansion process it is necessary to discriminate between its different senses, to identify the synonyms according to the relevant sense only, and to avoid adding misleading results to the knowledge base. This operation is typically called sense disambiguation in the literature. To perform sense disambiguation, H-DOSE uses a technique inspired by the focus-based approach defined by Bouquet et al. [15]. In this approach the focus is defined as a concept hierarchy containing the original node, all its ancestors, and their children. Thus the focus of a concept is the part of the ontology necessary to understand its context. Let us consider an example to better clarify the definition: referring to the ontology in Figure 5.5, we want to expand the synset of the concept “Agent”.

4Firstly presented at SAC2005, ACM Symposium on Applied Computing, Santa Fe, New Mexico.

Figure 5.5. The focus of “Agent”.

Concepts surrounded by dashed lines compose the focus of the concept “Agent”. If a search for “Agent” synonyms is performed in WordNet, as an example, one of the provided terms is “broker”. However, in the above ontology, agents have nothing in common with businessmen and are only proactive software entities. The focus of the concept would easily allow rejecting the term “broker”, since it is not related to software, software entities and so on. This approach, although quite effective in sense disambiguation, has some limitations in the definition given in [15]. The original approach relies on label names to contextualize a concept, and this brings some extra constraints for ontology designers. Beyond that, the approach is not scalable to multilingual environments, in which concept labels are meaningless. Finally, relying on a single word per concept to understand the context increases the chances of misunderstanding the correct sense. For this reason, the technique depicted here, while finding its inspiration in the approach described above, and in particular in the focus of a concept, can disambiguate senses while maintaining full multilingual support and making full use of the H-DOSE ontology definition. Instead of concept labels, the synsets defined by experts are used, obtaining, on one side, a greater number of possible terms, since the starting set is wider than a single word (the label), and improving, on the other side, the sense detection capability, because more words concur in the focus definition, allowing better sense disambiguation. The underlying idea is that, for every word existing in the synset of a concept, we can detect its correct sense by checking the “similarity” between every synset5 (which depends on the sense parameter) of the lexical net and the synsets of the focus of the concept (see Figure 5.6 for the method pseudo-code).

Vector synset = getSynset(ontoClass);
Vector expansion = new Vector();
Vector focus = calculateFocus(ontoClass);
Vector focusWords = new Vector();

for-each concept in focus
    focusWords.add(concept.getSynset());
end for

for-each word in synset
    Vector senses = SemNet.getSenses(word);
    int counter[senses.size];
    for-each fWord in focusWords
        for-each sense of senses
            counter[sense] += occurrences of fWord in
                              SemNet.getSynonyms(word, sense);
        end for
    end for
    maxSense = sense for which counter[sense] is max;
    expansion += SemNet.getSynonyms(word, maxSense);
end for

return expansion;

Figure 5.6. Pseudo-code of the focus-based expansion

5There is a clash between the term “synset” adopted for denoting the words associated to concepts and the term “synset” adopted in lexical networks for defining words with similar senses. Here the term “synset” refers to the latter meaning.
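The pseudo-code of Figure 5.6 relies on a calculateFocus step that is not spelled out. Under the definition given earlier (the node itself, all its ancestors, and the children of each ancestor), a sketch could look like the following; the hierarchy accessors are assumed for illustration, not part of a documented API.

import java.util.*;

interface HierarchyNode {
    HierarchyNode getParent();            // null at the ontology root
    List<HierarchyNode> getChildren();    // direct "subclassOf" children
}

class FocusCalculator {
    // Focus of a concept: the concept itself, all of its ancestors,
    // and the children of each ancestor.
    static Set<HierarchyNode> calculateFocus(HierarchyNode c) {
        Set<HierarchyNode> focus = new LinkedHashSet<>();
        focus.add(c);
        for (HierarchyNode a = c.getParent(); a != null; a = a.getParent()) {
            focus.add(a);
            focus.addAll(a.getChildren());
        }
        return focus;
    }
}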


5.3.2 Statistical integration

The focus-based method has been integrated into a more powerful and general synset generation architecture, where statistical information on word co-occurrence is also taken into account. The underlying assumption is that, although the effectiveness of the focus approach is high, some relevant words could be missed or simply misclassified and rejected. To address this issue, strategies have been designed for detecting missed relevant words, using the information already encoded in synsets.

Vector expansion = new Vector();
int length = synset.length;

for-each lexicalEntity in synset
    allSynonyms += get all the lexicalEntity synonyms from
                   SemNet, regardless of their sense;
end for

for-each syn in allSynonyms
    if (syn.numberOfOccurrences > threshold*length)
        expansion += syn;
    end if
end for

return expansion;

Figure 5.7. Pseudo-code of the statistical-based expansion

The first step of these refinement strategies takes a synset as input and performs a search on lexical nets, retrieving, for each synset term, all the available synonyms, regardless of their senses. Then a set of policies is applied, basically searching the synonym set for co-occurrences of words. This approach is purely statistical: every term is ranked according to its occurrence frequency and filtered using an adaptive threshold. Repeated terms are likely to be related to the context to which the ontology refers. The pseudo-code of the method which, having received a synset, returns its expansion, is presented in Figure 5.7. To achieve better results, it is possible to integrate this approach with the formerly presented one by using the output of the focus-based expansion as input for the statistical method. In this way, the statistical integration, together with the major contribution from the focus-based expansion process, takes part in the final synset definition, allowing for the creation of a suitable set of words. The overall word-set precision, with respect to the conceptual specification, may decrease in the expansion process, while still remaining effective enough to be used by classification engines. Assuming that the expert-created synsets have a precision of nearly 100%, the automatic expansion, while bringing new useful information, may in fact include some misleading terms.


Therefore, the increase in synset size (and conceptual coverage) is usually balanced by a potential loss of precision.

5.4 Experimental evidence

This section provides the experimental evidence6 for the principles introduced in the previous sections, and in particular for the multilingual approach, for the spectrum representation and the spectra-based search, and for the synset expansion techniques.

5.4.1 Multilingual approach results

Multilingual functionalities in the H-DOSE platform have been tested in a simple experimental setup, aimed at assessing the feasibility of the adopted approach. The tests used a disability ontology developed in collaboration with the Passepartout service of the City of Turin, a public service for the integration and assistance of disabled people in Italy. Two HTML pages, available in both English and Italian, were randomly selected from the Passepartout web site. The disability ontology, originally paired with an Italian lexicon, has been integrated with an equivalent English lexicon.

The first test aims at showing the architecture's support for multilingualism; to achieve this goal, the Italian and English documents have been indexed, obtaining 85 annotations for the English version and 58 for the Italian one, automatically recognizing the document language. Ideally, at the semantic level, the platform should annotate every pair of translated page fragments with an equivalent set of concepts. To verify this property, the annotations stored in the repository for both the Italian and English documents have been analyzed, and the correlation factor between detected fragments has been computed at different levels of fragmentation, i.e., at the 〈Hx〉 tag level (see Table 5.1) and at the 〈P〉 tag level (see Table 5.2). The adopted correlation measure is the cosine of the angle between the vectors representing the document fragments, defined according to the classical vector space model and considering each concept as an independent dimension. Rows and columns are labeled according to the source of the fragments, using a “language/page/fragment” syntax.

The correlation data, at each fragmentation level, has been grouped into two sets: the first is composed of the elements lying on the table diagonal, while the second includes all the remaining elements. The elements lying on the table diagonal are the correlation factors between the same fragments expressed in different languages. Figures 5.8 and 5.9 depict the correlation distributions for the two sets at the 〈Hx〉 and 〈P〉 fragmentation levels, respectively.

6Results in this section have been extracted using the former version of the H-DOSE platform, simply called DOSE, which supported fragmentation at the HTML tag level.


IT \ EN     Pg1/H1[1]  Pg1/H2[1]  Pg2/H1[1]  Pg2/H2[1]  Pg2/H2[2]
Pg1/H1[1]   0.69       0.62       0.28       0.29       0.34
Pg1/H2[1]   0.73       0.74       0.19       0.18       0.27
Pg2/H1[1]   0.30       0.36       0.69       0.46       0.54
Pg2/H2[1]   0.19       0.22       0.54       0.50       0.38
Pg2/H2[2]   0.21       0.25       0.49       0.36       1.00

Table 5.1. Correlation factor between fragments at the 〈Hx〉 fragmentation level.

IT \ EN    Pg1/P[2]  Pg1/P[3]  Pg2/P[1]  Pg2/P[2]
Pg1/P[2]   0.59      0.36      0.21      0.30
Pg1/P[3]   0.00      0.79      0.00      0.00
Pg2/P[1]   0.32      0.00      0.50      0.38
Pg2/P[2]   0.36      0.00      0.36      1.00

Table 5.2. Correlation factor between fragments at the 〈P〉 fragmentation level.

It is easy to notice that the distribution of the first set and that of the second set are nearly separated: this result conforms to the initial expectations and ensures that the DOSE platform can work effectively in a multilingual environment. In fact, it shows that corresponding fragments in different languages are annotated as belonging to a common set of concepts, while non-corresponding fragments are kept distinct by annotating them with reasonably different concepts.

Figure 5.8. Correlation between fragments at the 〈Hx〉 level.


Figure 5.9. Correlation between fragments at the 〈P〉 level.

5.4.2 Conceptual spectra experiments

Several tests have been set up in order to assess the feasibility of the spectrum representation together with the effectiveness of the spectra-based search process. The tests used an ontology developed in collaboration with the Passepartout service of the City of Turin. The ontology is composed of about 450 concepts related by means of either hierarchical or non-hierarchical relationships. Eight types of non-hierarchical relationships have been defined in the ontology, such as “Implies”, “Defined by”, etc. For each relationship a weight has been specified by domain experts, starting from the “isA” relation, which has been considered the strongest one. Around 1,000 documents from the Passepartout web site have been indexed, obtaining around 40,700 semantic annotations. Resources have been split into several fragments, and each fragment has been separately annotated in order to provide fine-granularity search results.

Starting from this relatively small annotation base, a search test has been deployed involving 10 people in the evaluation phase. Each evaluator has been given a set of queries (Table 5.3) to be performed and has been asked to judge the relevance of the results with respect to the queries, assigning to the retrieved documents values ranging from 0 (not relevant) to 100 (fully relevant). The search interface was a simple PHP front-end for the DOSE architecture allowing keyword-based searches.

Half of the evaluators used a previous, keyword-based search engine already available in the platform, while the other half exploited the proposed spectra-based engine.

Results have been collected and grouped by query, and the corresponding relevance values have been analyzed and used to draw the following relevance charts.


Queries (Italian / English)
Cieco / Blind
Lavoro Disabile Cieco / Job Disabled people Blind
Trasporto Disabile / Disabled people Transportation
Diritto Sordo Lavoro / Rights Deaf Job
Agevolazioni Cieco / Facilitations Blind

Table 5.3. Test Queries

As can easily be noticed in Figure 5.10, the spectra-based search engine is, on average, able to provide more relevant results than the keyword-based one; it is also able to provide a better ranking of the retrieved documents, placing more relevant results in the first positions of the retrieved set.

Figure 5.10. Comparison between the previous search engine and the spectra-based one on the query “lavoro disabile cieco” (Job Disabled-people Blind).

The results for the entire evaluation phase have been grouped into a chart (Figure 5.11) that shows the average relevance achieved by both search engines on the five different queries, according to the evaluators' judgment. These results have been weighted by taking into account the ranking position of the retrieved documents; therefore the mean relevance value R on the y-axis has been computed as follows.

R = Σ_d rd / rank(d)

where rd is the relevance of the document d and rank(d) its ranking position.

Figure 5.11. Mean relevance scores for both engines over all queries.

The overall performance of the proposed spectra-based search engine exceeds that of the traditional search engine for every query issued. The weighted combination of better ranking and better precision underlines the feasibility of the approach and provides an indication of its effectiveness, being able to work at different conceptual granularities for both queries and indexed documents.

5.4.3 Automatic learning of text-to-concept mappings

The synset expansion and integration techniques introduced in section 5.3 have been tested using the Passepartout ontology adopted in the previously detailed experiments. Firstly, the focus-based expansion was applied, using WordNet as the lexical network, followed by the statistical integration. Starting from an original set of 1973 terms associated to the ontology concepts, the focus-based technique extracted 1200 more words, providing a size increase of about 61%. The subsequent application of the statistical integration led to the extraction of another 700 terms, corresponding to about 21% of the focus-expanded synsets. The precision of the results was higher for the latter set, while still being valuable for the focus-extracted terms. Table 5.4 reports the corresponding figures.

Synset type                     Size   Precision
Original                        1973   100%
After focus expansion           3221   89%
After statistical integration   3897   91%

Table 5.4. Synset expansion results.

In order to perform a better evaluation, the impact of the expanded lexical mappings on the process of indexing and retrieving resources has also been tested. To perform this evaluation, 80 pages from the Asphi website [15], an on-line collection of resources and services for disabled people, have been indexed using a simple “bag of words” indexing powered, respectively, by the original synsets, by the focus-expanded ones, and by the statistically integrated set of terms. Five queries have been issued for each set and, for each query, only the first 20 retrieved resources have been evaluated in terms of precision and recall. As shown by Figures 5.12 and 5.13, while the recall figure increases as more terms are adopted, the precision values stay nearly unchanged, thus showing the effectiveness of the approach.

Figure 5.12. Recall results for 5 queries on the www.asphi.it web site.

Figure 5.13. Precision results for 5 queries on the www.asphi.it web site.


Chapter 6

The H-DOSE platform

This chapter describes the H-DOSE platform in deep detail, focusing on the role of, and the interactions involving, every single component of the platform. The main concern of this chapter is to provide a complete view of the platform, in its more specific aspects, discussing the adopted solutions from a “software engineering” point of view.

The main motivation for the H-DOSE design is to provide a system actually usable in today's Web. According to the vision of this work, this result shall be reached through the analysis of the requirements perceived by the web actors (reported in Chapter 4). In order to respond to some of these requirements, the non-functional requirements especially, a so-called holistic approach has been adopted. The term holistic refers to the integration of different techniques, in particular web services and multi-agent systems, into a common platform for semantic elaboration.

As emerges from the non-functional and usability requirements number 1 and 3, the H-DOSE platform shall be “easily integrable” into already existing publication frameworks and shall be independent from the server-side technologies adopted for publication.

Web services are the state-of-the-art technology for supporting such functionalities. They, in fact, allow system interoperability by adopting open Internet standards, by being able to describe their own functionalities and location, and by interacting with other web services. They are cost-effective, since they replace tightly coupled applications, with the related problems of data passing and interface agreements, by offering a loosely coupled architecture in which the business logic is completely separated from the data layer. Moreover, web services allow the reuse of functionalities by adopting standard description formats: WSDL for the service logic, UDDI for advertising and SOAP for communication. The web service technology is very well suited for “access-type” services, which are not very computationally expensive and usually do not take advantage of replication and distribution among different locations. Web services are, in fact, more suited to accomplishing interface tasks than to implementing the internal business logic of applications, which is usually delegated to more effective processes.

H-DOSE takes this feature into account by adopting web services as standard interfaces for providing semantic services to SOAP-enabled applications. The internal complex tasks are, instead, delegated to the platform's internal layers, where they are executed through a quite different technology: multi-agent systems. Agents possess, in fact, characteristics that make them suitable for accomplishing very intensive tasks, exploiting replication, distribution and location-aware computing. They are “living” software entities located in proper containers that constitute their ecosystem; they have an adaptive and autonomous nature, and social capabilities enabling them to coordinate their actions with others and to cooperate or negotiate to reach a specific design goal.

Since the non-functional requirements, especially requirements number 1 and 2 under the category “performances”, indicate that scalability is one of the main concerns for the platform deployment in real-world applications, H-DOSE uses agents to perform computationally intensive tasks such as the automatic indexing of resources. The adoption of agent services, in fact, allows the natural distribution of tasks, and thus of the associated load, toward the various information sources involved in the classification. So, for example, if a given web site offers a suitable container for agents, the indexing part of the H-DOSE architecture can be replicated on that site, allowing local indexing and distributing the indexing load across all the available information sources, depending on resource location.

6.1 A layered view of H-DOSE

The H-DOSE1 platform adopts a modular, distributed architecture that is deployed on three different layers: the Service layer, the Kernel layer and the Data-access (Wrappers) layer (Figure 6.1).

6.1.1 Service Layer

The Service layer implements the interface between the architecture and the external applications that integrate the functionalities provided by the platform. In this layer, the available services correspond to the high-level functionalities of semantic classification (indexing) and searching. They are implemented by the Indexing service and by the Search service, respectively.

1Firstly published in the proceedings of ICTAI 2003, Sacramento, California. The subsequent revision, corresponding to the version presented in this thesis, has been published in the proceedings of SWAP 2004, Semantic Web Applications and Perspectives, Ancona, Italy.

Figure 6.1. The H-DOSE architecture.

Indexing service

The Indexing service offers a public, SOAP-based interface for performing semantic classification of textual resources. It is basically a queue manager for the corresponding kernel-level service, implemented as a multi-agent system. Two main operation types are supported, namely “indexing” and “batch-indexing”.

The indexing operation offers a simple, interactive way of semantically classifying resources. An external application invoking this operation type is basically required to provide its name or unique identifier, and the URI of the document to be indexed. Some additional information can be provided if, for example, the document is a fragment belonging to a more complex resource. In such a case a “partOf” attribute specifies the URI of the resource that contains the document. The service call, as in the “batch-indexing” case, is asynchronous, and the application can decide whether to listen for annotation results or not.

Internally, the indexing operation simply enqueues the URI of the document into the kernel-level agent service and, if required, maintains a reference to the calling application for notifying the indexing result. Such a reference is not a classical “pointer”; it is simply the endpoint of a web service (either SOAP or XML-RPC) to which the notification shall be propagated. This design choice keeps the indexing service as stateless as possible, thus avoiding the storage of explicit references to applications, with the related problems of session persistence, authentication, etc. Figure 6.2 shows the indexing operation interface, in pseudo-code, where the parameters denoted with an asterisk are optional.

void index(URI documentURI, URI applicationURI,

URI superDocumentURI*, Endpoint notificationEndpoint*)

Figure 6.2. Public interface of the “indexing” operation.

The batch-indexing operation is, as the name suggests, focused on indexing many resources at a time. This functionality is usually required when a large set of resources, e.g., a CMS article base, must be classified. The interface, in this case, is quite similar to the simple indexing operation; however, the parameters are stored in arrays, one for the documents to be indexed and one for the optional “container” resources. The internal implementation simply loads the queue of the kernel-level service with the whole bunch of documents to be indexed. In addition, an alternative interface is also provided for enabling indexing transactions, in which each simple indexing request is enqueued into a single batch-indexing request (Figure 6.3).

void batchIndex(URI documentURIs[ ], URI applicationURI,

URI superDocumentURIs[ ]*,

Endpoint notificationEndpoint*)

Or

beginBatch(applicationURI)

index(documentURI1, applicationURI,

superDocumentURI1*, notificationEndpoint*)

index(documentURI2, applicationURI,

superDocumentURI2*, notificationEndpoint*)

...

index(documentURIn, applicationURI,

superDocumentURIn*, notificationEndpoint*)

endBatch(applicationURI)

Figure 6.3. Public interface of the “batch-indexing” operation.

Typically, batch operations are accomplished by the same agent in the kernel-level indexing service, thus making it easy to keep track of the corresponding classification tasks. Unlike what happens in the simple indexing scenario, the possible notification refers to the whole ensemble of resources: an indexing failure does not indicate a failure for each resource in the ensemble, but necessarily implies that at least one document has not been classified due to errors. It is also important to notice that the transaction mode violates the goal of maintaining the service as stateless as possible; however, problems related to non-terminated transactions are easily handled by the maintenance service, which periodically purges pending notifications, transactions, etc. from the indexing service queue.

Finally, the indexing service also provides wrapped access to the kernel-level annotation repository, which allows semantics-aware external applications to directly store annotations into the H-DOSE knowledge base (Figure 6.4). However, care must be taken when adopting this method, since the provided annotations must be created using the same ontology that powers the platform. If this constraint is not respected, the platform generates an error and drops the implicated annotations in order to keep itself in a consistent state.

void store(URI documentURI, URI applicationURI,

URI superDocumentURI*, URI conceptURIS[ ],

Float weights[ ])

Figure 6.4. Public interface of the “annotation-store” operation.

Search service

As stated in the previous chapter, the H-DOSE search services are semantic, i.e. the relevance of results with respect to application queries is evaluated at a language-independent, conceptual level. This evaluation uses the methods explained in section 5.2 and provides, as output, a ranked set of resources in the form of a list of URIs. Two operations are defined: a “search by concept” and a “what's related” search.

In the “search by concept” functionality, the calling application must specify a list of relevant concepts, possibly accompanied by a corresponding set of weights, between 0 and 1, that specify the importance of each concept in the list. These two pieces of information are automatically combined into a conceptual spectrum, which is then expanded using the expansion operator defined in section 5.2.2 and implemented by the Expander service at the Kernel layer. Once expanded, the query spectrum is used as seed information for the Annotation repository, which retrieves all the document descriptions similar to the query spectrum. These descriptions are then ranked by the Search engine according to their similarity with the original query. It must be noted that descriptions are retrieved using an expanded query spectrum, which also allows selecting relevant resources that are not explicitly correlated with the initial query, while the final ranking is performed by evaluating the similarity of the document descriptions with the original query, thus filtering out the possibly wrong results due to the expansion process. In other words, the expansion process widens the potential recall of the search system, while the final ranking tries to keep the precision of the results as high as possible. Figure 6.5 reports the “search by concept” interface in pseudo-code.

URI[ ] search(URI conceptURIs[ ], Float conceptWeights[ ],

String resultLanguages[ ]*,int maxResults)

Figure 6.5. Public interface of the “search by concept” operation.

By looking at the interface pseudo-code, two more features can be identified. First, since the platform is actually multilingual, the provided results can be in any of the supported languages, even in more than one at a time; for example, an application can require results in both English and German. The second observation is that the query is fully language-independent: it is in fact composed of the URIs of concepts in the H-DOSE ontology. On the one hand, this implies that the application shall have a means of mapping between concepts and user-defined queries, for example by providing a directory-like search interface where each directory node corresponds to a concept URI. On the other hand, it forces the application to be at least partially aware of the ontology used by H-DOSE; otherwise the concept URIs cannot be correctly specified. In order to supply a more traditional access to the search service, a slight variation of the “search by concept” functionality is also available, where applications can provide keyword-based queries. In this case the keywords are converted into a conceptual spectrum by means of a sub-module of the indexing service (known as the Semantic Mapper). Then the search process continues as in the normal case, providing as final result a set of possibly relevant URIs (see Figure 6.6 for the interface specification).

URI[ ] search(String keywords[ ], String resultLanguages[ ]*,

int maxResults)

Figure 6.6. Public interface of the “search by keyword” operation.

The “what's related” search functionality is specifically designed to respond to the 6th functional requirement specified in section 4.1. It offers a very simple interface in which a document URI shall be specified, together with the maximum number of required results (Figure 6.7 shows the corresponding interface).


URI[ ] whatIsRelated(URI documentURI, int maxResults)

Figure 6.7. Public interface of the “what’s related” search operation.

of the document identified by the given URI. Then the retrieved results are rankedaccording to their similarity with the spectrum of the query document, provided bythe annotation repository, and returned to the calling application. If the number ofavailable results is bigger than the maximum number of required results, only thetop maxResults URIs are provided.

A last remark shall be made about the search engine service: the similarity function used by this service is exactly the one defined in section 5.2.4, while in the annotation repository similarity is evaluated in a very approximate form, privileging response time over precision of the evaluation.

6.1.2 Kernel Layer

The Kernel layer is where the most complex and computationally intensive tasks take place. Such tasks are logically subdivided into classification tasks and retrieval tasks, and the modules are separated according to these logical functions. In particular, the Expander service is mostly used in the retrieval process, while the Indexing sub-system, as can easily be noticed, is devoted to the classification process.

The Annotation Repository has a sort of dual nature, since it is used in both indexing and retrieval; it is in fact in charge of annotation management and persistence. This dual nature adds a further performance requirement to the module, which shall be fast enough to effectively handle requests coming from both the indexing and the search subsystems.

Expander service

The expander service implements, inside the platform, the spectrum expansion operator. It basically accepts as input a “raw”, or unexpanded, spectrum and performs the spectrum expansion using the H-DOSE ontology. The result is again a conceptual spectrum, which also takes into account the implicit knowledge encoded in the H-DOSE ontology (see Figure 6.8 for the pseudo-code of the expander interface).

Spectrum expand(Spectrum rawSpectrum)

Figure 6.8. Public interface of the expander service.
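
One plausible reading of the expansion step is sketched below: weights are propagated from each seed concept to its ontology neighbours, attenuated by the weight of the traversed relationship. The graph representation and the one-step attenuation rule are assumptions made for illustration only; the actual operator is the one defined in section 5.2.2.

import java.net.URI;
import java.util.*;

/** Illustrative one-step spectrum expansion over a weighted ontology graph. */
public class ExpanderSketch {

    /** Ontology edges: concept -> (related concept -> relationship weight). */
    private final Map<URI, Map<URI, Double>> ontology;

    public ExpanderSketch(Map<URI, Map<URI, Double>> ontology) {
        this.ontology = ontology;
    }

    public Map<URI, Double> expand(Map<URI, Double> rawSpectrum) {
        Map<URI, Double> expanded = new HashMap<>(rawSpectrum);
        for (Map.Entry<URI, Double> seed : rawSpectrum.entrySet()) {
            Map<URI, Double> neighbours =
                    ontology.getOrDefault(seed.getKey(), Map.of());
            for (Map.Entry<URI, Double> edge : neighbours.entrySet()) {
                // Propagate the seed weight, attenuated by the edge weight;
                // keep the maximum contribution if a concept is reached twice.
                double propagated = seed.getValue() * edge.getValue();
                expanded.merge(edge.getKey(), propagated, Math::max);
            }
        }
        return expanded;
    }
}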

The expansion operator can be profitably applied in the search phase, whatever interaction paradigm is used, except for the “what’s related” one.


When a user specifies a query, either by directly selecting concepts or by providing some “relevant” keywords, the resulting spectrum is usually composed of few concepts. In this scenario the expander can browse the ontology, following semantic relationships, and expand the query specification by adding relevant, related concepts to the initial specification. This operation potentially allows the system to retrieve resources that are interesting for the user but that, without expansion, would not have been retrieved, because they have no direct associations with the concepts originally composing the query.

In classification tasks, the expander may or may not be used. While in the automatic classification task the expansion can be useful, by explicitly accounting for the semantic relationships occurring in the platform conceptual domain and by possibly overcoming mis-classifications due to word ambiguity, in the manual classification task it could deteriorate the annotation quality by adding noise to the annotations entered by users who, in this case, are likely to be domain experts.

In the platform deployment adopted in this thesis, the expansion process takes place for both search and automatic indexing, but not for direct annotation.

Annotation Repository

The annotation repository module is solely responsible for annotation storage, retrieval and management. On one side it accepts storage requests and writes the received annotations into the platform database by means of a proper wrapper. On the other side, it listens for search requests and is able to provide as result the subset of the stored annotations containing the concepts included in the received query.

The Annotation repository is the most centralized module of the platform: it manages all the annotation storage and search requests. For this reason it shall be as fast as possible, especially in the retrieval phase, where response time is a critical factor for achieving user satisfaction. For the same reason it may constitute a bottleneck in the platform information flow.

This risk can easily be avoided by using service replication, i.e., by properly configuring the platform to work with more than one annotation repository. To perform this configuration, both the indexing sub-system and the service-level modules should be aware of the presence of more than one annotation repository and should hold references to the proper one. The correspondence between service consumers (i.e., the Service layer modules and the indexing sub-system) and service providers (the Annotation repository copies) shall be fixed at configuration time. Clearly, care must be taken when deploying the platform in order to configure H-DOSE correctly, choosing a good trade-off between the required performance and the available computational resources.

From a more operational point of view, the annotation repository exposes four different interfaces: a “store” method, a “retrieve” function, a “reverse search” operation and, finally, an “inverse map” procedure. The store method is usually called by the indexing sub-system (either at the service or at the kernel layer) and allows semantic annotations to be memorized into a persistent storage. Its signature is reported in Figure 6.9.

int store(Spectrum expandedSpectrum, Spectrum rawSpectrum,
          URI documentURI, URI applicationURI,
          URI superDocumentURI*)

Figure 6.9. Public interface of the store method of the annotation repository.

The retrieve function is, in a sense, the inverse of the store function: given a set of concepts, i.e., a spectrum, it provides as result a list of URIs of documents whose spectra are somewhat similar to the specified one (see Figure 6.10). The similarity is evaluated simply by checking the common non-null spectrum components and by ranking the results according to the number of co-occurring concepts (i.e., spectrum components).

URI[ ] retrieve(Spectrum querySpectrum, int maxResults)

Figure 6.10. Public interface of the retrieve method of the annotation repository.
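
A minimal sketch of this approximate, speed-oriented ranking is given below: documents are ordered by the bare count of concepts shared with the query, with no weight arithmetic, which is what keeps the repository fast compared with the full similarity of section 5.2.4. All names are illustrative.

import java.net.URI;
import java.util.*;

/** Sketch of the repository's fast, approximate ranking: count shared concepts. */
public class RetrieveSketch {

    static List<URI> retrieve(Map<URI, Double> querySpectrum,
                              Map<URI, Map<URI, Double>> store, // doc URI -> spectrum
                              int maxResults) {
        // Rank by the number of co-occurring (non-null) spectrum components;
        // documents with no shared concept at all are dropped.
        List<URI> docs = new ArrayList<>(store.keySet());
        docs.removeIf(d -> sharedConcepts(querySpectrum, store.get(d)) == 0);
        docs.sort(Comparator.comparingLong(
                (URI d) -> sharedConcepts(querySpectrum, store.get(d))).reversed());
        return docs.subList(0, Math.min(maxResults, docs.size()));
    }

    static long sharedConcepts(Map<URI, Double> a, Map<URI, Double> b) {
        return a.keySet().stream().filter(b::containsKey).count();
    }
}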

What’s related searches in H-DOSE are particularly easy to perform thanks to the availability of the reverse search functionality in the annotation repository. Such an operation accepts as input a document URI and provides as result a list of URIs of documents having similar spectra. Internally, the annotation repository works as follows: first the conceptual spectrum corresponding to the input URI is obtained; then a normal retrieve operation is performed to extract the final result. It must be stressed that for the reverse search, as for the retrieve function described above, expanded spectra are used, so as to maximize the chances of finding relevant resources. Figure 6.11 shows the reverse search signature.

URI[ ] reverseSearch(URI documentURI, int maxResults)

Figure 6.11. Public interface of the reverse search method of the annotation repository.
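
Building on the retrieve sketch above, the reverse search reduces to a composition of the two operations just described: an inverse map from the document URI to its stored spectrum, followed by an ordinary retrieve. The removal of the self-match is an assumption; the chapter does not state whether the query document itself is filtered from the results.

import java.net.URI;
import java.util.List;
import java.util.Map;

/** Sketch: reverse search composed from inverseMap followed by a normal retrieve. */
public class ReverseSearchSketch {

    static List<URI> reverseSearch(URI documentURI,
                                   Map<URI, Map<URI, Double>> store,
                                   int maxResults) {
        // 1. inverseMap: recover the stored (expanded) spectrum of the document.
        Map<URI, Double> spectrum = store.get(documentURI);
        if (spectrum == null) return List.of();

        // 2. Use that spectrum as the query of an ordinary retrieve.
        List<URI> similar = RetrieveSketch.retrieve(spectrum, store, maxResults + 1);

        // The query document matches itself perfectly; drop it from the result.
        similar.remove(documentURI);
        return similar.subList(0, Math.min(maxResults, similar.size()));
    }
}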

The inverse map functionality concludes this section. It basically provides a means for querying the persistent storage in order to retrieve the conceptual description of an indexed document. It therefore takes as input a document URI and provides as result a conceptual spectrum (Figure 6.12).


Spectrum inverseMap(URI documentURI)

Figure 6.12. Public interface of the inverse map method of the annotation repository.

Indexing sub-system

The indexing sub-system is one of the most complex modules of the H-DOSE platform. For this reason a black-box description is provided first, and only afterwards is a more detailed system view given, discussing each element of the box.

As stated by its name, the indexing sub-system provides the functionalities required for the automatic indexing of textual resources. Two methods are offered to external applications, which can either be the service-layer indexing module or other modules that do not belong to the platform. These methods are, respectively, for queuing resources to be indexed and for getting results; the latter is actually a service callback registration.

Whenever a module needs to perform the automatic indexing of a resource, the index method of the indexing subsystem is invoked. This method is available in two variants: a single index function and a group index function. The two only differ in the amount of data handled at a time; the latter, in particular, is used to index a considerable group of textual resources in a unique, atomic operation. The signatures of both methods are reported in Figure 6.13.

void singleIndex(URI documentURI, URI applicationURI,
                 URI superDocumentURI*)

or

void groupIndex(URI documentURIs[ ], URI applicationURI,
                URI superDocumentURIs[ ]*)

Figure 6.13. Public interface of the “index” operation, at the kernel-level.

The indexing operation can succeed, in which case the resource conceptual descriptions are stored into the H-DOSE persistent storage, or it can fail. In both cases, calling applications may want a notification of the automatic annotation result. To obtain this notification message, they shall register themselves with the indexing sub-system. In the H-DOSE platform this task is performed automatically for the indexing service at the service layer, which is registered by default with the kernel-level service. However, if an external, semantics-aware application needs to access the kernel-level indexing service directly, it must register itself with the service using the register notify method (Figure 6.14).


void registerNotify(Endpoint notifyEndpoint)

Figure 6.14. Public interface of the register notify method of the kernel-level indexing sub-system.
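
The register notify mechanism is essentially an observer registration. The sketch below (with a hypothetical Endpoint callback type; the real endpoints are web-service references) shows how registered callers might be told whether the annotations were stored or the indexing failed.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Sketch of the callback registration behind registerNotify. */
public class IndexingNotifierSketch {

    /** Hypothetical stand-in for a web-service endpoint reference. */
    public interface Endpoint {
        void notifyResult(String documentURI, boolean stored);
    }

    private final List<Endpoint> listeners = new CopyOnWriteArrayList<>();

    public void registerNotify(Endpoint notifyEndpoint) {
        listeners.add(notifyEndpoint);
    }

    /** Called by the indexing sub-system once an indexing task terminates. */
    void fireResult(String documentURI, boolean stored) {
        for (Endpoint e : listeners) {
            e.notifyResult(documentURI, stored); // "annotations stored" or "failed"
        }
    }
}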

Indexing sub-system insights

The indexing sub-system is implemented as a multi-agent system composed of both resident and mobile agents. Resident agents either provide coordination services for the mobile ones, as done by the agency manager, or implement services that shall remain centralized, as happens for the semantic mapper and the synset manager. Mobile agents, instead, are directly involved in the active characterization of textual resources.

As already discussed, the process of extracting knowledge from textual resources, i.e., of semantic classification, is a quite complex and computationally intensive process. On one side it requires the ability to distribute the corresponding computational load as much as possible, hence the adoption of mobile agents; on the other side it requires a well defined organization of the indexing work in order to maximize efficiency. Both concerns can be addressed by properly designing the mobile part of the indexing sub-system. In H-DOSE this part is composed of colonies of agents designed to autonomously accomplish the entire process, from text analysis to spectrum generation and storage. Such colonies are called Indexing squads and can be migrated to whatever device offers a suitable container for the agents to live and operate in.

Migration, nevertheless, shall not be performed randomly; instead it shall take into account the location of the information sources, in order to reduce as much as possible the distance between the data and the processing code. The underlying idea is therefore to move the indexing code toward the data instead of doing the opposite. Such an operation pays for an increase in complexity with a larger decrease in the amount of information that transits over the web between the sources and the platform. In an intra-net setting this design choice can be questionable; however, it is the only way to ensure scalability of the platform in “wild web” environments. Figure 6.15 shows the composition of the indexing sub-system and provides a sort of “zoomed” visualization of an Indexing squad.

Indexing squads are always composed of 3 agents: the media/language detector, the filter agent and the annotator agent. Sometimes the squad is complemented by a custom agent that allows direct access to a given site database. This agent is called the deep search agent.

Figure 6.15. The H-DOSE indexing sub-system.

A normal indexing task (i.e., without the deep search agent being involved) works as follows. Every time the agency manager receives an indexing request, it checks the documentURI(s) to extract the base site from which the pages are to be indexed. If an indexing squad is already deployed on the server publishing the site, the new indexing task is forwarded to the squad residing on the remote machine. Otherwise a discovery process starts, trying to find out whether the remote host offers an accessible agent container. If a container is available and access to it can be gained, the agency manager packs a new indexing squad and migrates the agents, together with the indexing request, to the newly found container. If, instead, no containers are available, the agency manager backs up the request on a set of “friend” machines. The “friend” machines differ from normal container providers in that they not only allow the deployment of new agents on the container, but also allow agents to perform tasks involving other machines on the Web. A typical example of a friend machine is the host on which the centralized part of the indexing subsystem runs.
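
The following sketch condenses this dispatch policy into a few lines of Java; the class names and the stubbed container probe are illustrative, not the platform's actual code.

import java.net.URI;
import java.util.*;

/** Sketch of the agency manager's dispatch policy; all names are illustrative. */
public class DispatchSketch {

    interface IndexingSquad { void index(URI documentURI); }

    private final Map<String, IndexingSquad> deployedSquads = new HashMap<>();
    private final Deque<String> friendHosts =
            new ArrayDeque<>(List.of("friend.example.org")); // hypothetical host

    void dispatch(URI documentURI) {
        String site = documentURI.getHost();
        IndexingSquad squad = deployedSquads.get(site);

        if (squad == null && hasAgentContainer(site)) {
            // No squad yet, but the remote host offers a container:
            // pack a new squad and migrate it next to the data.
            squad = migrateNewSquad(site);
            deployedSquads.put(site, squad);
        }
        if (squad == null) {
            // No container either: back the request up on a "friend" machine,
            // whose agents may reach out to other hosts on the Web.
            squad = migrateNewSquad(friendHosts.peekFirst());
            deployedSquads.put(site, squad);
        }
        squad.index(documentURI);
    }

    private boolean hasAgentContainer(String host) {
        return false; // stub for the container discovery probe
    }

    private IndexingSquad migrateNewSquad(String host) {
        return uri -> System.out.println("indexing " + uri + " from " + host);
    }
}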


At the end of the “migration phase”, the URI(s) of the resources to be indexed are received by the filter agent of an appropriate indexing squad. This agent first contacts the media detector agent to determine both the language in which the text is written and the format (HTML, XHTML, plain text) of the information. With this data it can configure its internal parser and code filter, and can then extract the plain textual information required by the annotation agent. The output of the filtering process depends on the kind of annotation technique adopted by the annotator agent. As an example, a simple bag-of-words annotator may require all the word stems in the document, while an SVM-based classifier may require the set of distinct words occurring in the textual resource, accompanied by a tf · idf weight that specifies their ability to discriminate the document from the others already indexed. Whatever design choice is taken, the filter agent and the annotator agent in a squad shall be designed to work in symbiosis, with the same expected formats and elaborations.
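
For reference, a minimal sketch of the tf · idf weighting mentioned above is shown below, using the common log-scaled formulation; the exact variant used by the annotator agent is not specified here, so this particular formula is an assumption.

import java.util.*;

/** Minimal tf·idf sketch over a small in-memory corpus. */
public class TfIdfSketch {

    /** tf·idf of a term in one document, given the whole corpus. */
    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        double tf = Collections.frequency(doc, term) / (double) doc.size();
        long docsWithTerm = corpus.stream().filter(d -> d.contains(term)).count();
        double idf = Math.log((double) corpus.size() / (1 + docsWithTerm));
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = List.of(
                List.of("disability", "aids", "law"),
                List.of("law", "norms", "turin"),
                List.of("semantic", "web", "ontology"));
        System.out.println(tfIdf("law", corpus.get(0), corpus));
    }
}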

In the end, the annotator agent takes as input the filtered data and builds up a conceptual spectrum. Such a spectrum is then stored in the H-DOSE knowledge base through a call to the store method of the annotation repository. Finally, the annotator notifies the indexing result (annotations stored or annotation failed) to the agency manager, which then propagates the same information to the applications registered with the kernel-level indexing service.

The deep search agent is involved when a particular agreement comes into being between the people running H-DOSE and an organization managing a given site. In such a case the organization might allow restricted access to its database (DB) tables through a certain interface defined in the agreement. A special-purpose agent can therefore be designed and deployed on the machine running the site. This special agent, the deep search agent, has the access rights and capabilities required to directly query the site database, extracting not only the published information but also the metadata associated with the table structure of the DB. The deep search agent can be very useful for highly dynamic sites where information changes very quickly and semantic search services must be provided that are always up-to-date. Under these conditions, this agent can be used to constantly track the site database, detecting changes and triggering new indexing cycles whenever needed.

6.1.3 Data-access layer

The data-access layer is deployed at the deepest level of the H-DOSE architecture and includes all the utility classes used for accessing and manipulating ontologies, for managing the platform persistence on different database servers and for locating resources. Although the technology adopted at this level is not particularly new, and the contained innovation is rather low, this layer is a critical component of the platform. In fact it performs all those tasks that require little intelligence but that ensure the platform operations by giving access to the business objects, i.e., the ontology, the resources and the storage of classification data (annotations). Three main modules are deployed at this level: the ontology wrapper, which provides programmatic access to ontologies written in RDF/S, DAML+OIL or OWL; the annotation database, which defines a set of high-level primitives for the persistent storage of conceptual spectra; and a document handling sub-system, which manages document-related issues such as fragmentation2, pre-processing, etc.

2Although fragmentation is supported by the platform design, in the current setting documents are only manipulated as wholes. Fragment-level elaboration will be available in the next platform versions.

6.1.4 Management and maintenance sub-system

The management and maintenance sub-system of the H-DOSE platform has the duty of constantly monitoring the platform status and the responsibility of taking proper actions whenever failures occur. Rather than being a complete fault-management system, in its current configuration it is mainly focused on resolving semantics-related issues and problems. In other words, although reliability and recoverability features are critical for the platform deployment in real-world scenarios, the current solution is only designed to keep the annotation base coherent and constantly up-to-date. To accomplish this task, however, innovative agent-based techniques are used, stemming from the Autonomic Systems research field.

The basic assumption on which this subsystem works is that the core asset of a semantic platform is the richness of the database in which the semantic annotations are stored. The quality of the search results, in fact, directly depends on the amount and the quality of the information stored in this semantic index (the Annotation Repository).

H-DOSE has the ability, when needed, to automatically keep the annotation base up-to-date by autonomously performing searches on the Web. This type of autonomic maintenance has proved to be quite controversial, as emerged from many discussions that took place at the international conferences where the platform has been presented. Personal opinions aside, having a semantic platform that autonomously performs searches on the Web and adds “non-verified” data to its knowledge base can actually be an unwanted behavior, especially for trust-based or mission-critical sites where all the available information must be certified by well-known entities. In order to account for these requirements on one side and, at the same time, to support the author’s research interests as well as the exigences of more collaborative sites, the maintenance sub-system has been designed so that it can be activated or deactivated in the platform configuration phase.

Supposing that the autonomic features have been activated, they work following two self-management paradigms: “uniform topic coverage” and “user triggered knowledge integration”. The “uniform topic coverage” paradigm aims at maintaining a certain degree of uniformity in the topic coverage of the repository. This implies an automatic triggering of focused indexing processes whenever some topics have a low number of annotations. However, this paradigm can only improve the coverage of knowledge that is already known, since the Annotation Repository only knows about the existence of concepts for which at least one annotation exists. Covering new topics requires enrichment processes acting at a higher logical level, where the ontology is known.

A transparent triggering of the coverage of non-annotated concepts can be achieved by monitoring user requests at the Search Engine and by detecting the knowledge areas that are modeled by the ontology, i.e., for which the system is able to provide a conceptual description, but that are uncovered in the repository due to the lack of annotations. This second mechanism is called “user triggered knowledge integration”.

Uniform topic coverage has been implemented in two ways. The first is basically a search for the minimum occurrence of topics in the set of stored annotations: all the covered topics in the repository are ordered by annotation occurrence, the lowest ten percent is selected as the “low covered set” and provided to a newly designed enrichment agent, which is charged with triggering the indexing of new, suitable resources. The second way involves some statistical considerations: basic indexes are computed for evaluating the statistical properties of the topic coverage in the repository, such as the mean occurrence value, the variance and the standard deviation. After the evaluation of these figures, a threshold-based algorithm selects all the topics whose occurrence lies below the mean occurrence value by more than a given fraction of the standard deviation, and triggers an enrichment cycle. The threshold value has been selected manually by performing different experiments, and it strongly depends on the shape of the topic occurrence distribution, which is near-uniform only if the amount of stored annotations is reasonably high.
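
Both selection strategies can be sketched in a few lines of Java. Reading the threshold as the mean minus a fraction k of the standard deviation is the natural interpretation of the text; k is left as a parameter, since its experimentally chosen value is not reported.

import java.net.URI;
import java.util.*;

/** Sketch of the two "uniform topic coverage" selection strategies. */
public class CoverageSketch {

    /** Strategy 1: the bottom ten percent of topics by annotation count. */
    static List<URI> lowestTenPercent(Map<URI, Integer> occurrence) {
        List<URI> topics = new ArrayList<>(occurrence.keySet());
        topics.sort(Comparator.comparingInt(occurrence::get));
        int cut = Math.max(1, topics.size() / 10);
        return topics.subList(0, cut);
    }

    /** Strategy 2: topics below mean - k * stddev (k chosen experimentally). */
    static List<URI> belowThreshold(Map<URI, Integer> occurrence, double k) {
        double mean = occurrence.values().stream()
                .mapToInt(Integer::intValue).average().orElse(0);
        double variance = occurrence.values().stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
        double threshold = mean - k * Math.sqrt(variance);

        List<URI> low = new ArrayList<>();
        for (Map.Entry<URI, Integer> e : occurrence.entrySet())
            if (e.getValue() < threshold) low.add(e.getKey());
        return low;
    }
}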

Statistically speaking, this technique tends to transform the topic coverage distribution from an unknown, non-uniform one into a uniform distribution in which all topics have the same occurrence value. To implement “user triggered knowledge integration”, the Search Engine service is constantly monitored by the maintenance sub-system, which tries to dynamically follow changes in user habits and interests and to discover modeled topics currently uncovered by annotations. When a query is issued to the search service, the results are monitored in terms of relevance. If the retrieved resources have relevance weights under a given threshold or, worse, if no resources can be provided as result due to a lack in the Annotation Repository, the enrichment agent, located in the maintenance sub-system, forces a new indexing cycle focused on the uncovered conceptual areas.

The self-monitoring and self-optimization functions in the core services of H-DOSE required the design and development of a new agent, called the Enrichment Agent, for managing the intelligent update of the Annotation Repository. This agent shall be able to discover new information for semantic indexing and to understand the specification of a conceptual domain, consequently focusing the resource selection in order to reach a satisfying annotation coverage of that area. In other words, the enrichment agent accepts as input a set of concepts, performs some internal operations that may involve collaboration with other agents, and provides as output a set of URIs that, when indexed, should generate annotations covering the topics received as input.

Figure 6.16. The sequence diagram of autonomic features in H-DOSE.

In normal operating conditions the enrichment agent stays idle, monitoring the behavior of services such as the Annotation Repository and the Search Engine. If a critical situation is detected, the agent extracts from the services a list of concepts whose coverage should be enhanced and contacts the Synset Manager agent, in the indexing sub-system, to find the lexical entities associated with such topics. It subsequently composes textual queries to be issued to classical, text-based search engines. Once the textual queries have been composed, two concurrent processes start: one interacts with the Agency Manager in order to trigger incremental indexing on already known sites, and the second interacts with search web services (the Google web API [9], as an example) in order to retrieve a list of possibly relevant URIs. At the end of these processes the Enrichment Agent performs some filtering on the retrieved URIs, discarding resources it cannot understand, such as “pdf” and “doc” files: in the current setting H-DOSE can in fact only support HTML, XHTML and plain text. After the filtering process, the agent composes a list of resources to be indexed, identified by URIs, and sends an indexing request to the indexing sub-system, which subsequently performs semantic annotation and updates the Annotation Repository; Figure 6.16 shows the corresponding sequence diagram.

6.2 Application scenarios

This section provides some examples of how the platform works when indexing or search requests are received. The well-known representations of sequence diagrams and interaction diagrams are adopted to explain the various operations involved in the two working scenarios.

6.2.1 Indexing

The indexing scenario is characterized by many interesting aspects because it involves several advanced techniques such as focused semantic indexing, code mobility, deep search and collaboration. Whenever a web resource or a set of resources must be indexed by the platform, the indexing process starts, performing several steps that end in the addition of new knowledge to the platform KB (Figure 6.17).

First, when a set of resources, identified by their URIs, is scheduled for indexing, the agency manager agent divides the URIs by location. All the resources published on the same Internet site become sets to be indexed by possibly different indexing squads. After this “by site” separation, the manager agent checks, for each location, whether an existing indexing squad is available. If the check result is positive, the corresponding set of resources is passed to the remote squad for “in site” indexing; if, instead, the location does not hold an indexing squad but offers a suitable agent container, a new squad is created and sent over the Web toward such a location. In the case of sites not offering agent containers, squads are migrated to a set of “friend” hosts in order to balance the platform workload.

Once each indexing squad has been launched, the agents composing the squad start collaborating in order to perform the semantic classification of the required resources: if the web site under classification uses a known site-wise search engine, the squads directly interface with such an engine by means of a “deep search” agent, accessing the information stored in the site database and thus providing classification of resources lying in the so-called deep web. Otherwise, normal indexing is performed.

The results of the semantic classification process are conceptual spectra, which are sent back to the agency manager; the latter, in turn, calls the Annotation Repository service to persistently store the extracted semantic information.


Figure 6.17. Interaction diagram for the indexing task.

At present, the indexing squads are able to classify every resource whose conceptual domain overlaps the conceptual domain of the H-DOSE ontology. However, some improvements can be foreseen. For example, it is possible to design a new indexing squad able to perform so-called “focused indexing”. In focused indexing, a site is first semantically analyzed, identifying to which part of the H-DOSE ontology its resources are relevant; then a light-weight indexing squad, able to work only on the identified ontology subset, is migrated to that site. The advantage of this operation is to reduce the computational load imposed on the host machines by the indexing process, and to reduce the amount of information that shall transit over the network between the agency manager and the indexing squads. This advantage is balanced by a sensible increase in the complexity of managing the site semantic characterization and the focused squads, and of designing effectively cooperating agent colonies that avoid code replication on remote sites.

Figure 6.18. Sequence diagram of the indexing task.

6.2.2 Search

The search scenario (Figures 6.19, 6.20) is organized as follows: an external application requests a search on the platform knowledge base by interfacing with the Search web service and specifying a proper set of concepts. The Search service first contacts the Expander service to obtain an extended version of the query spectrum. Then it starts communicating with the Annotation Repository service to retrieve relevant annotations (i.e., document spectra). The Annotation Repository takes the query spectrum given by the Search service and searches the database in which the annotations are stored to find relevant matches. If there are suitable resources, i.e., resources annotated as relevant with respect to the concepts occurring in the query conceptual spectrum, the Annotation Repository service provides the set of retrieved annotations to the Search service. Otherwise, if no resources are available, the Annotation Storage service throws an error that is caught by the Search service.

In the former case, i.e., when resources are available, the Search service checks the similarity of the retrieved spectra with the original user query (not the expanded one). Then it ranks the results according to the just-computed similarity values and returns to the caller application a list of URIs. The list is as long as specified by the application in the search request.

In the second case, when resources are not available, the Search service manages the error by providing an empty list to the caller application. At the same time, it notifies the maintenance and management sub-system, and in particular the enrichment agents, that some concepts of the H-DOSE ontology are not covered by annotations. This notification, in turn, triggers the autonomic features of the platform, which start a new indexing cycle that possibly “repairs” the source of the search error. Figure 6.16 shows how the H-DOSE autonomic features achieve this result.


Figure 6.19. The search interaction diagram.

Figure 6.20. The sequence diagram of the search task.


6.3 Implementation issues

The H-DOSE platform has been fully implemented in Java, both for the web services and for the collaborative agents; the latter have been developed by adopting the JADE framework [16].

The JADE software framework has been developed by TILAB (formerly CSELT) and supports the implementation of multi-agent systems through a middleware that claims to be fully compliant with the FIPA specifications [17], in order to be able to inter-operate with other FIPA-compliant systems such as Zeus (British Telecom) [18] and FIPA-OS [19]. JADE agents are implemented as Java threads and live within Agent Containers that provide the runtime support for agent execution and message handling. Containers can be connected via RMI and can be both local and remote; the main container is associated with the RMI registry. Agent activities are modeled through different “behaviours” and their execution is managed by an agent-internal scheduler.
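
For flavour, a minimal JADE agent with a single cyclic behaviour is sketched below; it only illustrates the programming style just described (threaded agents, behaviours, ACL message handling) and is not one of the actual H-DOSE agents.

import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;

/** Minimal JADE agent sketch: one cyclic behaviour handling ACL messages. */
public class SketchAgent extends Agent {

    @Override
    protected void setup() {
        System.out.println(getLocalName() + " ready.");
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage msg = receive();          // non-blocking read
                if (msg != null) {
                    // e.g., an indexing request forwarded by the agency manager
                    System.out.println("received: " + msg.getContent());
                } else {
                    block();                         // sleep until a message arrives
                }
            }
        });
    }
}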

The execution environment in which the Web Services are deployed and run is the Apache Tomcat servlet container [20], complemented by the Apache Axis SOAP engine [21]. Both are developed in an open and participatory environment as part of the Jakarta project [22] and are released under the Apache Software License. The description language used to publish the Web Services is standard WSDL, which is automatically supported by the Axis SOAP engine.

H-DOSE modules are pure-Java services whose newly developed functionalities are released under the LGPL license. Many libraries are used by the platform services:

• the ontology wrapper uses the HP Jena library [23] for accessing RDF/S [4], DAML and OWL [5] ontologies;

• the persistent storage module uses the PostgreSQL JDBC driver [24] for interfacing with a PostgreSQL database server in which spectra are stored;

• the document-handling service, which is currently under development, uses the Saxon API [25] and the JTidy API [26][27] for performing the pre-processing of textual resources;

• the expander service is powered by the JGraphT API [28], which handles all the issues related to modeling the H-DOSE ontology as a directed, weighted graph;

• the Sun JAX-RPC [29] library is used by all web services to implement communication methods, while all the platform agents use the JADE library to implement their behaviours and communication methods.


H-DOSE is currently at its second beta release, namely “hdose2.1”, and is publicly available on the sourceforge.net3 site. The platform is a constantly evolving open source project: at present, two main streams of research are being pursued. The first is designing the evolution of the H-DOSE platform for multimedia applications, while the second analyzes the requirements and modifications needed for the platform to be applied in the field of competitive intelligence. Some third-party organizations have already expressed their interest in experimenting with the platform; in particular, the IntelliSemantic company is actively working on applying the platform to semantics-aware search systems in the field of patent management and discovery.

3Sourceforge is one of the most famous open source software repositories available on the web; it can be reached at http://www.sourceforge.net. The H-DOSE web site, on sourceforge, can be found at http://dose.sourceforge.net


Chapter 7

Case studies

This chapter presents the case studies that constituted the benchmark of the H-DOSE platform. Each case study is addressed separately, starting from a brief description of the requirements and going through the integration design process, the deployment of the H-DOSE platform and the phase of results gathering and analysis.

As part of the work presented in this thesis, the H-DOSE platform has been used to add semantic search functionalities to 4 already deployed web applications. The first is a legacy publication system written in PHP and owned by the Passepartout service of the city of Turin. The second is a well-known and widely adopted e-Learning environment: Moodle [1]. The third is the case-based e-learning environment developed in the context of the European project CABLE [3], and the last is a transparent search engine built on top of the Muffin intelligent proxy [2]. The following sections address the integration of semantic functionalities into these applications using H-DOSE as the semantic back end.

7.1 The Passepartout case study

The Passepartout service of the city of Turin is a service of Turin’s municipality that provides information about disability aids, norms and laws in the Turin metropolitan area. To accomplish its informative mission, Passepartout publishes and maintains a web site in which disabled and elderly people can find relevant information, allowing them to effectively interact with the public health institutions and to access the facilitations and services they are entitled to.

The Passepartout informative system is based on the principles of distributed redaction and separates the process of content editing, done by journalists, from the review and publication processes, done by redactors. The whole work flow, starting from the content creation to the final publication on the web, is managed and supported by a legacy system written in PHP.

In the context of a collaboration between the author’s research group and the Passepartout service, this publication system has been integrated with semantic functionalities by means of the H-DOSE platform. Currently the semantic functionalities supported by the Passepartout web site include the manual classification of published documents, the search by category and the semantic what’s related (requirements 4, 5, 6 and 8). The integration has been deployed as follows: first, the legacy system has been extended to support the exchange of SOAP messages with the semantic platform. A previously built module for SOAP communication has been included in the system libraries, and a new PHP module for managing the communication with the semantic platform has been developed. This last module is basically composed of a set of function wrappers for the services offered by H-DOSE and adds, where needed, the intelligence required to perform more complex tasks.

Secondly, the template for publishing pages has been modified to include the semantic what’s related functionality: each published page has been integrated with a link pointing to the semantically related pages. When a user clicks that link, the integrated system interacts with H-DOSE through the what’s related search function, which takes as input a resource URI and provides as result a set of ten other resources that are conceptually similar to the starting one.

As a third step, a semantically populated directory has been built in PHP; such a directory is quite similar to classical directory systems such as Yahoo’s or the dmoz.org open directory. However, resources belong to categories depending on their conceptual description; as a consequence they can occur, at the same time, in different category branches. Categorization is, in fact, directly related to the ontology that defines the knowledge domain in which the system works: disability, in this specific case.

From the implementation point of view, the category tree for this test case has been built off-line and then included in the Passepartout system. This choice is only related to performance issues: building the same tree at runtime means that every time the category is required, a full navigation of the ontology shall be performed. Since the ontology is usually big (even in this small scenario it includes more than 80 concepts and 20 different relationships), the page composition time quickly grows too high and does not satisfy usability criteria.

Eventually, the Passepartout publication system has been modified to support the conceptual classification of pages. This last modification has little or no impact on the publication process. As the system originally included a site-wise search engine based on manually specified keywords, the publication interface can be fully retained, with the keyword choice now limited to the set of ontology concepts. This simple solution keeps the interference of the introduction of semantics with the usual work flow negligible, thus reducing the time required for the integration of the new functionalities and, as a consequence, reducing possible complaints about the system’s complexity and usability. Figure 7.1 shows the final deployment of the integrated system.

Figure 7.1. The passepartout system.

7.1.1 Results

In order to extract relevant information about the effectiveness of the proposed approach to semantics integration, a full test plan has been set up that involves three different test groups. The groups are devoted, respectively, to the evaluation of the conceptual representation of the Passepartout domain and of the semantic functions newly introduced into the publication system; to the evaluation of the performance of the semantic modules in terms of standard measures such as precision, recall and F-measure; and to the evaluation of the effort required to use the new concept-based interfaces, i.e., the usability evaluation.

The first tranche of tests has been called “ontological tests” and involves three different sessions. In the first session the ontology, developed in collaboration with Passepartout (about assistive services for disabled people; 80 concepts and 20 different semantic relationships), is tested for completeness, i.e., for its ability to cover the Passepartout conceptual domain. In this test at least six different redactors are required to conceptually classify a set of ten pseudo-randomly selected pages; a minimal amount of overlap between the pages tested by different operators is guaranteed in order to obtain suitable and comparable results.


For each page the classification is captured on paper sheets and the redactors’ opinions are collected too. The aim of the test is to analyze whether or not the ontology is complete enough to cover the contents published by the Passepartout service.

The second session aims at identifying inconsistencies in the ontology model by performing focused classification tests, i.e., the classification of resources belonging to the same conceptual area at different granularities. As in the first session, at least six redactors perform the test and are required to provide both the page classifications and opinions about the ability of the ontology to describe the resource contents.

The second session differs from the first one in aims and scope: while the first session evaluates whether, given a uniform distribution of contents, the resulting classification is also uniformly distributed, the second session evaluates whether, given a general topic and a uniform distribution of content granularity, the resulting classification is uniformly distributed in depth, i.e., whether the classification results are uniformly distributed among the nodes descending from the given general topic in the ontology hierarchy.

The last session has the objective of identifying the ontology areas that are poorly modeled, i.e., the conceptual areas in which there are multiple collisions of resource classifications. By collision we mean the coexistence of several annotations between indexed resources and a given concept. If the number of collisions significantly exceeds the ontology mean value, it is likely that the ontology has a modeling problem for the concept under examination, for example because multiple, different concepts have been modeled as a single one.

The second group of tests is aimed at evaluating the effectiveness of the semantic search functionalities in terms of precision (p) and recall (r) and of their combination in the F-measure, F = 2·r·p / (r + p).

As in the previous case, three sessions are involved, referring respectively to the “what’s related” function for the first session and to the directory search for the last two sessions. In the semantic “what’s related” test, the semantic system is checked for both search and classification effectiveness: first, the degree of similarity between the conceptual descriptions of the starting page and of the retrieved pages is evaluated; secondly, the correctness of the page annotations is evaluated.

In this test the operators involved are domain experts, the redactors of the Passepartout site; therefore, although the evaluation still remains subjective, it is supposed to be significant. The test is deployed as follows: each operator is provided with a set of 5 pages extracted pseudo-randomly from a test set composed of 20 different pages. He/She is required to browse the Passepartout site, to reach the pages in the set and to select, for each of them, the “related pages” link that activates the semantic “what’s related” functionality. For each page the operator shall compare the starting page and the retrieved pages. The similarity between pages shall then be evaluated and reported on a proper test sheet. On the same test sheet, the operator is also required to report his evaluation of the annotation correctness, i.e., to report whether pages are correctly classified according to his knowledge of the domain.

The directory search test is further subdivided into two steps. In the first step, the ability of the system to retrieve all the relevant information, and nothing else, is tested. This case represents a quite unusual operation scenario that is mainly useful for testing the platform functionality: the user can only select a concept in the site directory and the system is required to retrieve all the resources indexed as relevant to that concept (in this test the expander module of the H-DOSE platform is disabled).

Theoretically, assuming that the classification process has been performed well, all relevant resources should be retrieved, and only those; in terms of precision and recall, both values should be 100%. However, the classification process is never perfect; therefore the retrieved results will have lower values of precision and recall, and these values will be the nearer to the maximum the better the manual classification has been performed.

On the other side, assuming that the classification is “perfect”, since it is provided by domain experts, this test allows the detection of possible problems in the platform operations. Practically, the test involves 10 operators, each required to perform 3 predefined searches (different for each operator). The search results are reported by the test operators on proper test sheets and the collected results are then elaborated and organized.

In the second step, the directory search is tested at its full potential: the test operators are allowed to select as many concepts in the directory as they like. Such an initial concept specification is then expanded by the H-DOSE platform using the implicit knowledge modeled by the platform ontology; finally, resources are compared to the resulting conceptual specification, ranked, and provided back to the user.

The test involves ten different operators, each of whom is asked to perform a given search task, described by means of a goal statement, and to evaluate the relevance of the retrieved pages. Recall, precision and the F-measure are evaluated on the basis of the evaluation results collected from all the operators.

Finally, the last group of tests is designed to evaluate the additional effort required of content editors (journalists) and redactors to include the semantic information required for the site operations. This evaluation has been performed as a set of interviews with the Passepartout site crew, which has been required to use the integrated system for a month. It is important to notice that the usual publication load is about four new pages published per day per redactor, which means a total of around 500 operations in a month.

Unfortunately, as is easily noticeable from the absence of tables and graphs, the envisioned time frame for the evaluations has been widely overrun and, at present, the first test phase is not yet completed. The other phases are still to be started.


These problems are not related to the platform deployment; in that sense some preliminary results have been collected, which seem to indicate a relatively simple integration. Instead, the main problems are related to communication issues between the research group and the Passepartout crew and to the coincidence of several events (such as the 2006 Winter Olympic Games, held in Turin) that prevented the timely execution of the test program. Nevertheless, results are slowly being collected and, once complete, they will be submitted, as a paper, to an international journal on semantic web technologies and applications.

7.2 The Moodle case study

Moodle is a course management system (CMS): “a free, open source software package designed using sound pedagogical principles, to help educators create effective on-line learning communities”. The design and development of Moodle is guided by a particular philosophy of learning, a way of thinking usually referred to, in shorthand, as a “social constructionist pedagogy”. The courses provided using Moodle as the e-Learning environment can be about whatever subject a teacher deems valuable and interesting for the students enrolled in the course. However, the availability of a semantic classification of the course contents is really helpful both for teachers, for example by providing support for the automatic definition of different learning paths, personalized for each student according to the learning tasks already accomplished, and for students, who can easily access the courses about the topics they are interested in.

This motivation has fostered the application of the H-DOSE approach to semantics integration to the Moodle CMS, with the intent of demonstrating, on one side, the general applicability of the proposed solutions and, on the other side, the added value that semantics integration can bring to an e-Learning environment such as Moodle. The integration process was deployed as described in the following paragraphs.

First, the design principles lying at the basis of the Moodle system, and especially those defining the way Moodle handles course information, have been analyzed in order to understand how semantics can be integrated into the environment. As a consequence, a new PHP page has been introduced into the system, containing all the relevant queries to the Moodle database for extracting the course contents to be subsequently classified using the H-DOSE platform.

Then a new Moodle module allowing the automatic classification of courses has been introduced, following the guidelines provided by the Moodle authors for the development of new modules. Such a module allows, by simply clicking a button, the indexing of all the resources available for a given course with respect to a given knowledge domain. The knowledge domain is the one defined by the ontology used by the H-DOSE platform for semantic classification.

When a course teacher selects the “semantic classification” button on the administrative interface, the Moodle module queries the course database, by means of the aforementioned PHP page, and sends this information, properly formatted as text, to the H-DOSE platform, which performs the final semantic indexing. From this moment on, the course is conceptually classified. Users are then provided with a new search function that complements those already available in Moodle and that strictly resembles the “search by category” interface of the Passepartout service.

Some qualitative tests have been performed on the integrability of H-DOSE into web applications, evaluating the effort required for such a process in terms of the amount of man hours needed for developing the Moodle semantic module; Table 7.1 shows the results.

Phase                            Man hours
Moodle DB analysis               40 hr
PHP query page                    4 hr
Semantic Classification module    6 hr
Search by category interface      8 hr
Test & debug                     20 hr
TOTAL                            78 hr

Table 7.1. Man hours required for the Moodle semantic integration.

Moreover, the advantages and improvements that semantic inclusion brings to an e-Learning environment are also under evaluation. These tests are only in a very preliminary phase; therefore they are neither sound enough nor valuable as a demonstration, and they are not reported here. They are nevertheless useful for capturing a taste of the services that could be offered for learning environments, and for a subsequent phase of requirements gathering and analysis, possibly leading to improvements in the H-DOSE platform.

7.3 The CABLE case study

The primary goal of the CABLE project is to develop methodologies enabling e-learning tools that support educational operators, e.g., learning facilitators, primary and high-school teachers, university professors, schools for social operators, etc. The specific characteristics of this learning group prevent the adoption of classical e-learning environments: educational operators need a learning approach extensively based on growing personal experience (recall and elaboration) and knowledge of other parallel experiences (exchange, communication, comparison), as well as the incorporation of theoretical studies and contributions (continuous update, life-long learning).

The e-learning system should facilitate and encourage students in a personal elaboration of the learned material, to achieve a higher degree of awareness in their professional actions. The CABLE project addresses this problem by exploiting a learning approach based on the availability and exploitation of a pre-existing, extensive archive of case studies, which can be termed a community memory, sometimes referred to as an organizational memory. This community memory is, by its nature, extensible and dynamic.

CABLE postulates that personal knowledge and skills, critically including intervention skills in this domain, can be improved only through the study and comparison of other people’s experiences, interpreted in their specific context. The full acquisition of new knowledge can be strengthened by seeing linguistic and cultural differences as resources rather than barriers.

The learning approach is therefore based on the development of synthetic, composable didactic modules, which are based on a narrative language structure and linked to more analytical in-depth material composed of theoretical contributions, normative references, context descriptions, etc. Each didactic module is associated with a set of real-world, on-going case studies, which are used as teaching examples, as materials upon which to develop and sharpen interpretation abilities, as contact points for social interactions, and as a basis for self-assessment and self-evaluation.

The e-learning system supporting the methodology proposed by CABLE is built around two core entities, namely the case studies and the didactic modules. Users of the system are students, authors of didactic modules, and contributors of new case studies. The experiences and implicit knowledge in case studies, and the explicit knowledge contained in the didactic modules, need to be handled in an intelligent way by the system, in order to discover relationships, shared concepts, learning paths, etc. As a consequence, case studies and didactic modules are categorized by formal metadata, resorting to domain-specific ontologies and semantic network hierarchical conceptualizations. The classification of case studies and didactic modules is a dynamic process, influenced by interactions with and feedback from users.

The on-line courses shall therefore adapt themselves automatically to new case studies, the availability of new didactic modules, improved classification or description of existing case studies, and emerging common practices. Case studies may be textual or multimedia.

7.3.1 System architecture

CABLE is both a project and a framework supporting the learning methodologies experimented with and developed during the project execution. The framework has a well-defined ICT infrastructure, which reuses already available, effective solutions as much as possible, avoiding reinventing the wheel. Three main components compose the basic structure of the CABLE architecture: an e-learning environment, a repository of case studies and good practice examples, and a semantic module able to leverage the formal metadata associated to both learning objects and case studies for composing and discovering associations between courses and good practice examples. As the domain of application for the CABLE framework requires growing the experience of users and teachers by allowing comparison and sharing of similar case studies and solutions, the semantic module has the responsibility of automatically establishing correspondences between new learning paths and existing case studies, as well as the capability to correlate, at runtime, newly added case studies with the already existing cases and learning modules. Figure 7.2 shows the logical organization of the CABLE framework.

Figure 7.2. The CABLE framework logical architecture.

The VLE module is the Bodington [30] learning environment, a cutting-edge, open-source e-learning system widely adopted by UK universities (e.g., UHI [31]); it is entirely developed in Java and runs on top of the Apache Tomcat servlet engine. The case studies repository, instead, is a Java web application developed from scratch during the project execution. Finally, the semantic module is implemented by a customized, minimal version of the formerly introduced H-DOSE platform, named mH-DOSE (minimal H-DOSE).

The basic interaction flow is designed to be twofold, i.e., to support two different information needs: finding case studies from a well-defined learning resource in the VLE, or finding case studies relevant with respect to a significant example (search for related case studies). In Figure 7.2 these two operational paradigms correspond, respectively, to the leftmost and to the rightmost user interfaces. As can easily be noticed, both processes are mediated by the semantic module, which lies in the middle.


The two interaction paradigms can also be described in the form of interaction diagrams. In the first case (Figure 7.3), a user, or a teacher, uses the VLE to participate in some learning activity. At some point in the e-learning process, case studies shall be analyzed to better understand how to tackle a given pedagogical scenario. As the CABLE framework hosts many case studies provided by several entities, in a Europe-wide environment, the resources relevant to the learning module are extracted from the cases repository at runtime. Among other advantages, this allows newly added knowledge to be automatically taken into account, in a transparent way. Therefore, following the user request, the VLE contacts the semantic module for relevant case studies, providing, at the same time, a conceptual description of the learning object viewed by the user. The semantic module retrieves from the case studies repository all descriptions of case studies that match, at least partially, the VLE specification. Then, by applying ontology navigation techniques, it ranks the retrieved results and provides back to the VLE a list of URLs of good practice examples. The VLE retrieves the case studies and presents them to the user in a convenient way.

Figure 7.3. The “VLE to case studies” interaction diagram.

In the second scenario, instead, the user is already accessing the case studies repository, for example to consult a well-known solution to a specific pedagogical issue. Having read how the solution worked and in which scenario it was applied, the user might want to find out whether the just-learned approach has been successfully applied to other, similar situations. He/she selects the “related case studies” button on the user interface to retrieve similar cases. The button press causes, inside the repository, the retrieval of the conceptual description of the case currently viewed. Such a description is passed to the semantic module which, in turn, extracts from the repository a set of candidate case studies, on the basis of the initial semantic specification, taking into account ontology relationships and concepts. Then, as in the former scenario, the semantic module ranks the retrieved results and provides a list of “relevant” URLs to the repository application. The repository retrieves and organizes the relevant case studies and presents them to the user as a result.

Figure 7.4. The “case study to case studies” interaction diagram.

As can easily be noticed, in both cases there are no predefined matches between case studies, or between case studies and learning objects. Instead, matches are discovered at runtime, by comparing the respective conceptual descriptions. As the comparison is ontology-driven, non-explicit associations can easily be discovered, thus leveraging the power of semantics to provide conceptually relevant results (which are hopefully more relevant than the ones that would be extracted by applying simple keyword matching techniques).

7.3.2 mH-DOSE

In order to implement the functionalities required by the semantic module of the CABLE framework, the H-DOSE platform has been strongly customized, removing all the unneeded functionalities and modules. This customization resulted in a light-weight platform using only the search engine and the expander module. These two modules have not been modified, since they already provided the required functionalities. In particular, to support the CABLE search processes, the “search by concept” function has been adopted, which uses expanded spectra for resource retrieval and query spectra for resource ranking. What changes, with respect to the complete H-DOSE, is that in CABLE the search engine is directly interfaced to the case studies repository. However, this aspect too has not required any modification to the original platform modules, as the interface provided by the case studies repository has been designed to be perfectly compatible with the one offered by the annotation repository in H-DOSE.

Figure 7.5. The mH-DOSE platform adopted in CABLE.

The overall process took no more than one day for adaptation and one day for testing, thus demonstrating the easy integrability of the H-DOSE platform into other web applications.


7.4 The Shortbread case study

The system described in this section offers a semantic-based what’s related functionality whose presence remains hidden to the user, at least as long as no relevant information can be found. In order to understand whether this goal can be reached, and how to effectively tackle the problems related to the semantic retrieval of related resources, the web navigation process has been analyzed. The interaction scenario follows: the user surfs the web using a browser. For each page requested by the user, the browser contacts a given server, or a given proxy, in order to fetch the proper content. Then it interprets the received page and shows the content, properly formatted, to the user (Figure 7.6).

Figure 7.6. Simple access to web resources.

As can easily be noticed, the interaction scenario involving an HTTP proxy (Figure 7.7) is very well suited for introducing transparent functionalities into the user navigation process. The proxy, in fact, intercepts both user requests and server responses, and can be exploited as the access point for the semantic what’s related system. For each incoming request, the page returned by the web server is analyzed, semantically classified if necessary, and its semantic description is extracted. The description is, in a sense, a snapshot of the user’s wishes in terms of information needs, at the semantic level. Such a snapshot is combined into a user model that, under certain conditions, can drive the retrieval of related pages, i.e., of pages semantically similar to the user needs as modeled by the proxy. Every time a new request from the user’s browser is received, the user model is updated and, possibly, the related pages are retrieved. Then a new page is automatically composed as the sum of the requested page and a list of suggested resources, and it is finally sent back to the user’s browser.

The user can, virtually, be completely unaware of the search system located in the proxy and can, in principle, think that the received pages are exactly the ones he/she requested by typing a URL in the browser or by clicking a link on a web page. In such a case the what’s related system is actually “transparent”.


Figure 7.7. Proxy-mediated web access.

7.4.1 System architecture

The proposed system is logically organized into two main functional blocks, as can easily be inferred from Figure 7.8: an intelligent proxy, and Shortbread, a semantic what’s related system. The proxy is actually a preexisting proxy available under an open source license and does not constitute a novelty point. The novel contribution, instead, can be found in the way the proxy has been used and in the Shortbread system.

In the proposed setting, rather than working as a simple relay, the proxy works as a programmable switch: when Shortbread has no information to propose to the user, the proxy acts nearly as a normal proxy, by simply relaying information and by spilling out the URLs of the requested pages (3). These URLs are forwarded to the underlying semantic what’s related system. Otherwise, if Shortbread is able to find information semantically related to the user needs, as extrapolated from the user navigation, the proxy acts as a sort of mixer, adding to the page coming from the remote server (2) a list of possibly relevant resources (4) that the user can choose from during his/her navigation.

The core of the proposed approach is clearly Shortbread. This semantic what’s related system basically includes 4 different modules: the User Profile (U.P.), the Semantic Information Retrieval system (S.I.R.), an XML-RPC to SOAP gateway, and a semantic platform for managing the knowledge acquired during the web navigation (H-DOSE).

The User Profile extracts and stores information about the user navigation, in order to perform searches for information semantically related to the user needs. Practically speaking, for each page requested by the user through the browser, the U.P. obtains the corresponding URL and tries to extract the semantic characterization of the resource using the S.I.R. If the URL has already been classified (and its description stored in the H-DOSE platform), the S.I.R. provides to the U.P. the semantic description of the resource in the form of a “conceptual spectrum”. Otherwise, the resource is inserted in the classification queue of the H-DOSE platform.
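The lookup step performed by the U.P. can be sketched as follows, modeling a conceptual spectrum as a concept-to-weight map; the SemanticIR facade and its method names are hypothetical, introduced only to make the control flow concrete.

import java.util.Map;

/** Hypothetical facade over the S.I.R.: names are illustrative, not the real API. */
interface SemanticIR {
    /** Reverse query: returns the conceptual spectrum of a URL, or null if unclassified. */
    Map<String, Double> getSpectrum(String url);
    /** Inserts the resource in the H-DOSE classification queue (asynchronous). */
    void enqueueForClassification(String url);
}

class UserProfileLookup {
    static Map<String, Double> describe(SemanticIR sir, String url) {
        Map<String, Double> spectrum = sir.getSpectrum(url);
        if (spectrum == null) {
            // Not yet classified: the description will be available on a later visit.
            sir.enqueueForClassification(url);
        }
        return spectrum;
    }
}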

Figure 7.8. The Shortbread architecture.

Whenever a semantic characterization for the current page is available (i.e., the resource description was already present in H-DOSE), the U.P. updates its internal user model by adding the knowledge related to the current page. This process allows the system to incrementally extrapolate the user information needs, while the model becomes more and more accurate at each new navigation step. Clearly, the user model is accurate if and only if the user navigation is longer than a given number of pages. The U.P. therefore has some policies to prevent the system from making suggestions when the model is poorly accurate, avoiding providing disturbing information to the user. Once the model becomes accurate enough, the User Profile can query the Semantic Information Retrieval system for resources semantically related to the user model. In other words, a snapshot of the user model (i.e., a conceptual spectrum) is passed to the S.I.R. which, in turn, requires H-DOSE to provide the URIs of those resources whose conceptual spectrum is reasonably similar to the received user model. These URIs are then forwarded to the proxy, which mixes the suggested information with the currently elaborated page, thus providing to the user a complete page composed of the original content plus the newly generated links to related pages.
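One plausible realization of the incremental model update follows, again treating a conceptual spectrum as a concept-to-weight map. The decay constant and the minimum-pages policy are assumptions for illustration, not the exact parameters used by the system.

import java.util.HashMap;
import java.util.Map;

/** Sketch of the incremental user model kept by the U.P. (parameters are assumed). */
class UserModel {
    private final Map<String, Double> model = new HashMap<>();
    private int pagesSeen = 0;
    private static final int MIN_PAGES = 5;   // assumed accuracy policy
    private static final double DECAY = 0.9;  // older interests fade over time

    /** Folds the spectrum of the last visited page into the model. */
    void update(Map<String, Double> pageSpectrum) {
        model.replaceAll((concept, weight) -> weight * DECAY);
        pageSpectrum.forEach((concept, weight) -> model.merge(concept, weight, Double::sum));
        pagesSeen++;
    }

    /** Suggestions are produced only once the model is deemed accurate enough. */
    boolean accurateEnough() {
        return pagesSeen >= MIN_PAGES;
    }

    /** Snapshot passed to the S.I.R. as the query spectrum. */
    Map<String, Double> snapshot() {
        return new HashMap<>(model);
    }
}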

Referring to Figure 7.8, the arrow labeled (4) exists, therefore, only when the user model is accurate enough and there are resource descriptions available in the H-DOSE platform. Otherwise only the arrow (5) exists, meaning that the proxy is actually working as a simple relay.

The Semantic Information Retrieval system is basically a query generator for the H-DOSE semantic platform. On one side, it has the ability to compose reverse queries, i.e., queries that, starting from a simple URL, request the semantic descriptions associated with the page identified by that URL. On the other side, it has the capability to take as input a conceptual spectrum, which encodes the model of user needs, and to perform a search, using H-DOSE, for the resources having similar conceptual descriptions. Eventually, the S.I.R. also has the ability to trigger H-DOSE indexing whenever a reverse query fails, meaning that the page currently transiting through the proxy has not yet been classified.

The XML-RPC to SOAP gateway is a protocol translator allowing simple connections between the Semantic Information Retrieval system, which uses XML-RPC, and the H-DOSE platform, which is based on web services and uses SOAP as its communication protocol.

7.4.2 Typical Operation Scenario

The Shortbread system has been conceived as a personal proxy: the system is, in other words, designed to work on the user’s PC, in the background. This hypothesis has several advantages and also allows some simplifications in the system design. Shortbread has mainly one functionality, i.e., to provide links to resources semantically related to the user navigation, which, in some sense, is assumed to be an indicator of the user information need. To operate in a semantic fashion it needs an ontology that defines the knowledge domain in which the user performs searches. Such an ontology is a foundational block of the H-DOSE platform, allowing the semantic indexing and retrieval of resources.

Designing the what’s related system to work on the user PC has made it possible to completely ignore the issues related to user permissions, authentication, ontology switching and so on. For each user of a given PC, a specific instance of Shortbread can be run, allowing personalized navigation and transparent searches, even on different domains. Moreover, a simple persistence mechanism permits storing the user model, so that it has the chance to become more and more accurate at each new navigation session. In addition, since a repeated navigation pattern can sometimes make the user model too specialized, a utility function is also available for resetting the model to a given previous state.

Conversely, this design choice can sometimes be restrictive, especially on old PCs, because the H-DOSE processes can be very computationally expensive. However, the H-DOSE component was originally designed to be a distributed, server-side system, and already possesses the functionalities for being used remotely. In particular, it allows multiple concurrent connections from remote clients requiring different tasks (search or indexing). Therefore, whenever a user deems it necessary not to run H-DOSE on his/her personal machine, Shortbread can be configured to access a remote H-DOSE server and to use such a server as the semantic backbone of the system.

Currently there are still some open issues when working remotely with H-DOSE: user authentication, user management and ontology switching functionalities are in fact very preliminary and do not allow an efficient and secure management of user rights and information. In other words, the current version of the H-DOSE platform does not limit a user to seeing only the information he/she is entitled to, but offers the same visibility to all the users connected to a given H-DOSE server. This becomes a critical issue if the user navigation is about confidential topics/resources. The author’s research group is actively concerned with such issues and, in the forthcoming new version of the H-DOSE platform, they will be tackled, allowing for more effective and secure operation. As a last remark, it must be noticed that, due to the modular nature of the approach and to the backward compatibility of H-DOSE, the entire system will be able to work on future versions of the platform without requiring changes to the other modules.

7.4.3 Implementation

The system has been implemented in the Java programming language. This choice is mainly related to the availability of a programmable proxy written in this language, called Muffin [2], and to the fact that the H-DOSE platform has also been developed in Java. In more detail, Muffin is a programmable Java proxy based on the notion of filter. A filter can act either on an entire page, as a whole, or on single chunks (tokens) of the HTML code that represents the page itself. Page-level filters can act both on the path between the user browser and the remote server, and on the inverse path. Token filters, instead, can only work on the return path.

The proxy component of Shortbread has been implemented as a token filter in Muffin. Basically, it sniffs the URLs of the pages requested by the user while they are on their way back to the browser. Then, if some suggestions are available, it modifies the HTML code of the transiting page by adding a “what’s related” section either at the end or at the beginning of the page. The Shortbread core, constituted by the User Profile, the Semantic Information Retrieval system and the XML-RPC to SOAP gateway, has been developed as a set of separate Java classes which are included in the Muffin filter and thus executed as part of the Muffin process.
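The injection logic of the token filter can be schematized as below. This is not Muffin’s actual filter API, which is omitted here; it is only a sketch of what the filter does with each HTML token on the return path, assuming the suggestion list has already been produced by Shortbread.

import java.util.List;

/** Schematic token filter: not Muffin's real interface, only the injection logic. */
class WhatsRelatedFilter {

    private final List<String> suggestions; // URLs provided by Shortbread, possibly empty

    WhatsRelatedFilter(List<String> suggestions) {
        this.suggestions = suggestions;
    }

    /** Called for each HTML token on its way back to the browser. */
    String filterToken(String token) {
        if (!suggestions.isEmpty() && token.equalsIgnoreCase("</body>")) {
            // Splice a "what's related" list just before the closing body tag.
            StringBuilder related = new StringBuilder("<div class=\"whats-related\"><ul>");
            for (String url : suggestions) {
                related.append("<li><a href=\"").append(url).append("\">")
                       .append(url).append("</a></li>");
            }
            related.append("</ul></div>");
            return related.append(token).toString();
        }
        return token; // everything else passes through unchanged
    }
}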

H-DOSE, instead, is completely separated from Shortbread and is based on both web services and intelligent agents. It runs on a proper servlet container (Apache Tomcat) and can be distributed among several hardware devices in order to maximize its performance. In the preliminary experimental setup, Shortbread has been deployed on a Pentium-class PC equipped with 512 MByte of memory and 40 GByte of disk space. The disk space occupation of the system is very low and the memory load nearly negligible. H-DOSE, instead, has been deployed on a separate machine equipped with an AMD Athlon XP processor (2200+), 1 GByte of RAM and 120 GByte of disk space. The H-DOSE impact on the PC performance is low in terms of occupied space, around 20 MByte, but is quite evident in terms of memory load, which can reach a maximum of 40 MByte.


Chapter 8

H-DOSE related tools and utilities

This chapter is about all the H-DOSE related tools and methodologies developed during the platform design and implementation. They include a new ontology visualization tool, a genetic algorithm for semantic annotation refinement, etc.

During the H-DOSE design and development, many side issues related to the main stream of research have been addressed. In particular, two tools have been developed, addressing respectively the problem of annotation generation/storage and the problem of ontology creation/visualization.

The former research effort is an outcome of the first platform experimentations. When documents are annotated automatically, the corresponding spectra often turn out to be very dense, i.e., they usually involve many concepts in the ontology. This feature is sometimes undesirable, since it retains a great amount of redundancy and may host several annotation errors, which easily go undetected by being buried among many low-relevance annotations. In order to tackle this issue, an evolutionary refinement method for annotations has been developed, trying to eliminate as much redundancy as possible while maintaining, at the same time, the maximum annotation relevance. Evolutionary techniques have been chosen since this is a classical optimization process where a trade-off between annotation conciseness and relevance must be reached, and these techniques are known to be well suited to this kind of problem.

The latter tool, instead, is related to the H-DOSE experimentation in the Passepartout and CABLE test cases. One of the preponderant aspects that emerges from these case studies is that often, even before understanding the value added by semantics integration, non-technical people do not understand the formal knowledge model. They are in fact experts in the domain, but not in formalization techniques. As a consequence, modeling errors often remain undetected, since RDF or OWL are too cumbersome to be understood by non-experts.


This situation has promoted, in the author’s research group, several insightful discussions on how to make the formalization commitments in ontologies clearer. Such discussions resulted in a new ontology visualization tool able to represent conceptual models as enhanced trees in a 3D space. The tool has been experimented with at the CABLE final meeting as well as at the SWAP2005 conference, and proved to be quite useful. In almost all cases, comments were about the visualized models and their correctness rather than about the tool interface and navigation paradigms. This is clearly not a demonstration of effectiveness; however, it gives some good feedback on the tool’s ability to ease the process of ontology design and revision, especially when such a process is performed by non-technicians.

8.1 Genetic refinement of semantic annotations

Semantics integration, on the current web, requires the ability to automatically associate conceptual specifications to web resources, with minimal human intervention. The huge number of available resources, and the presence of the so-called deep web, not accessible to user navigation, make manual annotation of the whole web infeasible. Tools are therefore needed to link web resources with semantic descriptions referred to domain ontologies.

Relevant information can usually be extracted from on-line textual resources; however, it is sometimes only available in the form of audio and video content, so such formats should also be considered when an annotation process is designed. Semantic annotation requires the definition of the sequence of information extraction and analysis operations needed to describe a resource in a conceptual manner. The process input is composed of resources identified by URIs, which are retrieved and elaborated to produce a set of semantic descriptors pointing at them.

Several approaches have been proposed for automatic semantic annotation, stemming from two main research fields: natural language processing [32] and machine learning [33]. Natural language processing (NLP) is based on the detection of typical human constructs in textual information and tries to map known sentence composition rules to semantics-rich descriptions. As an example, using NLP the sentence “Peter goes to Rome” is parsed into a tree structure like the one in Figure 8.1.

Such a tree could then be mapped onto a semantics-rich representation by using concepts and relationships defined in an ontology and by associating tree components to semantic entities (Figure 8.2).

Figure 8.1. The grammar tree for “Peter goes to Rome”. (NP = noun phrase, N = noun, VP = verbal phrase, V = verb, PP = prepositional phrase, PREP = preposition)

Nowadays, search engines could be seen as very simple NLP annotators which extract pseudo-semantic information from the occurrence of keywords in web resources. Semantically speaking, the basic form of an NLP-based resource annotation could therefore be identified as a process that, starting from the occurrence of certain words in web documents, generates the set of concepts to which the document is related, i.e., to which such words have been associated by human experts and/or automated rule extraction.

Figure 8.2. Triple representation of the sentence “Peter goes to Rome”.

The other way to provide semantic descriptors currently under investigation is machine learning. Basically, machine learning means extracting association rules and behaviors that allow machines to accomplish specific tasks as well as their human counterparts. In the Semantic Web, machine learning principles are applied in order to learn how human beings classify homogeneous sets of resources and to imitate such behavior in an unsupervised process. Learning semantic association rules implies several assumptions: first, knowledge domains should be limited, and possibly self-contained, in order to provide significant training sets. Secondly, a significant amount of resources annotated by human experts is needed in order to extract association rules that are reliable and at the same time expressive enough, that is to say, rules able to ensure the correct annotation of the documents to be indexed.

With respect to the shallow NLP annotation of traditional search engines, a simple implementation of machine learning principles can be organized as follows. A set of domain-specific web resources is manually annotated by human experts looking at a conceptual domain model in the form of an ontology. Then data mining is applied to extract the most reliable associations between lexical entities and concepts, thus defining a set of association rules characterized by a certain degree of confidence. The most reliable associations extracted are finally used to classify the whole resource set, producing a set of semantic annotations (this is, for example, the basic principle of SVM classification in the forthcoming version of H-DOSE).

Regardless of the method used to extract semantic descriptors from syntactical information, the generated set of semantic annotations has at least three peculiar characteristics: it possesses an accuracy degree value, which describes how well the resource semantics is captured by the annotations; it is composed of many entities, in a number that is likely not manageable by human experts; and it usually models redundant semantic information, at different granularity levels.

Useful semantic information should be classified as humans do, trying to preserve typical features of human perception of reality while being at the same time understandable by machines. Typical features of annotations created by experts are conciseness, expressiveness and focus. Human beings cannot handle huge amounts of information; therefore they evolved mental processes able to extract the key elements of an external object or event, building concise models able to guide decision processes when a given situation happens. At the same time, since the real world can offer different similar situations manageable in the same way, a conceptual model should be general and expressive enough to allow reuse. Finally, mental models are usually focused on a precise sequence of external stimuli, defining a sort of domain-specific knowledge similar to the philosophical formalization of ontology.

Effective automatic annotation should therefore mimic those characteristics in order to be useful in the Semantic Web; however, available technologies are still not able to provide the required level of accuracy, conciseness and expressiveness of semantic descriptors. Refinement is therefore needed in order to provide effective semantic descriptors and, from a user perspective, effective applications. Annotation refinement requires knowledge about domain concepts and about relationships between concepts, and should take into account the granularity level of the available descriptors, since they can refer to entire resources as well as to single paragraphs of a document. Such a process can be modeled as an optimization process with several constraints, some of which are not explicit, such as the relation between semantic annotations referring to the same resource.

Genetic algorithms can be applied to face such problems, providing at the same time a highly dynamic system able to react quickly to changes in the initial annotation base and to effectively face changes in user behavior.

8.1.1 Semantics powered annotation refinement

After mapping the syntactical information contained in web resources to semantic annotations, a huge set of redundant, poorly expressive semantic descriptors is available. Almost all relevant information is contained in such a set, but a refinement process is needed in order to synthesize the relevant data and to purge the not useful parts.


There are several measures that allow assessing annotation quality and performing a selection of appropriate descriptors. For each semantic annotation, a relevance weight is usually provided, quantifying how strongly a web resource is linked to a given ontology concept. A first filtering phase could therefore select the most relevant descriptors from the whole extracted set.

Unfortunately, this simple selection method is not able to discriminate between relevant and irrelevant descriptors, since it relies on inaccurate data. The relevance value, in fact, is provided by the automatic annotation process and may be affected by errors in classification, association rule extraction, etc. Even in the simplest scenario, in which a set of words is associated to each ontology concept, the relevance value is strongly influenced by such a set, and a flaw in the word definition process can compromise the real semantic relevance of the produced annotations.

On the other hand, manually checking each generated annotation is infeasible, even for small sets, and would be undesirable, since the entire annotation process must be unsupervised.

The relevance value associated to each annotation should therefore be handled with caution, as an indicative value, and should be integrated with other considerations, taking into account the domain knowledge specification (ontology) and the granularity of annotations. Let us clarify the scenario with an example, supposing we have a set of seven annotations to refine. All annotations refer to the same web page, about dog care. The web page is composed of three paragraphs, respectively about dog nutrition, fitness and psychology. The extracted annotations are shown in Figure 8.3, while the ontology branch to which they point is shown in Figure 8.4.

In this simple case, it is possible to figure out how an automatic process could reach performances similar to human expert classification. First, the automatic process should sort annotations by raw relevance; in the example, it would produce the following sets: dog care, dog, dog nutrition, nutrition and care for the first resource, and dog nutrition, dog and care for the second one. Secondly, it should leverage ontological relationships between concepts to set up selection policies based on semantic links between topics and annotations, even if those links are not explicitly modeled (inference). As an example, it should be able to understand that dog nutrition and dog care are subclasses of the more general concepts nutrition and care, and that both topics can be applied to dogs. Finally, the refinement process should analyze the granularity level of the annotated resources, with respect to the ontology hierarchy, and should change the selection order so that more general resources are annotated by more general topics.

In the proposed example, the resulting annotation sets would be ordered as follows: dog, dog care, nutrition, care, dog nutrition for the first resource, and dog nutrition, dog and care for the second resource. At the end, a filtering process selects the most relevant annotations, providing results that, compared with the ones provided by human experts, possess a satisfying level of conciseness and expressiveness.
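As a toy illustration of the first two steps, assuming each annotation carries a raw weight and the depth of its topic in the ontology, the re-ordering for a general resource could be expressed as a weighted comparator; the depth bonus below is an invented stand-in for the real granularity correction.

import java.util.Comparator;
import java.util.List;

/** Toy re-ordering of annotations: raw relevance corrected by topic generality. */
class AnnotationRanker {

    record Annotation(String topic, double weight, int topicDepth) {}

    /** For a general resource, shallower (more general) topics receive a bonus. */
    static void rankForGeneralResource(List<Annotation> annotations, int maxDepth) {
        annotations.sort(Comparator.comparingDouble(
                (Annotation a) -> a.weight() + 0.1 * (maxDepth - a.topicDepth()))
                .reversed());
    }
}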


<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://www.example.org/ontology#">

  <!-- whole resource -->
  <rdf:Description rdf:nodeID="A0">
    <ex:uri rdf:resource="http://www.dogs.org/care"/>
    <ex:topic rdf:resource="ex:dog" />
    <ex:weight>1.0</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A1">
    <ex:uri rdf:resource="http://www.dogs.org/care"/>
    <ex:topic rdf:resource="ex:care" />
    <ex:weight>0.5</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A2">
    <ex:uri rdf:resource="http://www.dogs.org/care"/>
    <ex:topic rdf:resource="ex:dog care" />
    <ex:weight>1.0</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A3">
    <ex:uri rdf:resource="http://www.dogs.org/care"/>
    <ex:topic rdf:resource="ex:nutrition" />
    <ex:weight>0.5</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A4">
    <ex:uri rdf:resource="http://www.dogs.org/care"/>
    <ex:topic rdf:resource="ex:dog nutrition" />
    <ex:weight>0.8</ex:weight>
  </rdf:Description>

  <!-- first paragraph -->
  <rdf:Description rdf:nodeID="A5">
    <ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
    <ex:topic rdf:resource="ex:dog" />
    <ex:weight>0.5</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A6">
    <ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
    <ex:topic rdf:resource="ex:care" />
    <ex:weight>0.2</ex:weight>
  </rdf:Description>
  <rdf:Description rdf:nodeID="A7">
    <ex:uri rdf:resource="http://www.dogs.org/care#p[1]"/>
    <ex:topic rdf:resource="ex:dog nutrition" />
    <ex:weight>3.0</ex:weight>
  </rdf:Description>
</rdf:RDF>

Figure 8.3. Example of mined annotations.


Figure 8.4. The “dog” ontology.

In addition to the complexity of the optimization problem, some issues arise in real-world annotation scenarios, in which thousands of annotations shall be refined. The annotation base is in fact subject to timely changes due to user requests, auto-updating processes and so on. Moreover, in many cases annotations may possess the same relevance value, making the first selection phase harder.

8.1.2 Evolutionary refiner

The static optimization problem for annotation refinement is quite complex and basically requires a multi-objective strategy to reach a good compromise between two opposite goals: annotation expressiveness and annotation conciseness.

Instead of developing a solution based on multi-objective optimization algorithms, the author designed a traditional evolutionary system1 based on a dynamic fitness, in order to fulfill the two opposite goals in a simpler and more effective manner. Moreover, the optimization task, in the real world, assumes other dynamic characteristics, fostering the adoption of appropriate evolutionary solutions. The data set, in fact, changes continuously, depending on external events, and there is no way to predict how the optimum would change after interaction with users as well as with other systems.

Running multi-objective algorithms every time the annotation base changes is clearly infeasible, because they are computationally expensive, while a semantic platform should support many concurrent accesses and should be able to perform concurrent resource indexing. Evolutionary algorithms are, instead, well suited for incremental optimization and could run continuously, working to reach a global optimum which takes into account every single variation in the annotation base. They are also able to track global optimum changes, i.e., changes in the fitness landscape, storing time-dependent information in the population state [34], and thus allowing the implementation of effective and reactive annotation refinement systems.

1First presented at CEC2004, Congress on Evolutionary Computation, Portland, Oregon, USA.

In the evolutionary refinement of semantic annotations, the individual size, in terms of annotated topics, fixes the amount of relevant annotations allowed for each indexed resource, and should be tuned to achieve the best compromise between conciseness and expressiveness, limiting information losses as much as possible.

Mutation introduces innovations into the population, crossover changes the context of already available, useful information, and selection directs the search toward better regions of the search space. Acting together, mutation and recombination explore the annotation space, while selection exploits the information represented within the population. The balance between exploration and exploitation, or between the creation of diversity and its reduction by focusing on the better-fitness individuals, allows the evolutionary algorithm (EA) to achieve reasonable performance.

Design

The goal of the evolutionary algorithm is to evolve a population of semantic annotation sets, each referring to a specific web resource, using genetic operators like mutation and crossover. Individuals are composed of a fixed number of genes, each identifying a single link between a topic and a web resource (annotation), together with the corresponding relevance value (Figure 8.5).

Figure 8.5. An individual of the evolutionary annotation refiner.

The proposed solution uses a steady-state evolutionary paradigm E(µ + λ): at each generation, λ individuals are selected to undergo genetic modifications, the resulting µ + λ population is evaluated, and the best µ individuals are transferred to the next generation. The selection operator is a tournament selection with a tournament size of 2, equivalent to a roulette wheel on the linearized fitness.
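One generation of the described E(µ + λ) scheme can be sketched as follows, with the genetic modification abstracted behind a single method; Individual and its fitness evaluation are placeholders for the structures described in the rest of this section.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Sketch of one steady-state E(mu + lambda) generation with tournament size 2. */
class SteadyStateEA {

    interface Individual {
        double fitness();
        Individual modify(Random rnd); // mutation or uniform crossover
    }

    static List<Individual> generation(List<Individual> pop, int lambda, Random rnd) {
        List<Individual> merged = new ArrayList<>(pop);
        for (int i = 0; i < lambda; i++) {
            // Tournament of size 2: low selective pressure avoids premature convergence.
            Individual a = pop.get(rnd.nextInt(pop.size()));
            Individual b = pop.get(rnd.nextInt(pop.size()));
            Individual parent = (a.fitness() >= b.fitness()) ? a : b;
            merged.add(parent.modify(rnd));
        }
        // Keep the best mu individuals of the merged (mu + lambda) population.
        merged.sort(Comparator.comparingDouble(Individual::fitness).reversed());
        return new ArrayList<>(merged.subList(0, pop.size()));
    }
}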

The tournament size fixes the convergence rate of the algorithm; its value is intentionally kept low in order to avoid premature convergence on sub-optimal solutions. Evolutionary annotation refinement is organized into two main phases: firstly, the annotations available for each indexed resource are collapsed into a small set, according to relevance weight, topic occurrence in the annotation base, and structural hierarchy. Secondly, the annotation base is updated with information from the extracted set, taking into account the effect of the new topic coverage distribution. At the end of this last phase the process restarts, thus providing a continuous refinement cycle (Figure 8.6) able to face changes in the annotation base.

Figure 8.6. The evolutionary refinement cycle.

In the first phase, for each indexed resource, the available set of annotations is extracted. Each annotation, together with its relevance value, becomes a gene stored in a sort of resource-specific gene repository G(R). From this repository, individuals are randomly created, taking as DNA a manually tuned number of genes, which fixes the conciseness degree of the refined semantic descriptors. When the initial population has been generated, the evolutionary process evolves a population of individuals whose fitness is given by a composition of the relevance weight contribution and of the annotation base topic coverage contribution, with respect to the annotation granularity level. The annotation base has in fact a dual nature: on one side, it stores annotations pointing at web resources, thus modeling indexing relevance (the value associated to each annotation). On the other side, it models how the conceptual specification covers the real knowledge domain. Basically, there is a certain amount of annotations pointing at each ontology concept; this value measures how well the conceptual descriptions fit the domain model: ontology concepts pointed at by many semantic annotations and located at deep levels of the concept hierarchy usually identify redundant information, or information for which the syntax-to-semantics conversion has provided poor results.


Therefore such annotations shall be penalized, in order to allow the evolutionary strategy to improve the semantic expressiveness of the individuals in the population.

The fitness function is defined according to the dual nature of the annotation base, and is composed of three main elements, taking into account, respectively, the indexing relevance value, the topic coverage factor just discussed, and the granularity level of the resource whose annotations should be refined.

$$F(I,R) = \sum_{i=0}^{n_g} \frac{W(g_i) \cdot T(g_i)}{F_R(R)} \;-\; D(I) \cdot P_h$$

where $n_g$ is the number of genes composing the DNA of an individual, $W(g)$ is the relevance weight associated to each annotation (i.e., to each individual gene $g$), $T(g)$ is the topic coverage correction factor, $F_R(R)$ is the granularity correction factor, $D(I)$ is the dependency factor taking into account ontological relationships, and $P_h$ is the penalty factor for ontological dependence between individual genes.

The topic coverage factor $T(g)$ is inversely related to the number of annotations (i.e., genes) pointing at a given concept, according to the principles previously discussed, while $F_R(R)$ adds information about the granularity level of individual genes in order to make annotation refinement more effective. The basic assumption is that a more specific resource, a paragraph for example, should be annotated by more specific topics, while a general resource, the entire web page for example, should point to more general concepts.

To achieve such behavior, the function $F_R(R)$ assumes higher values for more specific resources and lower values for more general ones.

$$T(g) \propto \frac{1}{\#\mathit{Annotations}} \qquad F_R(R) \propto \mathit{granularity}(R)$$

$P_h$ is a fixed penalty value for inter-gene dependence in the ontology; finally, $D(I)$ is the dependency correction factor and is inversely related to the distance between the concepts used as individual genes. Formal definitions of these values follow:

$$D(I) = \sum_{g_i, g_j} d(g_i, g_j) \quad \text{where } g \in G(R)$$

$$d(g_i, g_j) \propto \frac{1}{\mathit{dist}(g_i.topic(),\, g_j.topic())}$$
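Read this way, the fitness evaluation reduces to a loop over the genes of an individual. The sketch below is only an illustration of the reconstructed formula: the Gene accessors and the unit guard on the distance are assumptions, not the platform’s actual data structures.

import java.util.List;

/** Sketch of F(I,R) as reconstructed above; the Gene accessors are assumed. */
class FitnessEvaluator {

    interface Gene {
        double weight();               // W(g): indexing relevance
        double topicCoverage();        // T(g): topic coverage correction factor
        double distanceTo(Gene other); // ontological distance between topics
    }

    static double fitness(List<Gene> dna, double granularityFactor, double penalty) {
        // Summation term: W(g) * T(g) / F_R(R) over all genes (per the reconstruction).
        double sum = 0.0;
        for (Gene g : dna) {
            sum += g.weight() * g.topicCoverage() / granularityFactor;
        }
        // Dependency term D(I): inversely related to pairwise topic distances.
        double dependency = 0.0;
        for (int i = 0; i < dna.size(); i++) {
            for (int j = i + 1; j < dna.size(); j++) {
                // Guard against zero distances (assumption, not in the original).
                dependency += 1.0 / Math.max(1.0, dna.get(i).distanceTo(dna.get(j)));
            }
        }
        return sum - dependency * penalty;
    }
}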

Although the fitness function is morphological rather than time-dependent, i.e., the fitness value depends explicitly on factors related to individual physical characteristics, there is a non-explicit constraint that turns the overall behavior of the algorithm into a dynamic one.


At each refinement cycle, the annotation base is, in fact, updated by associating to the whole annotation set the topic coverage distribution reached in the last optimization cycle. In the subsequent refinement cycle, individuals having the same DNA as the ones in the current cycle will possess a different fitness value, according to the results of the last optimization. Moreover, indexing cycles may happen during the refinement phase, adding variability to the fitness landscape that the algorithm walks to reach the optimum. In this sense, the overall evolutionary refinement cycle assumes a highly dynamic behavior.

Currently, only one mutation operator and one crossover operator have been defined. The mutation operator simply extracts a new gene from the gene repository associated to a resource, and substitutes a randomly chosen gene in the individual DNA.

Recombination is performed uniformly, by selecting two individuals and alternately cloning individual genes. In other words, since individuals have the same DNA size, each DNA element is selected with a uniform probability distribution between the individuals participating in the crossover. Such an operator can produce invalid individuals, i.e., individuals having one or more duplicated genes. Those individuals are invalid since resources referring more than once to the same ontology concept are not as meaningful as needed. The author designed a simple policy to avoid this eventuality: individuals having duplicated genes are automatically dropped and cannot take part in the new population. Table 8.1 summarizes the adopted genetic operators.

Operator           Functional group
Uniform crossover  Recombination
Substitution       Mutation

Table 8.1. Genetic operators for annotation refinement.
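The two operators can be sketched as follows, with genes identified by their topic; returning null for an offspring with duplicated genes mirrors the drop policy described above (the gene representation itself is a simplifying assumption).

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Sketch of the operators of Table 8.1; the gene representation is assumed. */
class GeneticOperators {

    record Gene(String topic, double weight) {}

    /** Substitution mutation: a random gene is replaced by one from the repository G(R). */
    static List<Gene> mutate(List<Gene> dna, List<Gene> repository, Random rnd) {
        List<Gene> child = new ArrayList<>(dna);
        child.set(rnd.nextInt(child.size()),
                  repository.get(rnd.nextInt(repository.size())));
        return child;
    }

    /** Uniform crossover: each gene is taken from either parent with equal probability. */
    static List<Gene> crossover(List<Gene> p1, List<Gene> p2, Random rnd) {
        List<Gene> child = new ArrayList<>(p1.size());
        for (int i = 0; i < p1.size(); i++) {
            child.add(rnd.nextBoolean() ? p1.get(i) : p2.get(i));
        }
        // Validity policy: offspring with duplicated genes are dropped.
        boolean duplicated =
                child.stream().map(Gene::topic).distinct().count() < child.size();
        return duplicated ? null : child;
    }
}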

A stop criterion has also been defined for the evolutionary refinement of each resource’s annotations, based on population uniformity: whenever the diversity of individuals falls under a given threshold, or whenever the best individual fitness stops increasing, the cycle is interrupted and the best set of semantic descriptors (individual) is selected as a local optimum. The iterated nature of refinement ensures that the global optimum can be reached, keeping in mind that there is no absolute recipe to define the optimum from a user point of view, which is the ultimate goal of refinement.

Implementation

The evolutionary refinement strategy has been deployed as a Java class defined by a high-level interface, for abstraction from implementation details, as has been done with individuals and genes.

The gene repository has a variable size, depending on the amount of annotations available for each indexed resource, while the individual size is fixed and can be manually selected at start time. These elements have been integrated into the H-DOSE platform, to assess the feasibility of the approach. A new H-DOSE module called “Evolutionary refiner” has been developed, which actually performs the evolutionary refinement and supports SOAP communication, allowing interaction with the existing H-DOSE modules. The Evolutionary refiner has been integrated into the management logical area of the platform. Since many parameters could be tuned to optimize the strategy’s performance, the evolutionary annotation refiner has been complemented with a configuration file in which those values are stored (Table 8.2).

Parameter                        Value
Penalty P_h                      1
Dependency d(g_i, g_j)           1/dist(g_i.topic(), g_j.topic())
Topic coverage factor T(g)       #Annotations
Annotation granularity F_R(R)    #('/' ∈ resource URI)

Table 8.2. Fitness-specific parameters.

Results

Several testbeds have been defined for the evolutionary refinement of semantic annotations proposed in this section. First, the evolutionary algorithm parameters were set up to allow fair comparisons on different annotation bases (Table 8.3). Experiments used the latest available version of the H-DOSE platform, which allows the semantic indexing of web resources in many languages. The underlying ontology has been developed by the H-DOSE authors in collaboration with the Passepartout service of the city of Turin.

The ontology counts nearly 450 concepts organized into 4 main areas; for each ontology concept, a definition and a set of lexical entities has been specified [35], for a total amount of over 2500 words, allowing shallow NLP-based text-to-concept mapping.

Parameters                                           Values
Diversity threshold                                  0
Fitness increase threshold for the best individual   0
Number of individuals (µ)                            50
Number of new individuals (λ)                        20
Probability of mutation vs. crossover                0.1
Individual size                                      3

Table 8.3. Evolutionary strategy parameters.

The first experiment involved information from the Passepartout web site: 50 pages were indexed using the standard methods provided by H-DOSE (simple bag of words) and the corresponding annotations were stored in the Annotation Repository. The newly created module for evolutionary refinement ran on such data and produced as output a runtime version of the Annotation Repository, accessible from the search engine service. The initial evaluation aimed at assessing the feasibility of the approach and the validity of the results in a simple static scenario. The annotations stored in the repository were therefore refined a fixed number of times (10), without allowing modifications to the original set of semantic descriptors.

A total amount of 276 annotations, corresponding to 28 relevant resources, was obtained as the indexing result; the 22 remaining documents were judged not relevant by H-DOSE. After evolutionary refinement, the runtime version of the Annotation Repository contained about 97 annotations referring to a total amount of 21 resources. The difference in the amount of annotated resources was caused by the fixed-size individuals, which were not able to model resources for which fewer than 3 (the size of individuals) annotations were present. Besides the fact that resources having a very low number of annotations are likely to be too specific or even wrongly annotated, this problem can be fixed by propagating into the refined annotation base the descriptors referring to such resources.

The semantic annotations stored in both repositories (the original and the refined one) were evaluated by human experts, who specified a relevance value between 0 and 100 for each annotation. The relevance mean was evaluated both at the single-annotation level and at the resource level; Table 8.4 summarizes the obtained results.

As can easily be noticed, the two mean relevance values seem to be in contrast with each other, but this is not the case. In fact, relevance at the single-annotation level increases since many annotations with low relevance values were judged not relevant by the experts and purged by the evolutionary refinement, thus increasing the mean relevance figure. On the other side, at the resource level, each annotation contributes to the total relevance value; therefore it is straightforward that, all other parameters being equal, the higher the amount of annotations for each resource, the higher the corresponding relevance value (Figure 8.7).

In the best scenario, all annotations purged by the evolutionary system were judged completely non-relevant by human experts; therefore the mean relevance values were the same for the original and the refined annotation base.

Overall performance was interesting since the total expressiveness reduction ofthe annotation base, in terms of relevance at the resource level, was around 36%

137

Page 144: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

8 – H-DOSE related tools and utilities

Quality figures Original base Refined baseTotal annotated resources 28 21Resources with size ≤ Individual.size() 7 0Total amount of annotations 276 97Single annotation relevance (mean) 1.92 3.13Resource annotation relevance (mean) 9.67 6.19Expressive density de 0.035 0.063

Table 8.4. Evolutionary refinement results on the Passepartout web site.

Figure 8.7. Human-evaluated relevance at the resource level (Passepartout).

The expressive density de, defined as the mean relevance over the number of stored annotations, was 0.035 for the original annotation base and 0.063 for the refined one:

de = mean(relevance) / #Annotations
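As a quick check against Table 8.4, using the resource-level relevance mean as the numerator: de = 9.67/276 ≈ 0.035 for the original base and de = 6.19/97 ≈ 0.064 for the refined one, matching the reported values up to rounding.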

The second experiment involved a real-time scenario, in which the H-DOSE platform works on-line and indexing can be triggered by automatic processes, external applications and search failures at any time during the refinement process. Since the evolutionary refinement works continuously, taking a snapshot of the Annotation Repository content at each refinement cycle, changes in the original repository propagate to the refined one with a maximum delay of one cycle.

The experiment involved the indexing of on-line web resources from the “asphi” web site [36]. Asphi is a private foundation that promotes the adoption of information technology aids for people with disabilities by financing several research projects.

A snapshot of the two repositories was taken and the same evaluation performed for the static scenario was repeated.


Table 8.5 details the corresponding results, while the chart in Figure 8.8 depicts the relevance at the resource level for each indexed fragment.

Quality figures                              Original base   Refined base
Total annotated resources                    19              16
Resources with size ≤ Individual.size()      3               0
Total amount of annotations                  242             63
Single annotation relevance (mean)           0.64            1.33
Resource annotation relevance (mean)         7.47            3.81
Expressive density de                        0.031           0.060

Table 8.5. Evolutionary refinement results for the www.asphi.it site.

On-line operation and user-related changes to the original Annotation Repository do not significantly affect the refiner's behavior, allowing it to reach satisfactory performance even on a dynamic fitness landscape. As the experimental data show, the relevance loss can be estimated at around 50%, while the repository size reduction is about 75%, with a corresponding expressive density of 0.060 for the refined annotation repository and 0.031 for the original one.

Figure 8.8. Human-evaluated relevance at the resource level (www.asphi.it).

8.2 OntoSphere

In recent years the Semantic Web has been constantly evolving from the vision of a few people into a tangible presence on the Web, with many tools, portals, ontologies, etc.


This evolution involved many researchers from different countries and has been primarily focused on technologies. Today a Web developer can seriously consider the opportunity to provide semantically tagged content, as the needed tools and standards are available. However, the current web panorama shows very little adoption of semantics. The motivations for such low adoption can be various and related to very different aspects: technology immaturity, poor dissemination, user and developer resistance to change, etc. Among the possible failures and shortcomings, interfaces play a relevant role, often discriminating good solutions from bad ones. This is particularly true for tools related to knowledge modeling and visualization, where the involved information can be quite complex and multidimensional.

Several attempts aim at providing effective interfaces for knowledge modeling, i.e., for ontology creation and visualization. Protege [37] and OntoEdit [38], for example, are complete IDEs (Integrated Development Environments) that address in a single application all the aspects related to ontology creation, checking and visualization (through proper plug-ins). Such tools, although adopting rather different paradigms for editing and inspecting ontologies, share a two-dimensional approach to ontology visualization. The two-dimensional approach can be variously effective, and there are good solutions available on the web: GraphViz [39], Jambalaya [40] and OntoViz [41], just to name a few. Nevertheless, mapping the many dimensions involved in an ontology, such as the concept hierarchy, the semantic relationships, the instances and the possible axioms defining a given knowledge domain, onto only two dimensions can sometimes be too restrictive.

The author, in collaboration with some colleagues², proposes OntoSphere³, a new tool for inspecting and, in the near future, for editing ontologies in a more-than-3-dimensional space. The proposed approach visualizes the topological information in a 3D view-port, thus leveraging one more dimension with respect to current solutions. This allows, at least, a better organization of the visual occupation of the represented data. Since a 3-dimensional view is quite natural for humans, especially as far as navigation is concerned, the proposed approach can be more effective in browsing, as it involves “manipulation-level” operations such as zooming, rotating, and translating objects.

In addition, many more dimensions are introduced to convey information about the visualized knowledge model (meta-information). The extension of the subtrees lying under the currently viewed concepts is, for example, rendered visually by increasing the size of the corresponding visual cues. The same approach is applied to colors, which are used to add insight to the representation: blue spheres, for example, indicate that the corresponding concepts are terminal nodes in the ontology.

² Alessio Bosca, who developed the 3D visualization panel, and Paolo Pellegrino.
³ First presented at SWAP2005, Semantic Web Applications and Perspectives, Trento, Italy.


Transparency is used to distinguish inherited, or inferred, relationships from direct ones; shapes are used to differentiate concepts from instances, and so on.

Together with the means to convey more information to users through several visual dimensions, the proposed work also aims at tackling the space allocation issues of ontology visual models. In traditional solutions, in fact, big ontologies can easily lead to overcrowded representations that are difficult to browse and can be more confusing than helpful. Some attempts exist to overcome these problems, as in OntoRama [42], where the nodes being inspected are magnified with respect to the other nodes in the ontology. However, even these approaches tend to collapse when visualizing big ontologies such as SUMO [43], which counts over 20,000 concepts. The proposed application, instead, adopts a dynamic collapsing mechanism and different views, at different granularities, to guarantee constant navigability of the rendered model.

8.2.1 Proposed approach

Applications for ontology visualization abstract from the formalism of the underlying data and graphically model the information contained in a given knowledge base (KB). Each tool presents data according to its own approach, but generally all of them share the choice of a 2D space. The proposed approach, instead, leverages a 3D space as a means to effectively represent and explore data through an intuitive interface.

The application objective consists in enhancing the performance of current solutions in terms of completeness and readability: the OntoSphere application is able to graphically represent both taxonomic and non-taxonomic links, and to select and present information on the screen at a detail level appropriate to the user's current interest. Furthermore, the tool provides an intuitive navigation interface, featuring direct manipulation of the scene (rotation, panning, zoom, object selection, etc.), and is designed to meet the demands of domain experts who have little technical skill in the field of the Semantic Web and therefore specifically rely on graphical interfaces.

The choice of a three-dimensional environment constitutes the starting point in designing the tool, as a 3D space offers one more dimension than traditional 2D approaches to represent ontology data, thus simplifying its interpretation. In order to achieve completeness and readability, OntoSphere operates according to two different principles:

• To increase the number of “dimensions” (colors, shapes, transparency, etc.) that represent concept features, conveying additional information without adding the burden of further graphical elements, such as labels, to the scene.


• To automatically select which part of a knowledge base has to be displayed, and at which detail level, on the basis of user interaction with the scene.

The latter principle, in particular, is important for improving overall system performance, since the scale factor constitutes a strong issue in visualizing complex graph-like structures such as ontologies. As the cardinality of elements increases, the number of items concurrently displayed on the screen worsens the graphical perception of the scene and complicates spotting details. When the amount of visualization space needed to represent all the information within the KB exceeds that available on the screen, a few options remain: scaling down the whole image to the detriment of readability, presenting only a portion of it on the screen and allowing its navigation, or summarizing the information in a condensed graph and providing means for its exploration and expansion. As the effectiveness of these options depends on the use case involved (consistency checking, domain comprehension, KB updates), a combined usage of them can offer a better solution than the separate adoption of any one of the three.

OntoSphere tackles ontology visualization by exploiting different scene managers (respectively the RootFocus, the TreeFocus and the ConceptFocus scene manager) that present and organize the information on the screen, each according to a differently detailed perspective. These managers take turns managing the graphical space as the user's attention (inferred from interaction) shifts from one concept to another.

The proposed user interface is very simple and allows direct manipulation of the scene through rotation, panning and zoom; it permits browsing the ontology as well as updating it by adding new concepts and relations. Every concept within a given scene is clickable, with two different results: a left click performs a focusing operation, shifting the scene to a more detailed level, while a right click maintains the current perspective and simply navigates through elements. For example, left-clicking a concept in the global scene leads to the visualization of the related concept tree, while right-clicking it leads to the visualization of its children in the same perspective, as explained in more detail in the next sub-sections.
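As an illustration of this interaction scheme, the following sketch shows how click events might be dispatched to the three scene managers. The SceneManager interface and its method names are hypothetical; only the manager names and the left/right-click semantics come from the text.

// Hypothetical dispatch of click events to OntoSphere's scene managers.
interface SceneManager {
    SceneManager focus(String concept);    // left click: shift to a more detailed perspective
    void navigate(String concept);         // right click: move within the same perspective
}

class RootFocusManager implements SceneManager {
    public SceneManager focus(String concept) {
        // left click on the global scene opens the concept's tree
        return new TreeFocusManager(concept);
    }
    public void navigate(String concept) {
        // right click shows the concept's direct children on the sphere
    }
}

class TreeFocusManager implements SceneManager {
    private final String root;
    TreeFocusManager(String root) { this.root = root; }
    public SceneManager focus(String concept) {
        return new ConceptFocusManager(concept); // highest detail level
    }
    public void navigate(String concept) {
        // expand/collapse branches, keeping three fully-expanded levels
    }
}

class ConceptFocusManager implements SceneManager {
    private final String concept;
    ConceptFocusManager(String concept) { this.concept = concept; }
    public SceneManager focus(String other) { return new ConceptFocusManager(other); }
    public void navigate(String other) { /* inspect related concepts */ }
}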

RootFocus scene

This perspective presents a big “Earth-like” sphere bearing on its surface a collection of concepts represented as small spheres (Figure 8.9). The scene does not visualize any taxonomic information and only shows direct “semantic” relations between elements of the scene, usually a graph that is not completely connected. Atomic nodes, the ones without any subclass, are smaller and depicted in blue, while the others are colored white and their size is proportional to the number of elements contained in their sub-tree.


This view is particularly intended for representing the ontology primitives, i.e., the root concepts, but can also be used during navigation to visualize the direct children of a given node, a pretty useful option in the case of heavily sub-classed concepts. The topmost concepts within the ontology and the relations between them define the conceptual boundaries of the domain and provide a very good hint to the question: “what is the ontology about?”

Figure 8.9. The OntoSphere “RootFocus” scene.

TreeFocus scene

The scene shows the sub-tree originating from a concept; it displays the hierarchical structure as well as the semantic relations between classes. Since usage evidence shows that too many elements on the screen at the same time hinder user attention, the scene fully presents only three expanded levels at a time and, as the user browses the tree, the system automatically performs expansion and collapse operations in order to maintain a reasonable scene complexity.


The reader may note in Figure 8.10 how focusing the attention on the concept “ente pubblico locale”, on the left in the figure, causes (with a simple mouse click) the vanishing of the unfocused branches, which are collapsed into their respective parents (Figure 8.10, on the right). Collapsed elements are colored white and their size is proportional to the number of elements present in their sub-tree; concepts located at the same depth level within the tree share the same color, in order to easily spot groups of siblings. Hierarchical relationships within the scene are displayed in a neutral color (gray) and without labels, whereas other semantic relations involving two concepts already in the scene are displayed in red, accompanied by the name of the relation (as in the “RootFocus” perspective). If an element of the tree is related to a node that is not present in the scene, a small sphere is added for that node close to its visual cue, terminating the arrow: in such cases, incoming relations are represented with a green arrow, while outgoing links with a red one.

Figure 8.10. The OntoSphere “TreeFocus” scene.

ConceptFocus scene

This perspective depicts all the available information about a single concept, at the highest possible level of detail; it reports the concept's children and parent(s), its ancestor root(s) and its semantic relations, both those directly declared for the given concept and those inherited from its ancestors.

Semantic relations are drawn as arrows terminating in a small sphere: red if the relation is outgoing and green otherwise (Figure 8.11).


Direct relations are drawn close to the concept in an opaque color, while inherited ones are located a bit farther from the center and depicted in a fairly transparent color.

This scene is pretty useful during consistency checking operations because it eases the spotting of inconsistent relations whenever a concept inherits from an ancestor a property that “conceptually” contrasts with other features of its own.

Figure 8.11. The OntoSphere “ConceptFocus” scene.

8.2.2 Implementation and preliminary results

The work presented in this section has been entirely developed in Java. The choice reflects the current panorama of ontology editors and of tools for ontology creation and maintenance, which are in the majority of cases developed in this language. Among its other advantages, Java permits using such tools in different operating environments, from devices with low computational power to high-performance workstations.

The visualization engine uses the Java 3D API to produce a three-dimensional interactive representation of ontology concepts and relationships. This API is directly linked to an underlying OpenGL engine that provides the required graphics capabilities. The ontology-related part, instead, is based upon the well-known Jena semantic framework from HP [23], which allows easy loading, management and modification of ontologies and taxonomies written in RDF [4], RDFS, DAML, OWL [5] or N-Triples. These two main parts are the core modules of the implemented application, reconciling in a single working space the capabilities for visualizing and editing ontologies in various formats.
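As an indication of how the ontology-related part might feed the visualization engine, the fragment below loads an OWL model with the HP-era Jena 2 API (com.hp.hpl.jena packages) and lists its named classes; this is a minimal sketch, not the actual OntoSphere code, and the ontology file name is a placeholder.

import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;

public class OntologyLoader {
    public static void main(String[] args) {
        // Create an in-memory OWL model and read the ontology to visualize.
        OntModel model = ModelFactory.createOntologyModel();
        model.read("file:ontology.owl"); // placeholder location

        // Each named class would become a sphere in the 3D scene graph.
        for (ExtendedIterator it = model.listClasses(); it.hasNext(); ) {
            OntClass c = (OntClass) it.next();
            if (!c.isAnon()) {
                System.out.println(c.getLocalName());
            }
        }
    }
}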


In order to understand whether the proposed approach is valuable and scalable enough, the authors set up three different test beds. The first assesses the compliance of the tool with the initial requirements; the second evaluates the tool's utility when applied to real-world cases; and the last investigates whether the current implementation is able to manage complex ontologies.

In the first experiment the application was tested according to the usual practice of agile development and requirements satisfaction checking. All the modules composing the platform were developed starting from a rather precise specification and tested against predefined JUnit [44] tests. After passing the basic functionality checks, the entire application was tested against three different use cases: simple ontology browsing, “conceptual consistency” checking and ontology development. In the ontology browsing case, a group of 8 users was asked to load and browse 5 different ontologies. The goal was to guess the domain of the chosen ontology and to analyze the granularity of the knowledge model. The 5 ontologies used in this experiment were: the well-known Pizza ontology from Protege, the SUMO (Suggested Upper Merged Ontology) ontology, a music ontology developed from scratch by the authors, the CABLE ontology, and the Passepartout ontology developed by the authors in collaboration with the Passepartout service of the municipality of Turin. Results for each of the ontologies are reported in Table 8.6.

Ontology         Domain       Topic and level of detail identification
                              U1  U2  U3  U4  U5  U6  U7  U8
SUMO (OWL)       General      ✓   ×   ×   ✓   ✓   ×   ✓   ✓
MUSIC (RDF/S)    Music        ✓   ✓   ✓   ✓   ✓   ×   ✓   ×
CABLE (OWL)      —            ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
PASS. (OWL)      Disability   ✓   ×   ✓   ×   ×   ✓   ✓   ✓
PIZZA (OWL)      Pizzas       ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓

Table 8.6. Results for the ontology browsing use case.

Checking an ontology for “conceptual consistency” is rather different from the formal consistency checking performed by logic reasoners. What has to be checked is not the ontology's consistency for reasoning and inference, but whether a user can detect domain-related inconsistencies introduced during the ontology design process. For example, a concept may inherit some relationships that are not appropriate for it, either because of a wrong parent-child relation or because of a previously undetected error in the domain modeling: in this case the ontology is formally consistent but not conceptually.

The ontologies involved in this test were the same used in the previous one, as were the users. Some interesting aspects emerged from the experimentation.


The detection of “conceptual inconsistencies” through observation of the ontology representation appears, in fact, strongly dependent on the size of the knowledge domain and on the expertise the user has in that domain. Looking at Table 8.7, for example, it is clearly noticeable that no inconsistencies were found in the SUMO ontology, which is huge and well designed, while the involved testers had a very poor background in the SUMO domain. On the other side, in the CABLE ontology almost all inconsistencies were detected, since the ontology is small (80 concepts) and the domain was well known to all the experimenters. In conclusion, determining whether the proposed OntoSphere application is able to highlight inconsistencies is very difficult, since the involved factors are diverse and can interact in complex patterns.

Ontology         Known   Detected inconsistencies
                         U1  U2  U3  U4  U5  U6  U7  U8
SUMO (OWL)       /       0   0   0   0   0   0   0   0
MUSIC (RDF/S)    6       0   0   4   2   6   0   4   2
CABLE (OWL)      2       2   2   2   2   0   2   0   2
PASS. (OWL)      12      4   8   0   3   10  3   0   7
PIZZA (OWL)      0       0   0   0   0   0   0   0   0

Table 8.7. Results of the “conceptual inconsistencies” checking.

When the proposed application is used for ontology development, the support provided for detecting conceptual inconsistencies is much more evident. The adoption of OntoSphere for inspecting work in progress allows, in fact, easy detection of modeling errors. In particular, the most commonly recognized errors concern relationship propagation along the ontology hierarchy and wrong definitions of parent-child (isA) relationships.

Although it is quite difficult to fill a table showing how, and to what extent, the proposed application supports the process of ontology creation, interviews with users show that the experimenters were often able to quickly spot modeling errors. In their opinion, the intuitive visualization and the capability to visually represent inherited and inferred relationships were the main factors for success in their modeling process. This last experiment actually lies between the functional tests and the real-world test cases. However, to provide a more grounded experimentation (please note that the results presented here are still very preliminary), the authors performed a real-world test on the occasion of the final meeting of CABLE, a European MINERVA project on “CAse Based e-Learning for Educators”. In that meeting, a demo of the OntoSphere application was presented to visualize the ontology developed in the context of the CABLE project.


The exciting result is that, rather than complaining about the complexity of the interface, or about its appearance or the controls for browsing the ontology, the first observation was: “No! That relation cannot subsist between those two concepts!”. What happened, surprisingly, is that the application was able to highlight the inherited relations so that the errors were spotted within a few minutes of ontology browsing. This is clearly not a scientific result, since experiments are to be conducted in a controlled environment, shall have a clear objective and must be carried out by a significant group of users; the aim of this paragraph is not to present such a reaction as a validated result. However, the user reactions at the CABLE meeting are encouraging signals that the still-preliminary OntoSphere application can be a valuable instrument in ontology design and development.

As a last experiment, a simple scalability test was performed: the goal was to understand whether OntoSphere is able to load and visualize ontologies with large numbers of concepts and relationships. The entire SUMO ontology was therefore loaded and browsed: the loading process took around 3.5 seconds, while navigation was performed in real time. SUMO, the Suggested Upper Merged Ontology, is currently released under a GPL license and counts about 20,000 concepts related by over 60,000 axioms.

There are still some issues to be fixed when browsing really huge ontologies: the visualized concepts tend to collide when the number of concepts displayed at the same time is high, and the labels tend to overlap, making the visualization more difficult to manage (as in many other viewers). Moreover, since a human cannot keep track of more than a reasonable number of objects at a time, huge graphs shall be collapsed and different ontology navigation patterns and interfaces shall be provided.


Chapter 9

Semantics beyond the Web

This chapter discusses the extension of H-DOSE principles and techniques to non-Web scenarios, with a particular focus on domotics. An ongoing project on semantics-rich house gateways is briefly described, highlighting how the lessons learned in the design and development of H-DOSE can be applied in a completely different scenario while retaining their value.

This chapter provides a brief overview of currently ongoing work, describing the first steps taken by the author, together with some colleagues in the same research group, to integrate semantics into domotic applications. The presented approach adopts only a seminal, yet interesting, notion of semantics, which is essentially “encoded” inside the approach itself.

Semantics is seen at two different granularity levels. At the device level, each object in a house that is capable of communicating with a PC, either directly or through pre-processing of data, is organized into a taxonomy that describes the functional knowledge associated with it. So, for example, a dimmer light, i.e., a light whose intensity can be modulated, is classified as a subclass of the more generic light device that only supports on and off commands.

At the house logic level, instead, semantics is seen as a composition of rules (either specified by the household or automatically learned by observing the behaviors of the house inhabitants) and inferences from general-purpose knowledge. The vision being pursued can be depicted more clearly with an example: imagine that every time dark approaches, a person closes the house shutters and switches on the lights in the rooms where he/she is located. During Winter such a behavior is repeated every day at around 5 p.m. (in Northern Italy), while in Summer the same happens around 10 p.m., since daylight lasts longer.


A simple rule-based logic level would probably classify the two situations as different behaviors and would have several difficulties deciding whether to close the shutters at 5 p.m. or at 10 p.m. A semantically enabled system, instead, would possess some general knowledge about the house environment, indoor and outdoor. For example, it might have a notion of daylight and, as a consequence, of dawn and dusk. Therefore, in the same scenario, the semantically enabled domotic controller would infer that every day, at dusk, the shutters shall be closed and the lights activated in the rooms where the household members are present. In such a case the seasons have no influence on the automated process, since the time at which dusk occurs changes with them, and since the action is triggered by the dusk state rather than by the hour of the day.
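The contrast can be made concrete with a small sketch; the Daylight state and trigger names below are hypothetical illustrations, not part of the described system.

import java.time.LocalTime;

// Hypothetical contrast between a clock-based and a state-based trigger.
public class ShutterRules {
    enum Daylight { DAY, DUSK, NIGHT }

    // Clock-based rule: breaks when dusk shifts with the seasons.
    static boolean clockBasedTrigger(LocalTime now) {
        return now.equals(LocalTime.of(17, 0));
    }

    // State-based rule: season-independent, fires whenever dusk is detected.
    static boolean stateBasedTrigger(Daylight state) {
        return state == Daylight.DUSK;
    }

    public static void main(String[] args) {
        System.out.println(clockBasedTrigger(LocalTime.of(17, 0))); // true only at 17:00
        System.out.println(stateBasedTrigger(Daylight.DUSK));       // true at dusk, in any season
    }
}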

Deciding a priori which kind of knowledge shall be encoded in the house, and to what extent learning shall be enabled, is a rather difficult task. However, the main goal of the ongoing research effort is not to provide all-encompassing solutions, or “human-level” intelligence for house automation systems. Instead, the main goal is to understand which advantages and functionalities can be obtained by applying available, off-the-shelf technology in rule-based systems and semantics to home automation systems, and to what extent such technologies can help ease the everyday life of elderly or disabled people.

The system presented here is only a first attempt at defining a house control architecture able to provide normal automation services as well as rule-based behaviors. Semantics is still implicit in the taxonomic organization of the controlled devices (functional level), while general-purpose knowledge and the related inferences still have to be introduced. However, the design effort aims at providing a platform architecture able to support the seamless integration of these functionalities.

9.1 Architecture

This section presents an overview of the Domotic House Gateway (DHG) developed by the author and his colleagues, followed by a more detailed description of the involved components. The proposed system has been designed with the intent of supporting intelligent information exchange among heterogeneous domotic systems that would not otherwise natively cooperate in the same household environment.

The concept of event is used to exchange messages between a device and the DHG. As will be seen, these low-level events are converted to logical events inside the DHG, so as to clearly separate the actual physical issues from the semantics that lies behind the devices and their role in the house (functional semantics). In this way, it is also possible to abstract a coherent and human-comprehensible view of the household environment.

The general architecture of the DHG is deployed as shown in Figure 9.1 and can be divided into three main layers: the topmost layer involves all the devices that can be connected to the DHG and their software drivers.


The drivers adapt the device technologies to the gateway requirements. The central layer is mainly responsible for routing low-level event messages to and from the various devices and the DHG, and also includes the generation of special system events to guarantee platform stability.

Figure 9.1. Domotic House Gateway Architecture.

The last layer is the actual core of intelligence of the system. It is devoted to event management at the logical (or semantic) level and, to this purpose, includes a rule-based engine that can be dynamically adjusted either by the system itself or manually through external interfaces. The rules define the expected reactions to incoming events, which can be generated either by the house (e.g., “the door is opening”) or by an application (e.g., “open the door”).

More complex scenarios may involve the rule engine: for instance, a rule might generate a “switch on the light in room x” event if two events occur: room x is dark and a sensor revealed the presence of somebody in that room. Additionally, an automatic rule-learning algorithm is under study: the logged events are processed to infer common event patterns that are very likely to be repeated in the future. New rules can first be proposed to the household and then possibly accepted and added to the existing rule set. Some interfaces permit checking, modifying, adding or deleting each of the rules. Different rule sets may eventually be used.

In the next sub-sections, a detailed description is presented for each block ofFigure 9.1.


9.1.1 Application Interfaces, Hardware and Appliances

In a domotic house, a person interacts with many devices that may be connected to the house gateway, from a simple light switch to a set-top box, from a mobile phone to a computer. The type of connection as well as the configuration of each device may vary and often depends on its use: low-cost wired domotic buses are well suited for controlling simple devices and smart appliances; wireless connections facilitate continuous intercommunication between different locations, but may be subject to interference and not always reliable; Ethernet connections offer a high bandwidth which can be used to transmit videos, etc.

Therefore, in order to uniformly control all the devices, it is necessary to offer a common point of aggregation, the DHG. Domotic systems usually include a control device that permits managing all the devices connected to the domotic bus; in this case, this single device can be connected to the DHG. Other recent smart appliances and devices (digital TVs, handhelds, etc.) can be accessed by modern domotic installations, but are often also accessible via computer-oriented connections, as they generally embed simple computer systems.

In general, any device may communicate with the DHG using its preferred protocol, as the information exchange is handled by a specific driver for each type of device. So, a web application running on a common PC could use the SOAP protocol; a simple terminal might use raw sockets; and so on.

The configuration of each device is based on a simple and generic structured (taxonomic) model of the house (which could actually be extended to more general environments), as depicted in Figure 9.2.

Figure 9.2. Sample device instances in a House Model.

Basically, at the low level each device is associated with the appropriate driver, while at the high (or logical) level the device is assigned a unique identifier and type, as well as a logical placement in a location of the house (e.g., the living room). For instance, a light might be registered as the 11th device within the driver that controls its domotic system (and which is presumably connected to its main control device), while its type is a light, supporting on and off events, and its location is the kitchen (Figure 9.3).


Of course, special devices such as application interfaces running on mobile devices can be located in “virtual” rooms.

<house>
  <room name="kitchen">
    <device name="light" devID="11"
            devType="Light" driver="BTicino" />
  </room>
  <room name="living room">
    <device name="set top box"
            devID="192.168.1.33"
            devType="STB" driver="STB" />
  </room>
</house>

Figure 9.3. Example of device configuration.

Configurations are provided as XML files, possibly specifying predefined sets of rules that can enhance the use of the devices. The device drivers are dynamically plugged into the DHG in order to support the most diverse house configurations, especially if these are only temporary (e.g., in the case of guest mobile devices) or change from time to time (e.g., a new computer has been added).

9.1.2 Device Drivers

The device drivers in the DHG are responsible for translating low-level or hardware states and activities of the devices (a light switch, a door, a software application, etc.) into events. As mentioned above, each device may need to use a specific protocol to communicate with the DHG; it is therefore necessary to develop specific drivers for each type of device. To this purpose, some simple guidelines are provided for the development and integration of new drivers; a minimal interface sketch is given below. When plugged into the DHG, each new driver must:

• register itself with a unique identifier;

• for each supported device:

– if the device type (class) does not exist, register it as a subclass of an existing type (e.g., root);

– register the new device with the associated type;

– correctly handle (receive and possibly send) events for each device according to its type (and “super-types”).


It should be noted that the registration of a new device type implies the extension of an existing type (class or concept), whose events are inherited, and may also involve the registration of new events (Figure 9.4).
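A minimal Java interface capturing these guidelines might look as follows. All names are hypothetical illustrations, since the actual DHG driver API is not specified here; the LED example anticipates the parallel-port driver described later, whose devices are addressed 0 to 7 and typed as lights.

// Hypothetical driver-side API reflecting the registration guidelines above.
interface GatewayRegistry {
    void registerDriver(String driverId);                    // unique identifier
    boolean hasDeviceType(String type);
    void registerDeviceType(String type, String superType);  // subclass of an existing type
    void registerDevice(String driverId, String devId, String type);
}

interface DeviceDriver {
    // Called when the driver is plugged into the DHG.
    void plugIn(GatewayRegistry registry);

    // Low-level event sent by the DHG to one of the driver's devices;
    // events generated by devices flow back through the Communication Layer.
    void onEvent(String devId, String event);
}

class LedDriver implements DeviceDriver {
    public void plugIn(GatewayRegistry registry) {
        registry.registerDriver("ParallelLEDs");
        if (!registry.hasDeviceType("Light")) {
            registry.registerDeviceType("Light", "root");
        }
        for (int i = 0; i < 8; i++) {           // LEDs addressed 0..7
            registry.registerDevice("ParallelLEDs", Integer.toString(i), "Light");
        }
    }
    public void onEvent(String devId, String event) {
        // drive the parallel-port line for devId ("switch on" / "switch off")
    }
}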

A number of predefined device types is initially provided, as well as a list of supported events for each device type. This information becomes part of the house knowledge base (or House Model), as it should facilitate the design of drivers and devices, especially in terms of interoperability.

Figure 9.4. Sample device types and supported events.

9.1.3 Communication Layer

The main task of the Communication Layer consists in routing low-level events to the correct destination: events coming from a device are sent to the Event Handler, whereas events from the Event Handler must be sent to the correct device driver for further processing (e.g., to switch on the correct light). The association between devices and drivers is created whenever a new device is instantiated by its corresponding driver. Additionally, a special instance property maintains the “address” (or ID) that identifies the device within the scope of the driver. For instance, a driver controlling a domotic system may use a single unique number for each of the devices under its control, whereas a driver that interfaces computer applications may use IP addresses or URLs.

The Communication Layer is also responsible for the management of some driver-related issues, such as loading configurations or handling possible driver errors. For instance, it may generate system events that are sent to the Event Handler for further logging or error recovery, thanks to special rules.


9.1.4 Event Handler

The Event Handler (EH) translates low-level events into logical events according to the house model, and vice versa. Its main task is the conversion of the device driver addressing (the instance ID) to a high-level name, which correlates the device to its function or role in the house. In this way, it is possible to hide device-specific addressing details from the Domotic Intelligence System (logic and/or rules), which allows the DHG to autonomously and automatically perform actions on the connected devices.

The uniqueness of a device instance is guaranteed by its logical name, which includes the location (even if fictitious) and a unique name within that location, consistently with the house model. For example:

{driver “BTicino”, devID “11”, event “...”} ⇔ {room “kitchen”, devName “light”, event “...”}
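This translation can be pictured as a bidirectional lookup between driver-scoped addresses and logical names, as in the following sketch; the class and key formats are hypothetical, while the BTicino/kitchen example comes from the text.

import java.util.HashMap;
import java.util.Map;

// Hypothetical bidirectional mapping used by the Event Handler.
public class EventHandlerMap {
    private final Map<String, String> lowToLogical = new HashMap<>();
    private final Map<String, String> logicalToLow = new HashMap<>();

    // Keys are simple composite strings: "driver/devID" and "room/devName".
    void register(String driver, String devId, String room, String devName) {
        String low = driver + "/" + devId;
        String logical = room + "/" + devName;
        lowToLogical.put(low, logical);
        logicalToLow.put(logical, low);
    }

    String toLogical(String driver, String devId) {
        return lowToLogical.get(driver + "/" + devId);
    }

    String toLowLevel(String room, String devName) {
        return logicalToLow.get(room + "/" + devName);
    }

    public static void main(String[] args) {
        EventHandlerMap map = new EventHandlerMap();
        map.register("BTicino", "11", "kitchen", "light");
        System.out.println(map.toLogical("BTicino", "11"));     // kitchen/light
        System.out.println(map.toLowLevel("kitchen", "light")); // BTicino/11
    }
}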

9.1.5 House Model

As already mentioned, the house model represents, with a structured and logical scheme, the house devices and the events they support. The house environment is subdivided into a collection of rooms, and for each room the corresponding devices are specified as instances of a supported device type. This representation facilitates the design and configuration of the various devices, as well as their utilization, even through the definition of scenarios, which coordinate the control of multiple devices through a single action (e.g., pressing a button).

Besides the house configuration, the device types are also structured in a taxonomy, in order to explicitly correlate devices as specific subclasses of simpler devices: for example, a light dimmer is a specific case of light. Each device type is linked to the supported events, and these links are automatically inherited by the descendant device subtypes (or subclasses), to guarantee that specific devices can always be controlled as a simpler ancestor. So, for instance, it should always be possible to use a dimmer as a simple light bulb, supporting the “switch on” and “switch off” events (Figure 9.4).

The house model is (re)populated whenever a driver is plugged into the DHG, and new device types may be registered, as well as the new events they may support. A minimal set of device types and events is provided through a built-in fictitious device driver, which may be used for testing.

The house model is also used to map the existing devices, room by room, to the correct driver and to a registered device type and events, according to simple XML configuration files. In addition, it may be exploited by the Domotic Intelligence System to improve its intelligence and dependability, though at the cost of increased complexity.


9.1.6 Domotic Intelligence System

The DIS permits generating new (logical) events at run time, based either on events coming from the house through the Event Handler or on predefined or inferred “rules” that may, for instance, act at a specified time. The current implementation of the DIS is based on a run-time engine that processes rules according to the events received from the Event Handler.

When certain conditions are met, new events may be generated and sent back to the Event Handler, which routes them to the correct devices through the Communication Layer. The rules can be preloaded, added manually via external interfaces such as a simple console, or added through the Rule Miner, which examines the event log (see the Event Logger) to infer new rules. At the moment, some rule control mechanisms are being studied to prevent annoying or dangerous situations. At the very least, it is possible to save the rules and the status of the DIS at any time for future reloading.

9.1.7 Event Logger

The Event Logger receives events from the Event Handler and saves them in a file, in order to facilitate the identification of possible erratic behaviors. Both logical events and system events can be logged, and through external interfaces it is possible to specify filters to limit the amount of stored data. The Event Logger is also used by the Rule Miner to (semi-automatically) generate rules for the Domotic Intelligence System.

9.1.8 Rule Miner

The Rule Miner tries to infer new rules for the Domotic Intelligence System by reading and analyzing the event log. The idea is to identify common event patterns so as to forecast and automatically generate events according to such patterns. However, this is still work in progress, as it is also necessary to take into account dangerous situations as well as possible conflicting actions, which should actually prevent the execution of certain inferred rules. To this purpose the Rule Miner might exploit the House Model and other sources of information as a knowledge base, to achieve a more consistent understanding of what the events represent, and therefore of what is happening in the house.


9.1.9 Run-Time Engine

The Run-Time Engine (RTE) receives events at run time from the Event Handler, parses them according to a rule-based system and, if such input events meet certain conditions, generates new events, which are then sent to the Event Handler and consequently to the Communication Layer for routing to the devices. The event processing and generation mechanisms could actually be implemented using any technique (generally related to Artificial Intelligence). However, according to the authors, techniques like artificial neural networks or genetic algorithms make it rather difficult to precisely and securely control events. Therefore, as mentioned from the beginning, the current implementation of the RTE is based on a rule system, namely the open source Drools framework [45].
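Independently of the Drools API, the condition-action scheme the RTE relies on can be sketched as follows. The Rule and Event types are hypothetical simplifications, and the sample rule mirrors the door/LED rule used later in the experimental setup (the room names are placeholders).

import java.util.ArrayList;
import java.util.List;

// Hypothetical condition-action rule abstraction, in the spirit of the RTE.
public class RuleEngineSketch {
    record Event(String room, String device, String action) {}

    interface Rule {
        boolean matches(Event e);     // condition part
        Event consequence(Event e);   // action part: a new logical event
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(Rule r) { rules.add(r); }

    // Fire all rules matching an incoming event; return the generated events.
    List<Event> process(Event incoming) {
        List<Event> generated = new ArrayList<>();
        for (Rule r : rules) {
            if (r.matches(incoming)) {
                generated.add(r.consequence(incoming));
            }
        }
        return generated;
    }

    public static void main(String[] args) {
        RuleEngineSketch rte = new RuleEngineSketch();
        // "whenever the entrance door is opened, switch LED 4 on"
        rte.addRule(new Rule() {
            public boolean matches(Event e) {
                return e.device().equals("entrance door") && e.action().equals("opened");
            }
            public Event consequence(Event e) {
                return new Event("lab", "led4", "switch on");
            }
        });
        System.out.println(rte.process(new Event("hall", "entrance door", "opened")));
    }
}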

In this way, the status of the engine can easily be initialized by loading a configuration file. Additional configuration files or user interfaces (such as a simple command-line parser or even interfaces based on Natural Language Processing techniques) permit modifying, restoring or saving the RTE status at run time, in order to manually adjust or fix improper or erratic settings.

9.1.10 User Interfaces

A number of different interfaces (a console, a Natural Language Processing interface, other graphical user interfaces) can be provided to permit the manual configuration of the Run-Time Engine, so as to keep the Domotic Intelligence System fully under the user's control. At present, a simple command-line interface is available, but an NLP-based interface is under study to facilitate interaction with non-expert users.

9.2 Testing environment

In order to test the actual feasibility of the proposed system, which has been developed in Java mainly for portability reasons, a number of different devices and interfaces have been used. They are briefly presented in the following paragraphs and explained in more detail in subsequent sections, followed by the description of the actual experimental setup. The most relevant element is the domotic house near the authors' laboratories, equipped with a home automation system produced by BTicino, a leading Italian manufacturer of house electrical equipment. The house is part of a scientific and technological park maintained by C.E.T.A.D. [46] and dedicated to the promotion, development and diffusion of technologies and innovative services for the rehabilitation and social integration of elderly and disabled people.

To complete the picture of the tested devices, two additional elements are to be cited.


The first is a simple parallel port connected to eight small LEDs and driven by a common PC running Linux. The second is MServ [47], an open source program, again running under Linux, capable of remotely choosing and playing music files.

9.2.1 BTicino MyHome System

The MyHome system [48] developed by BTicino is a home automation system able to provide several functionalities, meeting the increasing demand of users for smart and intelligent houses. These functionalities cover several aspects of domotics, such as comfort configurations, security, energy saving, remote communication and control. The common framework in which every available device is deployed is based on a proprietary bus called the “digital bus”, which conveys messages among the connected devices and provides them with the required electrical power.

The most salient characteristic of the BTicino system is what they call the control sub-system, i.e., the ability to supervise and manage a home by using a PC, a standard phone or a mobile phone. Control in the BTicino system can be either local, through a LAN connection, as tested in this work, or remote, through an Internet connection or a telephone network. Through a proprietary protocol, it permits managing all the devices of the house, e.g., lights, doors, shutters, phones, etc. This component made it possible to interface the BTicino system to the DHG by simply exploiting its features, instead of connecting each device to the DHG; therefore, a single driver has been prepared to handle the communication between the control and the DHG.

Additionally, a number of specific modules have been provided to manage basic devices such as lights, doors, shutters and the like. It should also be noted that this driver must poll the control to check the status of the domotic devices and convert the returned information into events for the DHG. This is due to the fact that the BTicino system installed in the testing environment does not support events natively.

9.2.2 Parallel port and LEDs

Eight Light-Emitting Diodes have been wired to the data lines of a standard IEEE 1284 parallel port connected to a PC running Linux. The driver for the DHG is based on a simple TCP/IP server that drives the parallel port as a generic parallel device using Linux-specific calls, so that the DHG is able to control each of the eight lights. They have been used to test the rule system, as well as to demonstrate the flexibility of the proposed work: legacy or special-purpose devices are in fact still based on this simple technology (the standard RS-232 being another example).


9.2.3 Music Server (MServ)

This open source Linux application is basically a music player that can be controlled remotely, therefore acting as a server. It exposes a TCP/IP connection to receive commands such as play, next song, stop, etc., and to provide its status (e.g., which song is being played, as songs are normally chosen at random). MServ is normally accessible through a simple telnet application, but HTTP CGI and several GUI applications are also available for easier interaction with the system.

The DHG driver for MServ is therefore rather simple in this case too. In fact, it only needs to exchange simple strings through a TCP/IP connection and to parse them appropriately (some informative messages may relate to status changes caused by other connected clients, etc.). Events like play and stop are accepted as commands, while relevant status messages are sent to the DHG for logging purposes since, as of now, no visual interface has been provided.
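A minimal sketch of the kind of string exchange such a driver performs is shown below. The host, port and exact command strings are placeholders: the actual MServ protocol details are not reproduced here.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical string-based client, in the spirit of the MServ driver.
public class MservClientSketch {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 4444); // placeholder host/port
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("play");              // command derived from a DHG event
            String status = in.readLine();    // status line to forward for logging
            System.out.println("MServ: " + status);
        }
    }
}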

9.2.4 Experimental Setup

The DHG has been installed on a common PC running a Java Virtual Machine (JVM), with an AMD 1800+ processor and 512 MB of RAM (the DHG is expected to work fine on considerably less powerful PCs). Two other similar PCs have been used for the parallel port and the MServ application, respectively. A simple Ethernet switch served as the physical connection for the three computers and the BTicino control server, which also exposes a standard RJ45 connector and supports TCP/IP connections.

Once all the systems were ready and functional, the three drivers mentioned in the previous sections were plugged into the DHG. The house model has been provided as a simple XML configuration file, specifying the actual device instances viewable through each of the drivers. So, for instance, each of the LEDs has been identified by a number from 0 to 7, as this is sufficient for its driver, while their type is a light, as all of them can only be switched on or off.
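Following the format of Figure 9.3, the corresponding configuration entries might look like this; the room and device names are illustrative, and only the 0-7 addressing and the Light type come from the text:

<room name="laboratory">
  <device name="led0" devID="0" devType="Light" driver="ParallelLEDs" />
  <device name="led1" devID="1" devType="Light" driver="ParallelLEDs" />
  <!-- ... up to devID="7" -->
</room>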

Some basic rules have also been added to the DHG, mainly to test the rule system and to understand whether any unexpected issue might arise in practice. For example, whenever the entrance door is opened, a rule is activated and generates an event that makes LED 4 switch on; conversely, when the door is closed, another rule sends a “switch off” event to the same LED.

9.3 Preliminary Results

Two aspects have been considered: the device drivers and the rule engine. In the first case, all the drivers have been created with rather little effort (a few man-hours each, including debugging), and the interaction with the DHG proved to be very stable, as no crashes have ever been registered.


The simple rules inserted into the system were also appropriately executed. For instance, the light was automatically switched on after the windows in the same room were all shuttered, the shutters being controllable domotic devices. However, particular attention is necessary when inserting rules into the system. In fact, referring to the previous example, if the shutters remain closed one must still be able to switch off the light without causing the system to switch it on again shortly thereafter. These types of issues, as well as cases involving more critical devices, such as an oven that may contain improper items, are to be considered with great care, especially when designing an automatic way to infer new rules.


Chapter 10

Related Works

This chapter presents the related works in the field of both Semantic Weband Web Intelligence, with a particular focus on semantic platforms andsemantics integration on the Web.

In May 2001, Tim Berners-Lee published the Semantic Web manifesto in Scientific American; in that article he proposed a new vision of the web: “The Semantic Web (SW) is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” In his view, the next generation of the web will be strongly based on semantics, in order to allow effective communication between humans and machines, leading to a powerful collaboration between them in accomplishing tasks.

As he said, the Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users. Such an agent coming to a clinic's Web page will know not just that the page has keywords such as “treatment, medicine, physical, therapy” (as might be encoded today) but also that Dr. Hartman works at that clinic on Mondays, Wednesdays and Fridays [7].

The ideas formalized by Berners-Lee emerged after years of research on Artificial Intelligence and from the relatively recent research on the Web, and gathered many researchers from all over the world, promoting further exploration toward the next generation of the web. During the past five years the Semantic Web community has been one of the most active research communities in the world, producing many diverse technologies and applications trying to turn the SW vision into reality. Many milestones have been reached in this endless and exciting process; in particular, a sufficiently large agreement on semantics integration on the web has been reached.

Actually, there is no unique recipe for inserting “meaning descriptors” into the existing web, but it is quite clear what the main requirements are for the development of scalable and useful semantic applications.


Researchers found that, for an effective inclusion of semantics on the current web, the meaning information should be definable by people or machines potentially different from the content creators, and the commonly agreed way to fulfill such a requirement is the definition of entities called “semantic annotations” pointing at the described resources. Consequently, several works proposed techniques to provide semantic information through independent annotations, offering services for annotation editing, storage and retrieval [49].

As those systems reached a significant diffusion in the academic world, some problems were noticed; in particular, it became clear that the task of manually annotating the whole existing web was not feasible, so the subsequent evolution in research involved the design of automatic annotation platforms.

10.1 Automatic annotation

A number of annotation tools for producing semantic tags are currently available. Protege 2000 [50] is a tool supporting the creation of ontologies either in RDF/RDF-S [4] or OWL [5]. Annotea [51] provides RDF-based markup of web pages, but it does not support automatic information extraction and is not well integrated in semantics-powered publication frameworks. OntoAnnotate [52] is a framework allowing manual and semi-automatic annotation of web pages. AeroDAML [53] is a tool that, starting from an ontology and a given web page, produces a semantically tagged page that should be validated by humans.

Several projects aiming at providing automatic annotation tools have also been developed, trying to overcome the heavy burden of manually annotating the whole web. A. Dingli et al. [33] proposed a methodology for learning to automatically annotate domain-specific information from large repositories, requiring minimal user intervention. Dill, Eiron et al. [54] proposed a combination of two tools, named "SemTag" and "Seeker" respectively, for enabling the initial semantic web bootstrap by means of automatic semantic annotation. They applied their platform to a very large corpus of approximately 264 million web pages and generated 434 million corresponding pieces of semantic metadata.

There are also some holistic approaches that try to provide annotation services in the context of comprehensive platforms designed to support all the tasks required to provide semantically retrievable information.

The KIM platform [55], as an example, is a semantically enhanced information extraction system which provides automatic semantic annotation with references to ontology classes and instances. The system has been tested on about 0.5 million news articles and proved stable and sufficiently effective. Using the extracted annotations, it also offers semantics-based indexing and retrieval, where users can mix traditional IR (information retrieval) queries and ontology-based ones. Similarly, the Mondeca [56] system provides semantic indexing, annotation and search facilities together with a semantics-powered publication system.

While the first approaches are only partial attempts to solve the inclusion of semantic information on the web, the second ones are more general, since they try to take into account all aspects of semantic information retrieval. However, they differ from the approach proposed in this thesis in that they are standalone, monolithic solutions in which the user or the developer is forced to adopt the "producer" tools and publication workflows, without the ability to seamlessly integrate the provided functionalities into existing applications. In a sense, while the first approaches differ from the one proposed in this thesis because they address only specific sub-areas of the semantic information retrieval process, the second ones differ in philosophy, adopting an "application-level" approach rather than a "middleware" one.

10.2 Multilingual issues

The relationship between ontology development and multilingual support can be seen from two different points of view: using an ontology to ease cross-lingual correspondences [57], or developing ontologies that are usable with different languages [58]. The approaches for tackling this problem are various, and Oard [59] classified them as Controlled Vocabulary or Free Text approaches. Clearly, Internet-scale applications require Free Text solutions, where semantics can be inferred by textual and statistical techniques (Corpus-based, in Oard's terminology) or by semantic ones (Knowledge-based). Among the semantic approaches, Agnesund [60] distinguishes between purely conceptual ontologies, language-specific ones, and the so-called interlingua ontologies, which should be able to represent all the distinctions that can be made in all (or a subset of) languages. Two well-known efforts to extend the WordNet lexical network [13] to support multilinguality are EuroWordNet [58] and MultiWordNet [61]. These approaches tend to develop language-specific sub-networks, integrated with a common conceptual top-level hierarchy and enriched by "synonymous" relationships. EuroWordNet [62] explicitly models concepts in different languages, and then builds an "Interlingual Index" composed of disambiguated subsets of the language-specific concepts that are common to all the modeled languages. MultiWordNet explicitly recognizes the presence of "lexical gaps" in the correspondence between different languages, due to missing direct translations of some words. These approaches in some sense inherit the general WordNet structure, and cannot be truly interlingual.

The approach proposed here is simpler: it aims at developing a conceptual ontology and at linking it with lexical representations in different languages. Instead of trying to model the meaning of words in all languages, shared concepts are defined, together with the words (or sentences) that express their meaning in each supported language.
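
As a concrete illustration of this idea, the sketch below builds one language-neutral concept and attaches language-tagged lexical entries to it using the Jena toolkit [23], which is adopted elsewhere in this thesis. It is a minimal sketch, not the actual H-DOSE code: the URI, the concept and the labels are invented for the example.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDFS;

    public class MultilingualConcept {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // The shared, language-independent concept (URI is illustrative only).
            Resource dog = model.createResource("http://example.org/onto#Dog");
            // Lexical representations attached as language-tagged labels.
            dog.addProperty(RDFS.label, model.createLiteral("dog", "en"));
            dog.addProperty(RDFS.label, model.createLiteral("cane", "it"));
            dog.addProperty(RDFS.label, model.createLiteral("chien", "fr"));
            model.write(System.out, "RDF/XML-ABBREV");
        }
    }

In this way, a term in any supported language can be resolved to the same shared concept by matching against its labels.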


10.2.1 Term-related issues

The words associated with each ontology concept, for each supported language, play a crucial role in the process of automatic or semi-automatic annotation of web resources. Simple "bag of words" techniques can use these words as a means to automatically annotate a given document: the occurrence of a term in a document, in fact, indicates that the document is probably related to the associated concept. Other types of mapping rely on Natural Language Processing (NLP) [63] [53] and adaptive information extraction [64]. All of them require a preliminary learning and training phase; this phase depends on the scenario in which the mapping is applied, on the ontology and on the corpus of documents to be classified (i.e., on the significant terms).
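
To make the bag-of-words mapping concrete, the following minimal sketch (an illustration, not the annotator actually used in this work) associates each concept with a set of trigger terms and annotates a document with every concept whose terms occur in it:

    import java.util.*;

    public class BagOfWordsAnnotator {
        // concept URI -> set of trigger terms for that concept
        private final Map<String, Set<String>> conceptTerms = new HashMap<>();

        public void addConcept(String concept, String... terms) {
            conceptTerms.put(concept, new HashSet<>(Arrays.asList(terms)));
        }

        /** Returns the concepts whose trigger terms occur in the document text. */
        public Set<String> annotate(String document) {
            Set<String> tokens = new HashSet<>(
                    Arrays.asList(document.toLowerCase().split("\\W+")));
            Set<String> annotations = new HashSet<>();
            for (Map.Entry<String, Set<String>> e : conceptTerms.entrySet()) {
                for (String term : e.getValue()) {
                    if (tokens.contains(term)) {
                        annotations.add(e.getKey());
                        break; // one matching term is enough for this concept
                    }
                }
            }
            return annotations;
        }
    }

Real systems refine this scheme with term weighting, stemming and the disambiguation techniques discussed below.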

Sense disambiguation becomes compulsory in this scenario, as terms can assume different meanings, even in the same document, depending on the context in which they are used. The focus-based approach to synset expansion introduced in chapter 5 takes this issue into account, being inspired by the work of Bouquet et al. [15] on this theme, and it is only one among many possible approaches. Sense disambiguation, in fact, is strictly related to the evaluation of semantic similarity.

Evaluating semantic similarity corresponds, in the most straightforward solutions, to measuring the distance (in terms of "isA" links) between the nodes corresponding to the concepts being compared. A widely acknowledged problem of this approach, however, is that it relies on the notion that links in the taxonomy represent uniform distances. In real taxonomies this is often false, which affects the similarity evaluation and makes results unreliable. Other methods have therefore been proposed to overcome this issue: Resnik [65], as an example, proposed to exploit the information content of the taxonomy and, on the basis of probabilistic evaluations, defined a similarity between two concepts. The limiting factor of his approach is the implicit evaluation of the statistical distribution of the taxonomy concepts, which cannot be calculated easily. Another interesting approach builds on the so-called Conceptual Density (Agirre and Rigau [66]): given a word and a context (represented by other words), the Conceptual Density evaluates the closeness, in the ontology, of the context terms to the given word. Unfortunately this measure is also based on the evaluation of distances between words in the taxonomy and, while in the cited solutions it is calculated in a more sophisticated way that avoids some of the problems above, it still lacks precision. Moreover, since every word of the context may participate in different senses and nodes of the ontology, the complexity of the algorithm is high and the possibility of mis-recognizing the correct sense of the word is not negligible.
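
For reference, Resnik's measure [65] can be stated compactly. Writing p(c) for the probability of encountering an instance of concept c in a reference corpus, the information content of a concept and the similarity of two concepts are:

    \mathrm{IC}(c) = -\log p(c), \qquad
    \mathrm{sim}_{\mathrm{Resnik}}(c_1, c_2) = \max_{c \,\in\, S(c_1, c_2)} \mathrm{IC}(c)

where S(c_1, c_2) is the set of concepts that subsume both c_1 and c_2 in the taxonomy. The maximum is attained at the most informative common subsumer; estimating p(c) is precisely the statistical evaluation indicated above as the limiting factor of the approach.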


10.3 Semantic search and retrieval

The currently available literature contains several approaches related to the design of semantic search engines. One of the most recent is the work of Rocha, Schwabe et al. [67], who propose a new approach to semantic search combining traditional engines and spread-activation techniques. They stress the importance of taking into account sub-symbolic information for improving the search results provided by means of ontology navigation. They therefore associate a weight with each ontology relation, in order to distribute throughout the whole ontology the "activation" of the single ontology nodes identified by the user query. The main difference between this solution and the one presented in the previous chapters is the adoption of the ontology-instance framework as operating environment.
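
The core of the spread-activation idea can be sketched in a few lines. The fragment below is a deliberately simplified reading of [67], with several assumptions of its own (a fixed number of pulses, purely multiplicative attenuation, no fan-out or distance constraints); it is meant only to show how relation weights distribute the activation of the query nodes over the rest of the ontology:

    import java.util.*;

    public class SpreadActivation {
        /** adjacency: node -> (neighbor -> relation weight in [0,1]) */
        private final Map<String, Map<String, Double>> graph = new HashMap<>();

        public void addRelation(String from, String to, double weight) {
            graph.computeIfAbsent(from, k -> new HashMap<>()).put(to, weight);
        }

        /** Propagates activation from the query nodes for a fixed number of pulses. */
        public Map<String, Double> activate(Set<String> queryNodes, int pulses) {
            Map<String, Double> activation = new HashMap<>();
            for (String n : queryNodes) activation.put(n, 1.0);
            for (int i = 0; i < pulses; i++) {
                Map<String, Double> next = new HashMap<>(activation);
                for (Map.Entry<String, Double> e : activation.entrySet()) {
                    Map<String, Double> out = graph.get(e.getKey());
                    if (out == null) continue;
                    for (Map.Entry<String, Double> arc : out.entrySet()) {
                        // Each neighbor accumulates the source activation,
                        // attenuated by the weight of the crossed relation.
                        next.merge(arc.getKey(),
                                e.getValue() * arc.getValue(), Double::sum);
                    }
                }
                activation = next;
            }
            return activation; // higher values = more relevant to the query
        }
    }

Nodes accumulating the highest activation after the final pulse are returned as the most relevant results.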

Another semantic searcher has been presented in [68]. It relies on a Semantic Web infrastructure and aims at improving traditional web searches by integrating relevant results with data extracted from distributed sources. The user query is also mapped onto a set of ontology concepts, and ontology navigation is performed. However, this process only provides other concept instances that are strongly related to the query ones, by means of a breadth-first search. The issue of performing semantic inference by means of graph navigation is not addressed and, moreover, the system does not work on annotations.

The paper by Stojanovic, Studer et al. [69] proposes a new paradigm for ranking query results in the Semantic Web. As they state, traditional IR approaches evaluate the relevance of query results by analyzing the underlying information repository. On the other hand, since semantic search is supported by an ontology, other relevant resources can be considered for assessing the relevance of results: the structure of the underlying domain and the characteristics of the search process.

QuizRDF [70] is an interesting system proposed by Davies et al., from BTexact Technologies, which combines traditional keyword querying on web resources with the ability to query against a set of RDF annotations about those resources. Resources, as well as RDF annotations, are indexed by the system, providing means for keyword queries on both bases. The resulting index thus allows queries against both the full text of documents and the literal values that occur within RDF annotations, along with the ability to browse and query the underlying ontology. As stated by the authors, the approach they propose is "low threshold, high ceiling", in the sense that where RDF annotations exist they are exploited to improve the information-seeking process, but where they do not exist a simple search capability is still available. Although the approach is powerful, particularly in its ability to combine traditional searches with ontology-based searches, no inference or navigation is supported, thus failing to capture the semantic relationships between ontology concepts in the search task.
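
The dual-index idea can be illustrated with a toy inverted index. This sketch is only an assumption-laden simplification of the description above (QuizRDF's actual data model, based on RDF(S), is richer): each resource is indexed both on its full text and on the literal values of its annotations, and a single keyword query searches both bases.

    import java.util.*;

    public class DualIndex {
        private final Map<String, Set<String>> textIndex = new HashMap<>();
        private final Map<String, Set<String>> annotationIndex = new HashMap<>();

        private static void index(Map<String, Set<String>> idx,
                                  String uri, String text) {
            for (String tok : text.toLowerCase().split("\\W+"))
                idx.computeIfAbsent(tok, k -> new HashSet<>()).add(uri);
        }

        public void addResource(String uri, String fullText,
                                List<String> rdfLiterals) {
            index(textIndex, uri, fullText);             // document body
            for (String lit : rdfLiterals)
                index(annotationIndex, uri, lit);        // annotation literals
        }

        /** Keyword query over both document text and annotation literals. */
        public Set<String> search(String keyword) {
            String k = keyword.toLowerCase();
            Set<String> hits = new HashSet<>();
            hits.addAll(textIndex.getOrDefault(k, Collections.emptySet()));
            hits.addAll(annotationIndex.getOrDefault(k, Collections.emptySet()));
            return hits;
        }
    }

The "low threshold, high ceiling" property falls out naturally: resources without annotations are still reachable through the text index alone.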

A motivating work for the integration of semantics in the search process is the early work on semantic search engines by Heflin and Hendler [71]. In their approach, the user is allowed to perform queries by example, using data from ontologies. Basically, the user is presented with a set of ontologies about different domains. He selects the ontology related to the domain in which he wants to perform a search, and a tree navigation interface is then provided in order to select concepts and instances similar to the desired ones. The user can therefore select the resources he is interested in, and a search template is automatically built from that selection. Finally, the search is issued to the query subsystem, which provides the relevant results.

10.4 Ontology visualization

The existing techniques for the visualization of ontologies can be summarized in four main visual schemes, possibly cooperating in more complex scenarios: network, tree, neighborhood, and hyperbolic. The network view represents an ontology as a generic network of connected elements and is usually exploited when the knowledge elements cannot be conveniently organized in hierarchies. The tree (or hierarchical) view, instead, is generally used for more structured ontologies. However, the simple hierarchical representation provided by this view is unable to represent connections between distinct sub-trees that violate the dominant taxonomic structure. In such a case, the connections violating the hierarchy are indicated in separate views, thus complicating the navigation of the structure. The most common examples of tree views are based on indentation, as in file system browsers, or on diagrams with nodes and arcs. A tree-map view has also been proposed by Shneiderman [72], at the University of Maryland, which uses nested rectangles to represent sub-classes (Figure 10.1, C).

Figure 10.1. Tree views: indented (A), nodes and arcs (B), tree-map (C).

The main advantage of tree views is that they can be displayed with rather little effort in comparison with network-oriented views. More importantly, entire sub-trees can be easily collapsed (i.e., temporarily hidden) to concentrate the attention on the rest of the knowledge base. The next two schemes apply similar principles to network-based structures: in fact, both the neighborhood and the hyperbolic views (Figure 10.2) focus the attention on a chosen node and its nearest neighbors. In the former case only the semantically nearest nodes are displayed, whereas in the latter case the nodes are placed on a semi-spherical surface, projected onto the visual window, thus magnifying the central nodes while shrinking the peripheral ones.

Figure 10.2. Neighborhood View (A), Hyperbolic View (B).

The aforementioned representation schemes have been utilized in numerous applications with assorted enhancements.

Protege [37] is an open source ontology editor providing support for knowledge acquisition. Its framework natively allows the interactive creation and visualization of classes in a hierarchical view. Each concept in the tree can be displayed along with additional information about the related classes, properties, descriptions, etc., which can all be quickly edited. Other panels manage class instances, alternative user interfaces, queries, and possibly other extensions, which can be easily added to the framework as plug-ins. In particular, various plug-ins are available for enhancing the visualization of the ontology, and they are therefore discussed here.

The OntoViz [41] plug-in displays a Protege ontology as a graph by exploiting an open source library optimized for graph visualization (GraphViz [39]). Intuitively, classes and instances are represented as nodes, while relations are visualized as oriented arcs. Both nodes and arcs are labeled and arranged in a way that minimizes overlapping, but not the size of the graph. Therefore, the navigation of the graph, supported only by magnification and panning tools, does not provide a good overall view of the ontology, as the graphical elements easily become indistinguishable.

This problem is less critical in Jambalaya [40, 73], another ontology viewer for Protege, based on a tree-map scheme, or rather on nested interchangeable views, namely Simple Hierarchical Multi-Perspective (SHriMP). SHriMP is a domain-independent visualization technique designed to enhance how people browse and explore complex information spaces. An animated view of the ontology graph facilitates navigation and browsing at different levels of abstraction and detail, both for classes and relations, while keeping the learning curve low through the well-known zooming and hypertext-link paradigms. However, text labels and symbols tend to overlap when the ontology grows in complexity, and it becomes difficult to understand the relations among classes or instances.

TGViz [74], similarly to OntoViz, visualizes Protege ontologies as graphs. In this case, however, the placement of nodes and arcs is computed using the spring layout algorithm implemented in the Java TouchGraph library [75]. The main advantage of this approach is the optimized exploitation of the two-dimensional space in which the nodes and arcs are dynamically distributed. However, the level of detail is not adjusted according to the zoom level, often resulting in overcrowded pictures.

The ezOWL [76] plug-in, differently from the previous viewers, enhances Protege with graph-based editing of ontologies, though reducing to a minimum the optimizations for the graph organization. Even in this case it may be difficult to maintain both a good understanding of the overall ontology and a sufficient level of detail about a chosen sub-graph.

OntoEdit [77] is a commercial Java-based tool that, similarly to Protege, offers a graphical environment for the management and development of ontologies, and can be enhanced with various plug-ins. In particular, the Visualizer plug-in proposes a two-dimensional graph-based view of the ontology, using colored icons as nodes accompanied by contextual hints (such as colored borders or spots in addition to the usual labels), which unfortunately are often hidden or overlapping.

IsaViz [78] is another graph-based visual editor for RDF models based on the GraphViz library. In this case, the principal enhancement over the previously mentioned graph-based approaches is the Radar View which, similarly to Jambalaya, displays a simplified network overview of the whole ontology in a small window, highlighting the currently edited region with a rectangle. In addition, icons and colors are also exploited to condense information, while different visualization styles and layouts are supported through the GSS (Graph Style Sheet) language, derived from the well-known CSS (Cascading Style Sheets) and SVG (Scalable Vector Graphics) W3C recommendations. However, it is still not possible to customize the level of detail for big ontologies.

OntoRama [79] is an ontology browser for RDF models based on a hyperbolic layout of nodes and arcs. As the nodes in the center are given more space than those near the circumference, they are visualized with a higher level of detail, while maintaining a reasonable overview of the peripheral nodes. In addition to this pseudo-3D space, OntoRama also introduces the idea of cloned nodes. Since the browser supports generic ontologies, with properties for classes, multiple relations, sub-classing, and multiple inheritance, certain nodes and their sub-trees are cloned and visualized multiple times so that the number of crossing arcs can be reduced and the readability enhanced. The duplicated nodes are displayed using an ad-hoc color in order to avoid confusion. Unfortunately, this application does not support editing and can only manage RDF data. Finally, the approaches and functionalities of each of the mentioned applications are summarized in the following table.

                                       View scheme
 Viewer        Editor     Network    Hierch.    Neighb.    Hyperb.
 Protege         ✓           ×          ✓          ×          ×
 OntoViz         ×           ✓          ×          ×          ×
 Jambalaya       ×           ✓          ✓          ✓          ×
 TGViz           ×           ✓          ×          ✓          ×
 ezOWL           ✓           ✓          ×          ×          ×
 OntoEdit        ✓           ×          ✓          ×          ×
 Visualizer      ✓           ✓          ×          ✓          ×
 IsaViz          ✓           ✓          ×          ×          ×
 OntoRama        ×           ✓          ✓          ✓          ✓

Table 10.1. Summary of ontology visualization tools.

10.5 Domotics

In the context of domotics, solutions for device connectivity are driven by local commercial leaders. North America, Europe and Japan are three main areas oriented to different wiring technologies and protocols, and the picture gets even more confused when looking at the nation-wide level. However, three main approaches can be identified. The installation of a specific and separate wired network is a common approach, based either on proprietary solutions (EIB/KNX [80], LonWorks [81], etc.) or on more widespread technologies (e.g., Ethernet, Firewire), the latter being preferred for computer-related networks and wide-bandwidth requirements. On top of these, various protocols are implemented (X10 [82], CEBus [83], BatiBus [84], HBS, just to name a few) and none of them has yet prevailed on a global scale. The Konnex open standard [85], involving EIB, EHS and BatiBus, is one of the major initiatives in Europe aiming at global interoperability.

Another common approach is based on the reuse of existing wiring, such as power or phone lines (the former being more frequently used as a carrier, as in EHS, X10 and Homeplug [86]). However, these solutions generally present higher noise levels than dedicated wiring and are therefore less versatile. Lastly, wireless technologies are becoming more and more attractive, mostly based on either infrared technology or radio links (e.g., IEEE 802.11b or Bluetooth [87]); the latter are usually adopted to guarantee connectivity over larger distances.

As each of these alternatives is better suited to a given context, they are still far from converging into a unique solution, partly because of strong commercial influences. Instead, in a home one can frequently find different technologies coexisting, such as Ethernet for multimedia connectivity and Bluetooth for personal connectivity, while dedicated buses reliably handle home automation tasks.

However, as processing power comes at ever lower costs, technologies that are well established for common PCs are being transferred to numerous devices. It is therefore not surprising that simple computer systems are being used to bridge this interconnection gap, especially as concerns ambient intelligence. In this context, most of the efforts are confined to limited domains: each domotic system vendor usually proposes intelligent approaches to the management of the supported devices, and remote control is often also offered through the Internet or via a phone call. However, the intelligence provided by these systems is generally basic, unless enhanced using a more complex and versatile device, such as a computer. In effect, some programming software can be found, but it is generally oriented only to the control of one specific technology, and offers little support for ambient intelligence techniques (ETS [88], commercial; EIBcontrol [89], open source).

Furthermore, smart devices and information appliances (like PDAs, set-top boxes, etc.) are not usually produced by domotic system vendors, and are only recently facing the intercommunication and standardization issues related to domotic systems [90]. A "neutral" device capable of interfacing all these parts therefore becomes desirable. In this context, the OSGi Alliance [91] has already proposed a rather complete and complex specification for a residential gateway (and the like). Here, protocol and integration issues, as well as intelligence-related behaviors, are delegated to third-party "bundles" that can be dynamically plugged into the OSGi framework at run-time. An additional effort is therefore required to enhance the system with a compact yet general approach to intelligent and automatic interoperability among the involved devices.
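
For illustration, a bundle contributes behavior to an OSGi gateway by implementing the standard BundleActivator interface; the framework calls start() and stop() as the bundle is plugged in or removed at run-time. In this sketch the DeviceDriver service and the X10Driver class are hypothetical examples, not part of the OSGi specification:

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;
    import org.osgi.framework.ServiceRegistration;

    // Hypothetical service interface exposed by this bundle.
    interface DeviceDriver { void send(String command); }

    // Hypothetical driver for an X10 power-line bus.
    class X10Driver implements DeviceDriver {
        public void send(String command) { /* write to the X10 bus */ }
    }

    public class GatewayBundle implements BundleActivator {
        private ServiceRegistration registration;

        public void start(BundleContext context) {
            // Register the driver so other bundles in the gateway can discover it.
            registration = context.registerService(
                    DeviceDriver.class.getName(), new X10Driver(), null);
        }

        public void stop(BundleContext context) {
            registration.unregister();
        }
    }

The dynamic registration and lookup of services is what allows a residential gateway to gain or lose support for a wiring technology without being restarted.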

Other research efforts in the context of ambient intelligence include the GAS (Gadgetware Architectural Style) project [92], which proposes a general architecture supporting peer-to-peer networking among computationally enabled everyday objects, and the event-based iRoom Operating System (iROS) [93], which is mainly focused on communication among the devices in a room and does not give particular consideration to the possibility of a general and proactive interaction of the environment with the user. The one.world project [94] offers a framework for building pervasive applications in changing environments. Other researchers propose a fuzzy and distributed approach to device interaction, exploiting a special rule-based system [95]: special-purpose agents mutually collect, process and exchange information, although at the expense of higher complexity, especially in terms of adapting device operations to the agent system requirements.


Chapter 11

Conclusions and Future works

This chapter concludes the thesis and provides an overview of possible future works.

11.1 Conclusions

The work presented in this thesis should be seen as an engineering exercise rather than a work on the theoretical aspects of the Semantic Web, which are only partially tackled in the previous sections. The motivation for such a "practical" approach is to demonstrate to what extent the currently available technologies and solutions can promote the adoption of semantic functionalities in today's web applications. According to this vision, the work presented in these pages is a sort of software engineering project that, starting from the analysis of web applications which more or less explicitly deal with knowledge elaboration, exchange and storage, extracts the requirements for a semantic platform readily usable in those applications, with successful results.

These requirements clearly impose several constraints on the extent to which semantics can be applied in this context. Available solutions are to be gathered and harmonized in order to provide common middleware platforms, easily accessible through off-the-shelf technologies such as SOAP and Web Services. H-DOSE, rather than being a complete and definitive solution to this exercise, is a first attempt to answer the demand for exploitable technologies that seems to arise from the actors of the current Web.

What should be rather clear, after reading the presented work, is that the exercise can be effectively solved and semantic functionalities can be introduced on the current web. H-DOSE, in this sense, has the merit of providing a practical sample solution that allows the aforementioned integration in today's web applications such as content management systems, e-learning systems and search systems. Results, in terms of performance, are sometimes preliminary or still being gathered; however, performance is not the main goal of the platform. The main goal is to practically demonstrate that research in the Semantic Web field can be readily exploited by today's Web, although in a relatively low-power version. Performance matters too, as demonstrated by the completed and on-going experiments; however, Web practitioners are well aware of the difficulties involved in the technology maturation process, and seem to pay much more attention to integration issues than to the performance of early adoptions. Over a long time frame, performance will indeed make the difference between successful and failing solutions; at present, however, integration and ready exploitability are the main concern and the main goal to be reached for spreading Semantic Web technologies on the real Web.

The approach presented in this thesis moves exactly in this direction and does not aim at providing commercial-level solutions. It is still based on the traditional search paradigm used by current keyword-based engines: information is still seen as encoded into documents, and searches still target relevant documents instead of relevant knowledge. This paradigm greatly limits the potential of semantics adoption, both in terms of user satisfaction and of the results that can be provided. However, it has the advantage of overcoming the barriers that usually prevent the adoption of new technologies in already deployed applications.

One of the main problems of the Semantic Web is, according to the literature, the absence of a killer application. Nevertheless, as demonstrated by the Web itself, killer applications are not the only way for a technology to be adopted: permeation is another channel through which the same result can be reached. The key to permeation is ease of integration, the ability to readily use new technologies without changing what is already offered to the users. The result is a sort of silent invasion of the Web by semantic technologies. If permeation takes place, the transition between today's Web and the full Semantic Web will be nearly unnoticeable, and will lead to the materialization of the powerful theories now being developed by the research community. H-DOSE is designed with this idea of permeating the currently available solutions in mind, instead of substituting them as a killer application would. The small experiments presented in the previous chapters show that the first factor for achieving permeation, i.e., ease of integration, can be attained with the available semantic technologies; the next step is then to increase the importance of this silent invasion, favoring the final exploitation of the full-powered Semantic Web.

The approach adopted by H-DOSE, although promoting this idea of permeation, also moves along the path traced by another consideration: "for the Semantic Web, partial solutions will work and even if an agent will not be able to reach a human level of understanding, and thus will not be able to come to all the conclusions that a human might draw, the agent will still contribute to a Web far superior to the current Web".

The functionalities provided by the platform are clearly much less sophisticated than what is now being investigated by SW research initiatives. However, it seems rather unlikely that sophisticated semantic search paradigms and solutions will be able to spread across the Web before simpler semantics are adopted, except in very specific domains where solutions and agreements on technological issues can be defined, as in the Multi-Agent system community.

Moreover, even where the full power of logic and inference can be applied, these technologies often prove too rigid to deal with real-world applications in which uncertainty and contradictions exist. To tackle this issue, several research efforts attempt to define a common framework for logical reasoning in the presence of contradictory and inconsistent environments. However, these solutions, although effective as research results, still appear too preliminary to be successfully engineered into off-the-shelf products.

To conclude this section, some philosophical issues should be addressed, as they still agitate the sea of Semantic Web researchers and will probably characterize the various phases of SW exploitation. The ontology-based modeling of real-world entities and situations, and of their relations, seems in fact to be too rigid and too distant from the way humans approach the representation of the same knowledge. A growing group of skeptics, working either inside or outside the SW initiative, is now involved in a deep discussion of the foundational technologies on which the Semantic Web is, or will be, built. The underlying idea is that representing human knowledge, which is intrinsically fuzzy, uncertain and informal, with a rigid, formal model is probably not a good solution, although at present it seems more feasible than the alternatives.

Moving from this critical current inside the SW, the idea of using "Conceptual Spaces" [96] instead of ontologies is, for example, appealing, since the representation introduced by Peter Gardenfors allows real objects and situations to be modeled in a much more natural way. The notion of conceptual space is based on the so-called "quality dimensions". According to Gardenfors' definition, quality dimensions are the mechanisms by which the "qualities" of objects are evaluated; they correspond to the ways different stimuli are judged to be similar or different. Introductory examples of quality dimensions include temperature, weight, brightness, pitch and the three ordinary spatial dimensions: height, width and depth.

Dimensions are seen as the building blocks of the conceptual level. Without going into too much detail (see [96] for further explanations), a conceptual space is defined as a class of quality dimensions D1,...,Dn, and a point in a conceptual space is represented as a vector v = ⟨d1,...,dn⟩ with one index for each dimension. Each dimension is endowed with a certain topological or metrical structure. As an example, the weight dimension is isomorphic to the half-line of non-negative numbers, while the "taste" quality can be represented in a 4-dimensional space whose components are generated by four different types of sensors: saline, sour, sweet and bitter.

A natural concept in a conceptual space is then defined as a convex region of the space. From this assumption several properties can be derived which justify the adoption of such a theory as a semantic representation of the real world. However, one main factor still prevents a rapid development of applications of conceptual spaces: the lack of knowledge about the relevant quality dimensions. It is almost only for perceptual dimensions that psychophysical research has succeeded in identifying the underlying geometrical and topological structures; for example, we have only a confused understanding of how we perceive and conceptualize things according to their shapes.
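
The definitions above can be restated compactly in LaTeX notation. The convexity criterion in the second line follows Gardenfors' betweenness idea; writing it as a convex combination assumes the dimensions carry a linear structure, which is a simplification of the general theory:

    S = D_1 \times \dots \times D_n, \qquad
    v = \langle d_1, \dots, d_n \rangle \in S
    \\
    C \subseteq S \ \text{is a natural concept} \iff
    \forall\, x, y \in C,\ \forall\, t \in [0,1]:\ t\,x + (1-t)\,y \in C

Intuitively: anything "between" two instances of a concept (e.g., a taste between two tastes judged sweet) is itself an instance of that concept.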

In addition to the criticisms about the representation of concepts, the idea of ontologies being developed by every actor playing a role on the web appears somewhat visionary. Redundancy, in fact, is likely to diverge even with very few ontologies, as can already be seen for the available knowledge models, and the task of merging, linking and sharing the enormous amount of differently modeled knowledge will soon become infeasible. If it is already difficult to reach ontological agreement when only a few people are involved, on a worldwide scale this is clearly impossible. SW advocates usually counter these assertions by citing the current Web as a successful example of this self-organization approach. Nevertheless, it must be said that the current web is actually a completely non-homogeneous repository of information, where the only standard agreements concern the format of published data and not its meaning or metadata. This characteristic is actually the key to the Web's success but, at the same time, it has imposed severe problems of data exchange between organizations and between software applications, as demonstrated by the several initiatives aimed at solving this problem, XML being one example. Whenever the exchanged data must also be understood, especially by machines, the wild reality of the web proves too wild to allow effective knowledge manipulation. Agreements can certainly be reached on formats, as happened for today's Web (RDF/S and OWL are examples), but reaching the same agreement on knowledge modeling seems rather visionary. Clearly, in small, specific domains effective solutions can be found, but the point is the scalability of this approach rather than its feasibility in "controlled" scenarios.

Solutions will probably involve some "bottom-up" approach to the problem, mimicking what humans do in their everyday interactions. It is likely that the final solution will include some commonly agreed general model for basic facts (almost all humans can recognize other humans without being specifically instructed, for example) and some domain-specific knowledge defined autonomously by people and shared between interacting parties, as happens, for example, when two persons of different nationalities meet and must find a common knowledge base on which to build their subsequent interaction. Independently of these future evolutions, which are hard to predict, it is now clear that today's "top-down" semantics can be fully exploited only in limited domains where agreements on meaning can be reached and reasoning can deal with inconsistencies, uncertainty and contradictions. A whole Semantic Web, although stimulating as a vision, still seems far from being reached, and research efforts are still needed to walk in this exciting direction.

11.2 Future Works

The work presented in this thesis is not an isolated research effort, confined to the years of research involved. Instead, it is well integrated into the more general panorama of research taking place in the e-Lite research group at Politecnico di Torino. As part of a collaborative environment, the H-DOSE platform does not end with this thesis but is still an active research topic and a supported public platform for semantics integration on the Web.

At present, H-DOSE is being adopted by a rather new company, Intellisemantic s.r.l., which plans to use the semantic functionalities offered by the platform inside its business intelligence and patent discovery applications. In coordination with Intellisemantic, the author, as well as his colleagues in the e-Lite group, is working on the next version of the platform, namely "hdose v2.2", which will likely be released at the end of July 2006. This new release will introduce several improvements aimed, on one side, at supporting the simultaneous adoption of different ontologies as knowledge models for the platform and, on the other side, at better supporting the integration of platform services into already deployed applications, by means of logging, authentication and security mechanisms.

In parallel with the evolution of the platform carried out in collaboration with Intellisemantic, the H-DOSE platform is currently being redesigned to fully support multimedia information, at the desired level of detail, from single objects in a movie scene to entire clips. The resulting new platform, named MM-DOSE, will be provided as an open source project on sourceforge.net and will, in its first release, be offered as an alternative to the H-DOSE platform. In a long-term vision, the two platforms will be condensed into a single semantic framework providing the functionalities offered by both, preserving the service interfaces as much as possible to enable easy migration from older to newer versions.

Besides these software engineering efforts, the research group and the author are also actively working on the evolution of the domotics gateway presented in chapter 9. The effort focuses on introducing semantics-powered operations in the currently available DHG, in particular the context-aware interpretation of the behavior of households and semantics-based automatic rule generation. In the same scenario, another research effort is starting, aimed at defining an intelligent layer of agents able to translate the interaction between users and domotic homes from the current command-based paradigm to an objective-based one.

As far as theoretical research is concerned, the author and some of his colleagues are now starting to work in the context of the so-called Semantic Desktop initiative. The main objective is to migrate the technologies developed in the wider context of the Semantic Web initiative to users' computer desktops, enabling a more effective interaction between humans and machines in performing everyday tasks. In particular, the work under research involves a semantics-based cataloging system able to better organize files and directories on the user's machine, allowing for their easier retrieval and modification. This semantics-powered file search system will then be complemented by a semantics-based service composition framework, currently being developed by Alessio Bosca, a researcher working in the e-Lite group, and applied to a project developed in collaboration with Tilab.

Service composition will allow, as a vision, complex and coordinated tasks to be accomplished in a truly natural manner. For example, to send a fax a user will simply compose a pseudo-natural request such as "I want to send a fax to Julie". The machine will then provide a text editor for editing the fax content. When the fax is complete, the semantic composition service will check the availability of a fax device on the user's computer. Suppose that no such device exists: in this case the semantic composition service will, for example, look up the Web for a free fax service. Then it will convert the text of the fax into the correct format. In doing this conversion it will look up "Julie" in the user's agenda and will notice that two "Julie" entries appear. It will therefore ask the user for clarification and eventually finalize the fax sending process.

The last front on which the e-Lite group is moving concerns conceptual spaces. In this context, the author is planning to experiment with a first, very simple implementation of Peter Gardenfors' ideas, to demonstrate the feasibility of the approach. This work, differently from the others introduced above, is still in an ideation phase, in which the available knowledge and solutions are being gathered, and time for design and implementation has still to be allocated.


Bibliography

[1] The moodle e-learning system. http://moodle.org.[2] The muffin intelligent proxy. http://muffin.doit.org.

[3] CABLE: CAse based e-learning for educators.http://elite.polito.it/cable, http://cable.uhi.ac.uk.

[4] O. Lassila and R. Swick. Resource description framework RDF model andsyntax specification. World Wide Web Consortium, 1999.

[5] Deborah L. McGuinness and Frank van Harmelen. Owl web ontology lan-guage. W3C Proposed Recommendation, 2003.

[6] D. Bonino, F. Corno, L. Farinetti, A. Bosca. Ontology driven semantic search.WSEAS Transaction on Information Science and Application, 1(6):1597–1605,2004.

[7] Berners-Lee Tim, Handler James and Lassila Ora. The semantic web. TheScientific American, 5(1), 2001.

[8] I. Nonaka, H. Takeuchi. The knowledge creating company. Oxford UniversityPress, 1995.

[9] Google. http://www.google.com.[10] R. Baeza-Yates, B. Ribeiro-Neto. Modern Information retrieval. Addison-

Wesley, 1999.[11] Yahoo. http://www.yahoo.com.

[12] Altavista. http://www.altavista.com.[13] WordNet: a lexical database for english language.

http://www.cogsci.princeton.edu/ wn/.[14] The trec conference series. http://trec.nist.gov/.

[15] P. Bouquet, L. Serafini, S. Zanobini. Semantic coordination: a new approachand an application. In ISWC03 conference, Sanibel Island, Florida, USA.,pages 130–145. LNCS, Springer-Verlag, 2003.

[16] Jade. http://sharon.cselt.it/projects/jade.[17] Foundation for intelligent physical agents. http://www.fipa.org.

[18] Zeus agent toolkit. http://labs.bt.com/projects/agents/zeus.

[19] Fipa-os. http://fipa-os.sourceforge.net.[20] Apache tomcat. http://tomcat.apache.org.

178

Page 185: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

Bibliography

[21] Apache axis. http://ws.apache.org/axis/.[22] The apache jakarta project. http://jakarta.apache.org.[23] McBride B. Jena: a semantic web toolkit. IEEE Internet Computing, 6(6):55–

59, 2002.[24] Postgresql. http://www.postgresql.org.[25] The saxon api. http://saxon.sourceforge.net.[26] D. Ragget et al. HTML Tidy project. http://tidy.sourceforge.net/.[27] Jtidy. http://jtidy.sourceforge.net.[28] Jgrapht. http://jgrapht.sourceforge.net.[29] Sun jax-rpc. http://java.sun.com/webservices/jaxrpc/.[30] Bodington. http://bodington.org.[31] The uhi millennium institute. http://www.uhi.ac.uk.[32] Li J., Yu Y., Zhang L. Learning to generate semantic annotation for domain

specific sentences. In knowledge markup and semantic annotation workshopin K-CAP 2001., 2001.

[33] Ciravegna F., Dingli A., Wilks Y. Automatic semantic annotation using un-supervised information extraction and integration. In K-CAP 2003 - Work-shop on Knowledge Markup and Semantic Annotation, Sanibel Island, Florida,USA., 2003.

[34] D.Bonino, F. Corno, G. Squillero. Dynamic prediction of web requests. InCEC03 - 2003 IEEE Congress on Evolutionary Computation, Canberra, Aus-tralia, pages 2034–2041, 2003.

[35] D. Bonino, F. Corno, L. Farinetti. Dose: a distributed open semantic elab-oration platform. In ICTAI 2003, Sacramento, California, pages 580–589,2003.

[36] Asphi web site. http://www.asphi.it.[37] Holger Knublauch. An ai tool for the real world: Knowledge modeling with

protege. JavaWorld, 2003.[38] York Sure et Al. Ontoedit: Collaborative ontology development for the se-

mantic web. In 1st International Semantic Web Conference,Sardinia, Italy,2002.

[39] Gansner E. R. & North S. C. An open graph visualization system andits applications to software engineering. Software Practice and Experience,30(11):1203–1233, 1999.

[40] M.A. Storey et al. Interactive visualization to enhance ontology authoringand knowledge acquisition in protege. In Workshop on Interactive Tools forKnowledge Capture, Victoria, B.C. Canada, 2001.

[41] Ontoviz tab: Visualizing protege ontologies.http://protege.stanford.edu/plugins/ontoviz/ontoviz.html.

[42] P.W. Eklund, N. Roberts, S. P. Green. Ontorama: Browsing an rdf ontologyusing a hyperbolic-like browser. In The First International Symposium on

179

Page 186: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

Bibliography

CyberWorlds (CW2002), Theory and Practices, IEEE press, pages 405–411,2002.

[43] The suggested upper merged ontology. http://ontology.teknowledge.com/.

[44] Junit. http://www.junit.org.

[45] Drools. http://drools.org/.

[46] C.e.t.a.d. service (italy). http://www.cetad.org/ andhttp://www.domoticamica.it/.

[47] Mserv. http://www.mserv.org/.

[48] BTicino MyHome system (italian website).http://www.myhome-bticino.it/ft/.

[49] KAON ontology and semantic web infrastructure.http://kaon.semanticweb.org.

[50] N.F. Noy et al. Creating semantic web contents with protge-2000. IEEEIntelligent Systems, 2(16):60–71, 2001.

[51] J. Kahan et al. Annotea: An open RDF infrastructure for shared web anno-tations. In WWW10 - International Conference, Hong Kong, 2001.

[52] S. Staab, A. Maedche, and S. Handshuh. An annotation framework for thesemantic web. In 9th international World Wide Web conference, Amsterdam,the Netherlands, pages 95–103, 2000.

[53] P. Kogut, W. Holmes. AeroDAML: Applying information extraction to gen-erate daml annotations from web pages. In K-CAP 2001 - Workshop onKnowledge markup and semantic annotation, Victoria, BC, Canada, 2001.

[54] S. Dill et al. Semtag and seeker: Bootstrapping the semantic web via au-tomated semantic annotation. In Twelfth international conference on WorldWide Web, Budapest, Hungary, pages 178–186, 2003.

[55] D. Maynard, M. Yankova, N. Aswani, H. Cunningham. Automatic creationand monitoring of semantic metadata in a dynamic knowledge portal. InThe Eleventh International Conference on Artificial Intelligence: Methodol-ogy, Systems, Applications - Semantic Web Challenges (AIMSA 2004), Varna,Bulgaria, 2004.

[56] The mondeca semantic portal. http://www.mondeca.com/.

[57] J. Carbonell at al. Translingual information retrieval: A comparative evalu-ation. In Fifteenth International Joint Conference on Artificial Intelligence,1997.

[58] Eurowordnet: A multilingual database with lexical semantic networks. KluwerAcademic Publishers, Dordrecht, 1998.

[59] D.W. Oard. Alternative approaches for cross-language text retrieval. In AAAISymposium on Cross-Language Text and Speech Retrieval, 1997.

[60] M. Agnesund. Supporting multilinguality in ontologies for lexical semanticsan object oriented approach. M.S. Thesis, 1997.

180

Page 187: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

Bibliography

[61] L. Bentivogli, E. Pianta, F. Pianesi. Coping with lexical gaps when buildingaligned multilingual wordnets, 2000.

[62] J. Gilarranz, J. Gonzalo, F. Verdejo. Language-independent text retrievalwith the eurowordnet multilingual semantic database. In Second Workshopon Multilinguality in the Software Industry: The AI Contribution, 1997.

[63] M. Vargas-Vera et al. Mnm: Ontology driven semi-automatic and automaticsupport for semantic markup. In 13th International Conference on KnowledgeEngineering and Management (EKAW 2002), 2002.

[64] M. Vargas-Vera et al. Knowledge extraction by using an ontology-based anno-tation tool. In K-CAP 2001 - Workshop on Knowledge markup and semanticannotation, Victoria, BC, Canada, 2001.

[65] P. Resnik. Semantic similarity in a taxonomy: an information-based measureand its application to problems of ambiguity in natural language. Journal ofArtificial Intelligence Research, 11:95–130, 1999.

[66] E. Agirre and G. Rigau. Word sense disambiguation using conceptual density.In Coling-ACL’96 Workshop, Copenhagen, Denmark, pages 16–22, 1996.

[67] Rocha C., Schwabe D., Poggi de Aragao M. A hybrid approach for searching inthe semantic web. In WWW2004 conference, New York, NY, pages 374–383,2004.

[68] Guha R., McCool R., and Miller E. Semantic search. In WWW2003, Bu-dapest, Hungary, pages 700–709, 2003.

[69] Stojanovic N., Studer R., Stojanovic L. An approach for the ranking of queryresults in the semantic web. In ISWC2003, Sanibel Island, FL, pages 500–516,2003.

[70] Davies J., Weeks R., Krohn U. Quizrdf: Search technology for the semanticweb. In WWW2002 workshop on RDF & Semantic Web Applications, Hawaii,USA, 2002.

[71] Heflin J. and Hendler J. Searching the web with shoe. In Artificial Intelligencefor Web Search. Papers from the AAAI Workshop, pages 35–40, 2000.

[72] Ben Shneiderman. Treemaps for space-constrained visualization of hierarchies.ACM Transactions on Graphics (TOG), 11(1):92–99, 1992.

[73] Jambalaya.http://www.thechiselgroup.org/chisel/projects/jambalaya/jambalaya.html.

[74] Tgviztab, a touchgraph visualization tab for protege 2000.http://www.ecs.soton.ac.uk/ha/TGVizTab/TGVizTab.htm.

[75] Touchgraph library. http://touchgraph.sourceforge.net/.

[76] ezowl: Visual owl (web ontology language) editor for protege.http://iweb.etri.re.kr/ezowl/index.html.

[77] Ontoedit.http://www.ontoknowledge.org/tools/ontoedit.shtml.

181

Page 188: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

Bibliography

[78] Isaviz: A visual authoring tool for rdf.http://www.w3.org/2001/11/IsaViz/.

[79] Ontorama. http://www.ontorama.com/.[80] EIB/KNX. http://www.eiba.com/en/eiba/overview.html.[81] LonWorks. http://www.echelon.com/.[82] X10 protocol. http://www.x10.com/home2.html.[83] CEBus. http://www.cebus.org/.[84] BatiBUS. http://www.batibus.com/anglais/gen/index.htm.[85] Konnex. http://www.konnex.org/.[86] Homeplug. http://www.homeplug.com/en/index.asp.[87] Bluetooth. http://www.bluetooth.org/.[88] ETS,EIBA software for KNX/EIB systems.

http://www.eiba.com/en/software/index.html.[89] Open source eib control. http://sourceforge.net/projects/eibcontrol/.[90] Fellbaum K., Hampicke M. Integration of smart home components into exist-

ing residences. In Proceedings of AAATE, Dusseldorf, 1999.[91] Osgi alliance. http://www.osgi.org/.[92] Kameas A. et al. An architecture that treats everyday objects as communi-

cating tangible components. In Proceedings of the First IEEE InternationalConference on Pervasive Computing and Communications (PerCom 2003),pages 115–122, 2003.

[93] Johanson B., Fox A., Winograd T. The interactive workspaces project: experi-ences with ubiquitous computing rooms. IEEE Pervasive Computing, 1(2):67–74, June 2002.

[94] Grimm R. et al. System support for pervasive applications. ACM Transactionson Computer Systems, 22(4):421–486, 2004.

[95] Acampora G., Loia V. Fuzzy control interoperability for adaptive domoticframework. In the Second IEEE International Conference on Industrial Infor-matics, pages 184–189, 2004.

[96] Peter Gardenfors. Conceptual Spaces: The Geometry of Though. MIT press,2000.

[97] F. Forno, L. Farinetti, S. Mehan. Can data mining techniques ease the se-mantic tagging burden? In VLDB2003 - First International Workshop onSemantic Web and Databases, Berlin, Germany, 2003.

[98] Alexander Maedche. Ontology Learning for the Semantic Web, volume 665.The Kluwer International Series in Engineering and Computer Science, 2001?

[99] T. Back. Selective pressure in evolutionary algorithms: A characterization ofselection mechanism. In First IEEE Conference on Evolutionary Computation,pages 57–62, 1994.

[100] H.P. Schwefel. Natural evolution and collective optimum-seeking. Computa-tional systems analysis: Topics and trends, pages 5–14, 1992.

182

Page 189: Donutsdocshare04.docshare.tips/files/1740/17401449.pdf · Acknowledgements Many people deserve a very grateful acknowledgment for their role in supporting me duringtheselongandexciting3years

Bibliography

[101] J. Heitkotter, D. Beasley. The hitch-hiker’s guide to evolutionary computa-tion, 2000.

[102] Jurgen Branke. Evolutionary optimization in dynamic environments. KluwerAcademic Publishers, 2001.

[103] The amaya W3C editor/browser. http://www.w3.org/Amaya/.[104] On-to-knowledge-project. http://www.ontoknowledge.org.[105] Autonomic computing: IBM perspective on the state of information technol-

ogy. International Business Machines corporation,http://www.research.ibm.com/autonomic/manifesto/, 2001.

[106] OWL-S 1.0 release. http://www.daml.org/services/owl-s/1.0/.[107] L. Denoue, L. Vignollet. An annotation tool for web browsers and its appli-

cations to information retrieval. In RIAO 2000, Paris, France, 2000.[108] F. Bellifemmine, A. Poggi and G. Rimassa. Jade Programmer’s guide,

http://sharon.cselt.it/projects/jade edition, 2003.[109] IBM research projects on autonomic computing.

http://www.research.ibm.com/autonomic/research/projects.html.[110] Craig Boutilier et al. Cooperative negotiation in autonomic systems using

incremental utility elicitation. In 19th Conference on Uncertainty in ArtificialIntelligence, Acapulco, Mexico, 2003.

[111] Roy Sterrit, Dave Bustard. Towards an autonomic computing environment.In IEEE Workshop on Autonomic Computing Principles and Architectures(AUCOPA 2003), Banff, Alberta, Canada, 2003.

[112] J. Appavoo, K. Hui, C.A.N. Soules et Al. Enabling autonomic behavior insystems software with hot swapping. IBM systems journal, 42(1), 2003.

[113] C. H. Crawford, A. Dan. eModel: Addressing the need for a flexible modelingframework in autonomic computing. In 10th IEEE International Symposiumon Modeling, Analysis and Simulation of computer and Telecommunicationssystems, Forth Worth, Texas, 2003.

[114] Rajan Kumar, Prathiba V. Rao. A model for self-managing Java servers to make your life easier. http://www-106.ibm.com/developerworks/library/ac-alltimeserver/.

[115] ETTK: Emerging Technologies Toolkit. http://www.alphaworks.ibm.com/tech/ettk/.

[116] M. Koivunen, R. Swick. Metadata based annotation infrastructure offers flexibility and extensibility for collaborative applications and beyond. In K-CAP 2001 - Workshop on Knowledge Markup and Semantic Annotation, Victoria, BC, Canada, 2001.

[117] S. Handschuh, S. Staab, A. Maedche. CREAM: Creating relational metadata with a component-based, ontology-driven annotation framework. In ACM K-CAP 2001 - First International Conference on Knowledge Capture, Victoria, BC, Canada, 2001.


[118] S. Staab, A. Maedche, S. Handschuh. An annotation framework for the semantic web. In First Workshop on Multimedia Annotation, Tokyo, Japan, 2001.

[119] L. Gangmin, V. Uren, E. Motta. ClaiMaker: weaving a semantic web of research papers. In ISWC 2002 - First International Semantic Web Conference, Sardinia, Italy, 2002.

[120] S. Handschuh, S. Staab, F. Ciravegna. S-CREAM: Semiautomatic creation of metadata. In EKAW02 - 13th International Conference on Knowledge Engineering and Knowledge Management, Siguenza, Spain, 2002.

[121] N. Collier, K. Takeuchi, K. Tsuji. The PIA project: learning to semantically annotate texts from an ontology and XML-instance data. In SWWS 2001 - International Semantic Web Working Symposium, Stanford University, CA, USA, 2001.

[122] G. Salton. Developments in automatic text retrieval. Science, 253:974–980, 1991.

[123] W3C Web Services Activity. http://www.w3.org/2002/ws/.

[124] David D. Lewis and Karen Sparck Jones. Natural language processing for information retrieval. Communications of the ACM, 39(1):92–101, 1996.

[125] Porter M.F. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.

[126] Carr L., Bechhofer S., Goble C., Hall W. Conceptual linking: Ontology-based open hypermedia. In WWW10 - 10th International World Wide Web Conference, Hong Kong, China, pages 334–342, 2001.

[127] XPath Explorer. http://www.purpletech.com/xpe/index.jsp.

[128] Guha R., McCool R., Miller E. Semantic search. In WWW2003 - 12th International World Wide Web Conference, Budapest, Hungary, 2003.

[129] S. Yu, D. Cai, J.R. Wen, W.Y. Ma. Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In WWW2003 - 12th International World Wide Web Conference, Budapest, Hungary, pages 11–18, 2003.

[130] Gupta S., Kaiser G., Neistadt D., Grimm P. DOM-based content extraction of HTML documents. In WWW2003 - 12th International World Wide Web Conference, Budapest, Hungary, pages 207–214, 2003.

[131] Chen Y., W.Y. Ma, H.J. Zhang. Detecting web page structure for adaptive viewing on small form factor devices. In WWW2003 - 12th International World Wide Web Conference, Budapest, Hungary, pages 225–233, 2003.

[132] JGraph. http://www.jgraph.com.

[133] M. Agnesund. Representing culture-specific knowledge in a multilingual ontology. In IJCAI-97 Workshop on Ontologies and Multilingual NLP, 1997.

[134] G. Salton. Developments in automatic text retrieval. Science, 253:974–980, 1991.


[135] R. Ghani, A.E. Fano. Using text mining to infer semantic attributes for retail data mining. In ICDM 2002: 2002 IEEE International Conference on Data Mining, pages 195–202, 2002.

[136] D. Bonino, F. Corno, L. Farinetti, A. Ferrato. Multilingual semantic elaboration in the DOSE platform. In SAC 2004, ACM Symposium on Applied Computing, Nicosia, Cyprus, pages 1642–1646, 2004.

[137] P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In 14th International Joint Conference on Artificial Intelligence (IJCAI'95), Montreal, Canada, pages 448–453, 1995.

[138] M. Vargas-Vera et al. MnM: Ontology driven tool for semantic markup. In European Conference on Artificial Intelligence (ECAI 2002), Workshop on Semantic Authoring, Annotation & Knowledge Markup (SAAKM 2002), Lyon, France, 2002.

[139] IST Advisory Group (K. Ducatel et al.). Scenarios for ambient intelligence in 2010. Final report, 2001.

[140] ePerSpace, European IST project. http://www.ist-eperspace.org/.

[141] COGAIN, European IST NoE project. http://www.cogain.org/.

[142] European Home Systems Association (EHSA). http://www.ehsa.com/.

[143] Home Audio Video interoperability (HAVi). http://www.havi.org/.

[144] Microsoft AURA project. http://aura.research.microsoft.com/.

[145] MIT Oxygen project. http://www.oxygen.lcs.mit.edu/Overview.html.

[146] CNR NICHE project. http://niche.isti.cnr.it/.

[147] Jini network technology. http://www.sun.com/software/jini/.

[148] Universal Plug and Play (UPnP). http://www.upnp.org/.

[149] D. Bonino, F. Corno, L. Farinetti. Domain specific searches using conceptual spectra. In 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, Florida, 2004.

[150] D. Bonino, A. Bosca, F. Corno. An agent based autonomic semantic platform. In First IEEE International Conference on Autonomic Computing, New York, USA, 2004.

[151] The Apache Jakarta project. http://jakarta.apache.org.

[152] The Protégé ontology editor and knowledge acquisition system. http://protege.stanford.edu/.

[153] The FRODO RDFSViz tool. http://www.dfki.uni-kl.de/frodo/RDFSViz/.

[154] RDFAuthor. http://rdfweb.org/people/damian/RDFAuthor/.

[155] Grigoris Antoniou, Frank van Harmelen. A Semantic Web Primer. MIT Press, 2004.


Appendix A

Publications

1. An Evolutionary Approach to Web Request Prediction, D. Bonino, F. Corno, G. Squillero – poster at WWW2003 - The Twelfth International World Wide Web Conference, 20-24 May 2003, Budapest, Hungary - (International Conference)

2. A Real-Time Evolutionary Algorithm for Web Prediction, D. Bonino, F. Corno, G. Squillero – WI-2003, The 2003 IEEE/WIC International Conference on Web Intelligence, October 2003, Halifax, Canada - (International Conference)

3. Semantic annotation and search at the document substructure level, D. Bonino, F. Corno, L. Farinetti – poster at ISWC2003 - 2nd International Semantic Web Conference, Florida (USA), October 2003 - (Poster)

4. DOSE: a Distributed Open Semantic Elaboration Platform, D. Bonino, F. Corno, L. Farinetti – ICTAI 2003, The 15th IEEE International Conference on Tools with Artificial Intelligence, November 3-5, 2003, Sacramento, California - (International Conference)

5. Dynamic Prediction of Web Requests, D. Bonino, F. Corno, G. Squillero – CEC03: 2003 IEEE Congress on Evolutionary Computation, Canberra, Australia, 8th - 12th December 2003, pp. 2034-2041 - (International Conference)

6. Multilingual Semantic Elaboration in the DOSE platform, D. Bonino, F. Corno, L. Farinetti, A. Ferrato – SAC 2004, ACM Symposium on Applied Computing, March 14-17, 2004, Nicosia, Cyprus - (International Conference)

7. An Agent Based Autonomic Semantic Platform, D. Bonino, A. Bosca, F. Corno – ICAC2004, First International Conference on Autonomic Computing (IEEE), New York, May 17-18, 2004 - (International Conference)


8. Dynamic Optimization of Semantic Annotation Relevance, D. Bonino, F. Corno, G. Squillero – CEC2004, Congress on Evolutionary Computation, Portland (Oregon), June 20-23, 2004 - (International Conference)

9. Domain Specific Searches using Conceptual Spectra, D. Bonino, F. Corno, L. Farinetti – ICTAI 2004, the IEEE International Conference on Tools with Artificial Intelligence, 15-17 Nov 2004, Boca Raton, Florida, USA, pp. 680-687 - (International Conference)

10. Ontology Driven Semantic Search, D. Bonino, F. Corno, L. Farinetti, A. Bosca – WSEAS Conference ICAI 2004, Venice, Italy, 2004 - (International Conference)

11. Ontology Driven Semantic Search, D. Bonino, F. Corno, L. Farinetti, A. Bosca – WSEAS Transactions on Information Science and Applications, Issue 6, Volume 1, December 2004, pp. 1597-1605 - (International Journal)

12. Automatic learning of text-to-concept mappings exploiting WordNet-like lexical networks, D. Bonino, F. Corno, F. Pescarmona – 20th Annual ACM Symposium on Applied Computing, Santa Fe, New Mexico, March 13-17, 2005 - (International Conference)

13. H-DOSE: an Holistic Distributed Open Semantic Elaboration Platform, D. Bonino, A. Bosca, F. Corno, L. Farinetti, F. Pescarmona – SWAP2004: 1st Italian Semantic Web Workshop, 10th December 2004, Ancona, Italy - (National Conference)

14. Domotic House Gateway, P. Pellegrino, D. Bonino, F. Corno – SAC 2006, ACM Symposium on Applied Computing, April 23-27, 2006, Dijon, France - (International Conference)

15. OntoSphere: more than a 3D ontology visualization tool, A. Bosca, D. Bonino, P. Pellegrino – SWAP 2005 - Semantic Web Applications and Perspectives, 2nd Italian Semantic Web Workshop, Trento, Faculty of Economics, 14-15-16 December 2005 - (National Conference)
